[Gambas-user] scraping html

Fabien Bodard gambas.fr at ...626...
Sat Jul 13 22:52:13 CEST 2013


As Bruce and Randall said, there is no perfect solution if the structure
of the parsed page changes.

So you need some control points before parsing to be sure that you get a
good result at the end. If a control shows a structure change, then
inform the user that the parser needs to be reviewed.
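
As a rough illustration of such a control point, here is a minimal Python
sketch (Python only because it comes up elsewhere in this thread). The
marker strings, the minimum counts and the name check_structure are
assumptions taken from the sample markup Shane posted; they are not part of
any library. It only checks that the markers the parser relies on are still
present before any extraction is attempted.

```python
# Hypothetical control point: verify the page still looks the way the
# parser expects before trying to extract anything from it.

# Markers the parser depends on, with the minimum number of occurrences
# (assumed from the sample markup in this thread).
EXPECTED_MARKERS = {
    '<div class="result">': 1,   # at least one result block
    '<div class="col">': 3,      # at least three columns
}


def check_structure(html, expected=EXPECTED_MARKERS):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for marker, minimum in expected.items():
        found = html.count(marker)
        if found < minimum:
            problems.append(
                "expected at least %d of %r, found %d" % (minimum, marker, found)
            )
    return problems


good_page = (
    '<div class="result">'
    '<div class="col">Text I Want</div>'
    '<div class="col">And Some More i Want</div>'
    '<div class="col">And The last bit</div>'
    '</div>'
)
changed_page = '<table class="result"><tr><td>Text I Want</td></tr></table>'

print(check_structure(good_page))     # empty list: no problems reported
for problem in check_structure(changed_page):
    print("Parser needs review:", problem)
```

The check is deliberately crude: it counts literal strings, so a change in
attribute order or quoting would also be reported as a structure change.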

Now, there are two kinds of parsing.

Manual: using InStr and other common text manipulation tools. I use it
when I need to find one piece of data on one line, because it is quicker
than a DOM tool, which has to parse the whole HTML structure first.
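
To make the manual approach concrete, here is a small Python sketch of the
same idea, with str.find playing the role of InStr. The helper name
text_between and the sample line are assumptions for illustration only.

```python
# Manual extraction with plain string searches: find a start marker,
# find the end marker after it, and slice out what lies between.

def text_between(source, start_marker, end_marker, start_at=0):
    """Return the text between the two markers, or None if either is missing."""
    begin = source.find(start_marker, start_at)
    if begin == -1:
        return None
    begin += len(start_marker)
    end = source.find(end_marker, begin)
    if end == -1:
        return None
    return source[begin:end].strip()


line = '<div class="col">Text I Want</div>'
print(text_between(line, '<div class="col">', '</div>'))  # prints: Text I Want
```

Returning None when a marker is missing gives the caller a natural place to
notice that the page no longer matches what the code expects.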

But if you need many pieces of information from many places in the web
page, the DOM is better and tolerates more changes in the page before you
need to change the parser structure, simply because we use tags and
attributes to do the searches.
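
A sketch of searching by tag and attribute, applied to the sample markup
Shane posted. This is not gb.xml.html: the Python standard library has no
HTML DOM, so the event-based html.parser.HTMLParser stands in for one here.
The class name ColumnCollector is hypothetical, and the class attribute is
matched exactly (a value such as "col wide" would not match).

```python
from html.parser import HTMLParser


class ColumnCollector(HTMLParser):
    """Collect the text of each <div class="col"> inside <div class="result">."""

    def __init__(self):
        super().__init__()
        self.stack = []      # class attribute of each currently open <div>
        self.columns = []    # extracted column texts
        self._buffer = None  # text pieces of the column being read, if any

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        css_class = dict(attrs).get("class") or ""
        # A column only counts when the nearest enclosing <div> is the
        # result block.
        if css_class == "col" and self.stack and self.stack[-1] == "result":
            self._buffer = []
        self.stack.append(css_class)

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "div" or not self.stack:
            return
        closed = self.stack.pop()
        if closed == "col" and self._buffer is not None:
            self.columns.append("".join(self._buffer).strip())
            self._buffer = None


page = """
<div class="result">
<div class="col">Text I Want</div>
<div class="col">
And Some More i Want
</div>
<div class="col">
And The last bit
</div>
</div>
"""

collector = ColumnCollector()
collector.feed(page)
print(collector.columns)
# ['Text I Want', 'And Some More i Want', 'And The last bit']
```

Because the search keys on the tag name and the class attribute, extra
whitespace, line breaks or unrelated markup around the result block do not
affect what is extracted.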

I will send you an example tomorrow.

I too have done a lot of data scraping over the past few years. I think
picking the right tool for the job will ease your development. I have many
Python and Java tools and have played with the Gambas parser. But I have
found very little that matches the ease of development I find with Python
and Selenium. Selenium is not just a scraping tool. In fact, it wasn't meant
for that at all. It is a browser automation tool and website test
framework. With it, I've had little problem dealing with typical changes in
content. It is also great for comparing the page code sent to different
browsers. BeautifulSoup is great for well-structured pages, but once that
structure is lost it often fails. XMLlib and HTMLlib and other Python
modules just don't seem to match the productivity I find with Selenium.

It all comes down to how general you need your solution to be. Is this a
one-off scrape or something that you intend to do over a long period of
time? Do you know Python or Java, or can you learn one quickly? Must your
solution scale to large projects, or just this one use?

So answer these questions and then review the options. If it is something
the Gambas parser can handle, then use it. Or, if the page is very stable
and well structured, write a parser. A basic parser is not difficult to
write. Search the internet for Jack Crenshaw's article on building a simple
parser. However, if the page is complex and this is a long-term project,
you may want to consider a more powerful and stable solution.

As Bruce said, put in lots of tests along the way because some pages do
change constantly. Having a reporting system that allows you to locate such
changes is very helpful in a high production environment.
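
One possible shape for such a reporting step, sketched in Python with the
standard library only. Everything here is an assumption for illustration:
the idea of hashing the page's tag skeleton, the names TagSkeleton,
fingerprint and report_if_changed, and the example.com URL.

```python
import hashlib
import logging
from html.parser import HTMLParser

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger("scraper")


class TagSkeleton(HTMLParser):
    """Record only the opening tags and their class attributes, ignoring text."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        self.parts.append("%s.%s" % (tag, dict(attrs).get("class") or ""))


def fingerprint(html):
    """Hash of the page's tag skeleton; unchanged when only the text changes."""
    skeleton = TagSkeleton()
    skeleton.feed(html)
    return hashlib.sha256("/".join(skeleton.parts).encode("utf-8")).hexdigest()


def report_if_changed(url, html, baseline):
    """Log a warning and return True when the skeleton differs from the baseline."""
    if fingerprint(html) != baseline:
        log.warning("structure changed for %s: parser needs review", url)
        return True
    log.info("structure unchanged for %s", url)
    return False


old = '<div class="result"><div class="col">Monday</div></div>'
new_text = '<div class="result"><div class="col">Tuesday</div></div>'
new_layout = '<table class="result"><tr><td>Tuesday</td></tr></table>'

baseline = fingerprint(old)
report_if_changed("http://example.com/page", new_text, baseline)    # logs INFO
report_if_changed("http://example.com/page", new_layout, baseline)  # logs WARNING
```

A whole-page fingerprint like this will also fire when the number of result
rows changes, so on pages with variable-length lists it would need to be
restricted to the part of the skeleton the parser depends on.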

Hope this helps




On Sat, Jul 13, 2013 at 2:33 AM, Fabien Bodard <gambas.fr at ...626...> wrote:

> Send me an example URL for the page
> On 13 Jul 2013 10:52, "Shane" <shanep1967 at ...169...> wrote:
>
> > On 13/07/13 18:33, Fabien Bodard wrote:
> > > There is a parsing tool in Gambas for HTML:
> > >
> > > gb.xml.html
> > >
> > > It's our own HTML DOM parser. It allows generating well-formatted HTML5
> > > pages and/or parsing existing HTML pages.
> > >
> > > It's one of the fastest parsers I know.
> > >
> > > Have a look at it, and if you need, I can show you some examples.
> > > On 13 Jul 2013 08:20, "Caveat" <Gambas at ...1950...> wrote:
> > >
> > >> You need to use the right tool for the job.  I find the Python tool
> > >> BeautifulSoup one of the best for parsing and extracting data from web
> > >> pages.
> > >>
> > >> http://www.crummy.com/software/BeautifulSoup/
> > >>
> > >> Kind regards,
> > >> Caveat
> > >>
> > >> On 12/07/13 09:01, Shane wrote:
> > >>> Hi everyone
> > >>>
> > >>> I am trying to get some info from a web page in the format of
> > >>>
> > >>> <div class="result">
> > >>> <div class="col">Text I Want</div>
> > >>> <div class="col">
> > >>> And Some More i Want
> > >>> </div>
> > >>> <div class="col">
> > >>> And The last bit
> > >>> </div>
> > >>> </div>
> > >>>
> > >>> What would be the best way to go about this? I have tried a few ways,
> > >>> but I feel there must be an easy way to do this.
> > >>>
> > >>> thanks shane
> > >>>
> > >>> ------------------------------------------------------------------------------
> > >>> See everything from the browser to the database with AppDynamics
> > >>> Get end-to-end visibility with application monitoring from AppDynamics
> > >>> Isolate bottlenecks and diagnose root cause in seconds.
> > >>> Start your free trial of AppDynamics Pro today!
> > >>> http://pubads.g.doubleclick.net/gampad/clk?id=48808831&iu=/4140/ostg.clktrk
> > >>> _______________________________________________
> > >>> Gambas-user mailing list
> > >>> Gambas-user at lists.sourceforge.net
> > >>> https://lists.sourceforge.net/lists/listinfo/gambas-user
> > >>>
> > >
> > Thanks everyone for the replies,
> > and I would like some examples, thanks Fabien.



--
If you ask me if it can be done. The answer is YES, it can always be done.
The correct questions however are... What will it cost, and how long will
it take?


