[Gambas-user] scraping html

Bruce bbruen at ...2308...
Sat Jul 13 06:34:45 CEST 2013


On Fri, 2013-07-12 at 17:01 +1000, Shane wrote:
> Hi everyone
> 
> I am trying to get some info from a web page in the format of
> 
> <div class="result">
> <div class="col">Text I Want</div>
> <div class="col">
> And Some More i Want
> </div>
> <div class="col">
> And The last bit
> </div>
> </div>
> 
> What would be the best way to go about this? I have tried a few ways,
> but I feel there must be an easy way to do this.
> 
> thanks shane

Shane,

Quite frankly, there is no one easy way to go about this. It depends on
how well structured the data in the web page is, and on how likely it is
that the page will change format.  We scrape upwards of 300 pages daily
and have some fairly mature ways of approaching it.  Here are some of
the techniques we use.

1) Always try to find an XML feed equivalent of the page data.
Sometimes this is available as a raw feed, sometimes as a hidden feed
behind an active page.  Once you have the feed URL and either find or
write a schema, parsing the XML is relatively trivial.
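
To make 1) concrete, here is a minimal sketch in Gambas (in a module of
a command-line project).  The feed URL is made up, and for brevity it
pulls the feed down with wget via Shell and fishes one element out with
plain string functions; for real work you would parse the feed properly
(the gb.xml component) and validate it against your schema.

  ' Fetch a (hypothetical) XML feed and pull the first <title> out of it.
  Public Sub Main()

    Dim sXml As String
    Dim iStart As Integer
    Dim iEnd As Integer

    ' Download the raw feed (the URL is invented for the example)
    Shell "wget -q -O - 'http://example.com/results.xml'" Wait To sXml

    ' Crude extraction of the first <title> element
    iStart = InStr(sXml, "<title>")
    If iStart = 0 Then Error.Raise("No <title> element - has the feed changed?")
    iStart += Len("<title>")
    iEnd = InStr(sXml, "</title>", iStart)
    Print Trim(Mid$(sXml, iStart, iEnd - iStart))

  End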

2) If the page is well structured and relatively stable, then the next
best approach is probably to follow Randall's suggestion and write an
HTML DOM parser.  But if you go down this route, develop a "meta"
schema for your parser so you can accommodate changes to the page
format and raw HTML with minimal pain.
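
To show what I mean by a "meta" schema, here is a rough Gambas sketch
built around Shane's sample markup.  InStr/Mid stand in for the DOM
parser (obviously not what you would use in production); the point is
that the mapping from field name to position on the page lives in data,
so a layout change means editing the schema, not the parsing code.

  Public Sub Main()

    Dim sHtml As String
    Dim cSchema As New Collection
    Dim vIndex As Variant

    ' Shane's sample markup, collapsed onto one line
    sHtml = "<div class=\"result\"><div class=\"col\">Text I Want</div>"
    sHtml &= "<div class=\"col\">And Some More i Want</div>"
    sHtml &= "<div class=\"col\">And The last bit</div></div>"

    ' The meta schema: field name -> which "col" div holds it.
    ' If the page layout changes, only these lines need editing.
    cSchema["FirstBit"] = 1
    cSchema["SecondBit"] = 2
    cSchema["LastBit"] = 3

    For Each vIndex In cSchema
      Print cSchema.Key; " = "; GetCol(sHtml, vIndex)
    Next

  End

  ' Return the text of the Nth <div class="col"> element (1-based).
  Private Function GetCol(sHtml As String, iWanted As Integer) As String

    Dim iPos As Integer
    Dim iCount As Integer
    Dim iEnd As Integer

    iPos = InStr(sHtml, "<div class=\"col\">")
    While iPos > 0
      Inc iCount
      iPos += Len("<div class=\"col\">")
      If iCount = iWanted Then
        iEnd = InStr(sHtml, "</div>", iPos)
        Return Trim(Mid$(sHtml, iPos, iEnd - iPos))
      Endif
      iPos = InStr(sHtml, "<div class=\"col\">", iPos)
    Wend

    Error.Raise("Column " & iWanted & " not found - page format changed?")

  End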

3) Sometimes we have found that it is better to ignore the HTML
completely and process the page text only.  This is particularly true
for pages that use large, well-formed "tables" of data that are unlikely
to change in layout (such as when there is an "industry standard" way of
presenting the data).  I find that the easiest way to get the raw text in
a format that allows reasonable scraping is to use wget, html2text,
links or lynx to download the page as you need it.  The choice of
downloader depends on which one gives you the best "layout" of the text
to make parsing easier.  Again, try to develop some meta-description of
the text.
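
For example, something along these lines; the URL and the "Result:"
marker are invented, and lynx -dump, html2text or links -dump all do the
job, whichever gives the friendliest layout for your page:

  ' Dump the page as plain text and scan the lines for data rows.
  Public Sub Main()

    Dim sText As String
    Dim sLine As String

    Shell "lynx -dump -nolist 'http://example.com/results.html'" Wait To sText

    For Each sLine In Split(sText, "\n")
      sLine = Trim(sLine)
      ' Keep only the lines that look like data rows (made-up marker)
      If InStr(sLine, "Result:") = 1 Then Print sLine
    Next

  End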

4) Always include code to detect possible page format changes and to
describe exactly which bit of the page is no longer scrapable!  This can
save hours of work when a tiny bit changes and renders your parser
incorrect or unusable.
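
A sketch of what such a check might look like, with invented landmark
strings and file name: before parsing, verify every landmark the parser
relies on, and complain about the exact one that has vanished.

  Public Sub Main()

    ' The path is only an example; check whatever you have just downloaded
    CheckPageFormat(File.Load("cached-page.html"))

  End

  ' Raise a descriptive error if any expected landmark is missing.
  Private Sub CheckPageFormat(sHtml As String)

    Dim aLandmarks As New String[]
    Dim sLandmark As String

    aLandmarks.Add("<div class=\"result\">")
    aLandmarks.Add("<div class=\"col\">")
    aLandmarks.Add("</div>")

    For Each sLandmark In aLandmarks
      If InStr(sHtml, sLandmark) = 0 Then
        Error.Raise("Page format changed: cannot find " & sLandmark)
      Endif
    Next

  End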

5) Finally, we have encountered some pages where the target texts are
seemingly impossible to predict.  For example, one feed we use randomly
inserts advertising data inside the data table rows; only some of the
rows include this extra material and some don't.  For these, we have had
to resort to "restructuring" the semi-parsed data and writing it out to
an intermediate file.  We then try to parse that file automatically and,
if that fails, manually edit the intermediate file back into a usable
format.
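
Roughly, the flow looks like this; the separator, field count and file
name are all invented for the example, and anything that does not parse
cleanly is dumped to an intermediate file for hand-editing:

  Public Sub Main()

    ' Two good rows and one advertising row, as sample input
    ParseRows(Split("aaa\tbbb\tccc|ddd\teee\tfff|BUY OUR STUFF NOW", "|"))

  End

  ' Parse what we can; save the rest for manual clean-up.
  Public Sub ParseRows(aRows As String[])

    Dim sRow As String
    Dim aBad As New String[]

    For Each sRow In aRows
      ' A "good" row has exactly three tab-separated fields in this example
      If Split(sRow, "\t").Count = 3 Then
        Print "Parsed: "; sRow
      Else
        aBad.Add(sRow)
      Endif
    Next

    If aBad.Count > 0 Then
      File.Save("/tmp/needs-editing.txt", aBad.Join("\n"))
      Error.Raise(aBad.Count & " rows need manual editing, see /tmp/needs-editing.txt")
    Endif

  End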

Hope this gives you some ideas of how to approach your situation.  Above
all, try to design an approach before leaping into the coding stage.
Several times we have been caught out assuming that a page is "simple"
and have had to go back and rethink the whole design for that feed
because the provider made some change to the page presentation.

regards
Bruce




