[Gambas-user] scraping html
Randall Morgan
rmorgan62 at ...626...
Sat Jul 13 14:45:19 CEST 2013
I too have done a lot of data scraping over the past few years. I think
picking the right tool for the job will ease your development. I have many
Python and Java tools and have played with the Gambas parser, but I have
found very little that matches the ease of development I find with Python
and Selenium. Selenium is not just a scraping tool. In fact, it wasn't meant
for that at all. It is a browser automation tool and website test
framework. With it, I've had little trouble dealing with typical changes in
content. It is also great for comparing the page code sent to different
browsers. BeautifulSoup is great for well-structured pages, but once that
structure is lost it often fails. xmllib, htmllib and other Python
modules just don't seem to match the productivity I find with Selenium.
It all comes down to how general your solution needs to be. Is this a
one-off scrape or something that you intend to do over a long period of
time? Do you know Python or Java, or can you learn one quickly? Must your
solution scale to large projects, or just this one use?
So answer these questions and then review the options. If it is something
the Gambas parser can handle, then use it. Or, if the page is very stable
and well structured, write your own parser. A basic parser is not difficult
to write. Search the internet for Jack Crenshaw's article on building a
simple parser. However, if the page is complex and this is a long-term
project, you may want to consider a more powerful and stable solution.
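To give a rough idea of how small such a parser can be for a stable page,
here is a minimal sketch in Python using only the standard library's
html.parser module. It targets the div layout from Shane's original
question. The names ColumnExtractor, extract_columns and SAMPLE are made up
for this illustration, and the sketch assumes the "col" divs are not nested
inside each other.

```python
from html.parser import HTMLParser


class ColumnExtractor(HTMLParser):
    """Collect the text of each <div class="col"> inside a <div class="result">."""

    def __init__(self):
        super().__init__()
        self.rows = []       # one list of column texts per "result" div
        self._stack = []     # kind of every open <div>: "result", "col" or ""
        self._buffer = None  # text pieces of the "col" div currently being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        kind = ""
        if "result" in classes:
            kind = "result"
            self.rows.append([])
        elif "col" in classes and "result" in self._stack:
            kind = "col"
            self._buffer = []
        self._stack.append(kind)

    def handle_data(self, data):
        # Only keep text while we are inside a "col" div.
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "div" or not self._stack:
            return
        kind = self._stack.pop()
        if kind == "col" and self._buffer is not None:
            # Collapse newlines and repeated spaces into single spaces.
            text = " ".join("".join(self._buffer).split())
            self.rows[-1].append(text)
            self._buffer = None


def extract_columns(html):
    """Return a list of rows, each row being the list of its column texts."""
    parser = ColumnExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# The sample markup from the original question.
SAMPLE = """
<div class="result">
<div class="col">Text I Want</div>
<div class="col">
And Some More i Want
</div>
<div class="col">
And The last bit
</div>
</div>
"""

if __name__ == "__main__":
    for row in extract_columns(SAMPLE):
        print(row)
```

This only works while the page keeps that exact class structure, which is
the trade-off being discussed here.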
As Bruce said, put in lots of tests along the way, because some pages do
change constantly. Having a reporting system that lets you locate such
changes is very helpful in a high-volume production environment.
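One simple form such a test could take is a structural check that runs
before any data is extracted. The sketch below is only an illustration in
Python with the standard library. The names ClassCounter,
structure_problems and EXPECTED, and the thresholds themselves, are
hypothetical. It counts the CSS classes found on div elements and reports
any class that appears less often than expected.

```python
from collections import Counter
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count how often each CSS class appears on <div> elements."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.counts.update((dict(attrs).get("class") or "").split())


def structure_problems(html, expected_minimums):
    """Return human-readable problems; an empty list means the page looks OK."""
    counter = ClassCounter()
    counter.feed(html)
    counter.close()
    problems = []
    for css_class, minimum in expected_minimums.items():
        found = counter.counts[css_class]
        if found < minimum:
            problems.append(
                f"expected at least {minimum} div.{css_class}, found {found}"
            )
    return problems


# Hypothetical thresholds for a page laid out like the sample in this thread.
EXPECTED = {"result": 1, "col": 3}

if __name__ == "__main__":
    good = (
        '<div class="result"><div class="col">a</div>'
        '<div class="col">b</div><div class="col">c</div></div>'
    )
    changed = '<section class="result"><span class="col">a</span></section>'
    print(structure_problems(good, EXPECTED))
    print(structure_problems(changed, EXPECTED))
```

The returned messages could be written to a log or mailed out, so a change
in the page layout shows up as a report rather than as silently missing
data.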
Hope this helps
On Sat, Jul 13, 2013 at 2:33 AM, Fabien Bodard <gambas.fr at ...626...> wrote:
> Send me an example URL for the page
> On 13 Jul 2013 10:52, "Shane" <shanep1967 at ...169...> wrote:
>
> > On 13/07/13 18:33, Fabien Bodard wrote:
> > > There is a parsing tool in Gambas for HTML.
> > >
> > > gb.xml.html
> > >
> > > It's our own HTML DOM parser. It can generate well-formed HTML5 pages
> > > and/or parse existing HTML pages.
> > >
> > > It's one of the fastest parsers I know.
> > >
> > > Take a look at it, and if you need, I can show you some examples.
> > > On 13 Jul 2013 08:20, "Caveat" <Gambas at ...1950...> wrote:
> > >
> > >> You need to use the right tool for the job. I find the python tool
> > >> BeautifulSoup one of the best for parsing and extracting data from web
> > >> pages.
> > >>
> > >> http://www.crummy.com/software/BeautifulSoup/
> > >>
> > >> Kind regards,
> > >> Caveat
> > >>
> > >> On 12/07/13 09:01, Shane wrote:
> > >>> Hi everyone
> > >>>
> > >>> i am trying to get some info from a web page in the format of
> > >>>
> > >>> <div class="result">
> > >>> <div class="col">Text I Want</div>
> > >>> <div class="col">
> > >>> And Some More i Want
> > >>> </div>
> > >>> <div class="col">
> > >>> And The last bit
> > >>> </div>
> > >>> </div>
> > >>>
> > >>> what would be the best way to go about this? i have tried a few ways
> > >>> but i feel there must be an easy way to do this
> > >>>
> > >>> thanks shane
> > >>>
> > >>>
> > >>> ------------------------------------------------------------------------------
> > >>> See everything from the browser to the database with AppDynamics
> > >>> Get end-to-end visibility with application monitoring from AppDynamics
> > >>> Isolate bottlenecks and diagnose root cause in seconds.
> > >>> Start your free trial of AppDynamics Pro today!
> > >>> http://pubads.g.doubleclick.net/gampad/clk?id=48808831&iu=/4140/ostg.clktrk
> > >>> _______________________________________________
> > >>> Gambas-user mailing list
> > >>> Gambas-user at lists.sourceforge.net
> > >>> https://lists.sourceforge.net/lists/listinfo/gambas-user
> > >>>
> > >>
> > >>
> > >>
> > >
> > >
> > thanks everyone for the replies
> > and i would like some examples, thanks Fabien
> >
> >
> >
> >
> >
>
>
--
If you ask me if it can be done. The answer is YES, it can always be done.
The correct questions however are... What will it cost, and how long will
it take?