[Gambas-user] screen scraping web page

T Lee Davidson t.lee.davidson at gmail.com
Thu Oct 20 21:35:16 CEST 2022


On 10/20/22 10:12, KKing wrote:
> Hi,
> I want to grab the exchange rate and time off a web page. Any suggestions on how others would go about it, or advice on why my 
> approach now seems flawed?
> I was just using e.g.
> Shell "curl 'https://uk.finance.yahoo.com/quote/GBPEUR=X'" To strResponse
> on a timer.
> Then searching strResponse for nodes of interest, this seemed to work mostly okay.
> But recently it starts okay, then after a while the content of strResponse seems off, as in the time of day and rate are nothing 
> like what they should be, and it stays like that for each subsequent curl. If I kill the program and leave it for a while it will again 
> start okay and be fine for a while but then suddenly give these off results. I know it is not the website as such (in general) 
> as a browser on a separate computer is showing the site and all is okay.
> K.

I coded up a short test that shelled out to curl to retrieve the html page once per minute for over an hour. Only once did the 
time given in the response differ from what it should have been; it lagged by one minute.

The headers sent by the server include this one:
cache-control: private, no-store, no-cache, max-age=0

So, it appears that the Yahoo server is telling clients and proxies not to cache the response. Perhaps your ISP (or some other server or proxy) is caching it anyway.
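
If you want to check that header yourself, something like the following might do. It is only a rough sketch: cache_control is a helper name I made up, and it simply picks the Cache-Control line out of whatever response headers are fed to it on stdin.

```shell
# Hypothetical helper: print the value of the Cache-Control header
# found in a block of HTTP response headers read from stdin.
cache_control() {
    # Strip the carriage returns HTTP uses, then match the header
    # name case-insensitively and print everything after the colon.
    tr -d '\r' | awk 'tolower($0) ~ /^cache-control:/ {
        sub(/^[^:]*:[ \t]*/, "")
        print
    }'
}

# Usage (needs network access; -D - dumps the headers to stdout,
# -o /dev/null discards the page body):
# curl -s -D - -o /dev/null 'https://uk.finance.yahoo.com/quote/GBPEUR=X' | cache_control
```

If the stale responses carry a different value, or extra headers added by a proxy, that would point to something between you and Yahoo.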

If that is the case, you might try appending a random query string parameter to the URL so that your ISP will see it as a 
different request and not serve a cached copy, e.g.:

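Shell "curl 'https://uk.finance.yahoo.com/quote/GBPEUR=X?rnd=" & Str(Rand(100000, 999999)) & "'" To sResponse
(Notice the parameter is introduced with ? since the URL has no query string yet, and the single quotes stay so the shell does not act on characters such as ? or &.)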
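
On the shell side, the same idea could be sketched like this. bust_url is a made-up helper name and rnd is an arbitrary parameter name; I am assuming the server ignores query parameters it does not recognise. The first parameter of a query string is introduced with ? and later ones with &, so the helper checks which one is needed.

```shell
# Hypothetical helper: append a throwaway "rnd" parameter to a URL.
# Uses "?" when the URL has no query string yet, "&" otherwise.
bust_url() {
    case $1 in
        *\?*) sep='&' ;;
        *)    sep='?' ;;
    esac
    # Epoch seconds followed by the shell's PID. Two calls within the
    # same second produce the same value, which is fine for a timer
    # that fires once a minute. (date +%s is widely supported but not
    # strictly POSIX.)
    printf '%s%srnd=%s%s\n' "$1" "$sep" "$(date +%s)" "$$"
}

# Usage (needs network access; the double quotes keep the shell from
# acting on "?" or "&" in the generated URL):
# curl -s "$(bust_url 'https://uk.finance.yahoo.com/quote/GBPEUR=X')"
```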

As an aside, I'm sure you're aware of the HttpClient class. But, in this case, shelling out to curl seems to be simpler.


-- 
Lee


