[Gambas-user] How to Disassemble XML/HTML

T Lee Davidson t.lee.davidson at gmail.com
Mon Aug 10 19:08:21 CEST 2020


On 8/10/20 3:48 AM, John Rose wrote:
>> What you have there is JSON embedded in a <script> tag inside HTML.
>> Thankfully, this data has its own <script> tag which is appropriately
>> type'd as application/ld+json, so it is not so hard to extract using
>> gb.xml.html and gb.web:
>>
>>    Dim h As New HtmlDocument(your file here...)
>>    Dim el As XmlElement
>>    Dim data As New Collection[]
>>    For Each el In h.GetElementsByTagName("script")
>>      If el.Attributes["type"] <> "application/ld+json" Then Continue
>>      data.Add(JSON.Decode(el.TextContent))
>>    Next
>>
> 
> Thanks, Tobias, for your code. I tried the above code (with the Gambas components gb.net, gb.net.curl, gb.web, gb.xml, gb.xml.html included: was that overkill?) and it produced a large file named HTMLandXML.txt. I've extracted some of it to an attached file named PartOfHTMLandXML.txt. However, I have no idea what to do with it. There are largish blocks of 'code' for each TV programme. Some sample lines which show the (emboldened) data that I want to obtain are:
> line 212:  <h3 class="programme__titles"><a href="*https://www.bbc.co.uk/programmes/m00049tf*"
> lines 214-215:   ><span class="programme__title delta"><span>*Have I Got a Bit More News for You*</span></span><span class="hidden">—</span><span class="programme__subtitle centi"><span>Series 57</span>, <span>*Episode 2*</span></span></a></h3>
> lines 219-220:  <abbr title="*Episode 2 of 9*"><span datatype="xsd:int">2</span>/<span class="programme__groupsize">9</span></abbr>                <span>*Guest host Alan Johnson joins Ian Hislop and Paul Merton for the topical news quiz.*</span>
> I have no idea how to extract this data from the file. Which Gambas Component should I use? Is there an example/tutorial/book on how to extract this data in Gambas? I am a real newbie to HTML & XML etc!

What Tobi gave you is a very simple method of extracting the data. All that is needed now is to figure out exactly how to 
retrieve the relevant data.

I've attached a modified version of your program. There are now three buttons along with their event handlers and two additional 
subroutines. Plus, I have modified the Extract subroutine to show the decoded JSON data. For convenience, I moved hData to the 
top of the class file as a global variable.

When you click the Extract button, you will see that there are two collections that have been added to the hData Collection[] 
(array of collections). The second one appears to contain the data you want in the "@graph" element which is another array of 
collections.

The episode data can be accessed from hData by specifying the second element (index 1, since Gambas arrays are zero-based) and then the "@graph" element, i.e.:
hData[1]["@graph"]
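
As a rough sketch of how you could explore that data (not taken from the attached project): the loop below lists every key and value of every entry in "@graph", so you can see which keys hold the title, episode and description. It assumes hData has already been filled by the Extract subroutine, that JSON.Decode returned "@graph" as a Variant[] of Collections, and that gb.web is loaded for the JSON class. The names aGraph, cItem and vValue are only for illustration.

    ' Assumption: hData[1]["@graph"] is a Variant[] whose items are Collections
    Dim aGraph As Variant[] = hData[1]["@graph"]
    Dim cItem As Collection
    Dim vValue As Variant

    For Each cItem In aGraph
      For Each vValue In cItem
        ' cItem.Key is the key of the value currently being enumerated.
        ' JSON.Encode turns nested collections/arrays back into readable text.
        Print cItem.Key; ": "; JSON.Encode(vValue)
      Next
      Print "----"
    Next

Once you know which keys you need, you can read them directly from each cItem with cItem["key name"].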


-- 
Lee
-------------- next part --------------
A non-text attachment was scrubbed...
Name: httpClient-0.0.1.tar.gz
Type: application/gzip
Size: 14230 bytes
Desc: not available
URL: <https://lists.gambas-basic.org/pipermail/user/attachments/20200810/1c218531/attachment-0001.gz>

