[Gambas-user] How to Disassemble XML/HTML

Wed Aug 19 08:31:45 CEST 2020

On 10/08/2020 18:09, user-request at lists.gambas-basic.org wrote:
> On 8/10/20 3:48 AM, John Rose wrote:
>> What you have there is JSON embedded in a <script> tag inside HTML.
>>> Thankfully, this data has its own <script> tag which is appropriately
>>> type'd as application/ld+json, so it is not so hard to extract using
>>> gb.xml.html and gb.web:
>>>
>>>    Dim h As New HtmlDocument(your file here...)
>>>    Dim el As XmlElement
>>>    Dim data As New Collection[]
>>>    For Each el In h.GetElementsByTagName("script")
>>>      If el.Attributes["type"] <> "application/ld+json" Then Continue
>>>      data.Add(JSON.Decode(el.TextContent))
>>>    Next
>>>
>>
>> Thanks, Tobias, for your code. I tried the above code (with the 
>> Gambas Components gb.net, gb.net.curl, gb.web, gb.xml, gb.xml.html 
>> included: was that overkill?) and it produced a large file named 
>> HTMLandXML.txt. I've extracted some of it to an attached file named 
>> PartOfHTMLandXML.txt. However, I have no idea what to do with it. 
>> There are largish blocks of 'code' for each TV programme. Some sample 
>> lines which show the (emboldened) data that I want to obtain are:
>> line 212:  <h3 class="programme__titles"><a 
>> href="*https://www.bbc.co.uk/programmes/m00049tf*"
>> lines 214-215:   ><span class="programme__title delta"><span>*Have I 
>> Got a Bit More News for You*</span></span><span 
>> class="hidden">—</span><span class="programme__subtitle 
>> centi"><span>Series 57</span>, <span>*Episode 2*</span></span></a></h3>
>> lines 219-220:  <abbr title="*Episode 2 of 9*"><span 
>> datatype="xsd:int">2</span>/<span 
>> class="programme__groupsize">9</span></abbr> <span>*Guest host Alan 
>> Johnson joins Ian Hislop and Paul Merton for the topical news 
>> quiz.*</span>
>> I have no idea how to extract this data from the file. Which Gambas 
>> Component should I use? Is there an example/tutorial/book on how to 
>> extract this data in Gambas? I am a real newbie to HTML & XML etc!
>
> What Tobi gave you is a very simple method of extracting the data. All 
> that is needed now is to figure out exactly how to retrieve the 
> relevant data.
>
> I've attached a modified version of your program. There are now three 
> buttons along with their event handlers and two additional 
> subroutines. Plus, I have modified the Extract subroutine to show the 
> decoded JSON data. For convenience, I moved hData to the top of the 
> class file as a global variable.
>
> When you click the Extract button, you will see that there are two 
> collections that have been added to the hData Collection[] (array of 
> collections). The second one appears to contain the data you want in 
> the "@graph" element which is another array of collections.
>
> The episode data can be accessed from hData by specifying the second 
> element and then the "@graph" element, ie:
> hData[1]["@graph"] 

Thanks, Lee, for the above *invaluable* help. I don't pretend to fully 
understand it but it works.

I have some questions:

1. Can you recommend a printed book and/or online tutorial to help me 
understand your code (the routines that process the HTMLandXML data, 
and things such as what the "@graph" element is) and the concepts behind 
it? Please remember that I'm a newbie to HTML & XML etc.

2. Are all of the following Gambas Components required in the attached 
httpClientExtra app (your modified httpClient app slightly changed by 
me): gb.net, gb.net.curl, gb.web, gb.xml, gb.xml.html?

3. Is there a Gambas component and/or standard coding to extract values 
from the Episodes information? I'm thinking of the identifier, 
episodeNumber, description, datePublished, name & url fields. I'd like 
to extract them into the corresponding aIdentifier, aEpisodeNumber, 
aDescription, aDatePublished, aName & aURL Gambas string arrays, for 
each Episode's set of data. Obviously I could code this myself, but it 
would be nice if there are already routine(s) written to do this kind of 
thing.

4. What coding is required to put the partOfSeries & partOfSeason 
sections (from the Prettified JSON data) immediately after the episode 
data for each Episode in the Episodes text & file?
Similar to 3, I would like to also extract some fields (description & 
name) in the partOfSeries section and some data (name) in the 
partOfSeason section for each TVEpisode in the Prettified JSON. For 
example, the values from the lines:
    description -> Series exploring behind the scenes at Longleat Estate and Safari Park
    name -> Animal Park
    name -> Summer 2020
in this part of the Prettified JSON:
@type -> TVEpisode
    identifier -> m000lwqj
    episodeNumber -> 1
    description -> Kate and Ben return to Longleat just as the Covid-19 pandemic forces the park to close.
    datePublished -> 2020-08-17
    image -> https://ichef.bbci.co.uk/images/ic/480xn/p08n899w.jpg
    name -> Episode 1
    url -> https://www.bbc.co.uk/programmes/m000lwqj
    partOfSeries:
      @type -> TVSeries
      image -> https://ichef.bbci.co.uk/images/ic/480xn/p07jtz7g.jpg
      description -> Series exploring behind the scenes at Longleat Estate and Safari Park
      identifier -> b006w6ns
      name -> Animal Park
      url -> https://www.bbc.co.uk/programmes/b006w6ns
    partOfSeason:
      @type -> TVSeason
      position -> 29
      identifier -> m000lwk9
      name -> Summer 2020
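
To make questions 3 and 4 concrete, here is a rough sketch of the kind of routine I have in mind. It is only an illustration under several assumptions: that hData is the Collection[] filled by the Extract routine, that hData[1]["@graph"] is an array whose entries are Collections keyed by the field names shown in the Prettified JSON, and that partOfSeries and partOfSeason are themselves Collections nested inside each episode. The names GetField, ExtractEpisodeFields, aSeriesName, aSeriesDescription and aSeasonName are made up for this example and do not come from any Gambas component.

```gambas
' Hypothetical helper (not part of any Gambas component): returns the value
' stored under sKey as a string, or "" if the collection or the key is missing.
Private Function GetField(cData As Collection, sKey As String) As String

  If IsNull(cData) Then Return ""
  If Not cData.Exist(sKey) Then Return ""
  Return CStr(cData[sKey])

End

Public Sub ExtractEpisodeFields(hData As Collection[])

  Dim vItem As Variant
  Dim cEpisode As Collection
  Dim iInd As Integer
  Dim aIdentifier As New String[]
  Dim aEpisodeNumber As New String[]
  Dim aDescription As New String[]
  Dim aDatePublished As New String[]
  Dim aName As New String[]
  Dim aURL As New String[]
  ' Assumed array names for the nested values asked about in question 4
  Dim aSeriesName As New String[]
  Dim aSeriesDescription As New String[]
  Dim aSeasonName As New String[]

  ' Assumption: the second decoded block is the one holding "@graph"
  If hData.Count < 2 Then Return
  If Not hData[1].Exist("@graph") Then Return

  For Each vItem In hData[1]["@graph"]
    ' Assumption: every "@graph" entry is a Collection
    cEpisode = vItem
    If GetField(cEpisode, "@type") <> "TVEpisode" Then Continue

    ' Top-level episode fields (question 3)
    aIdentifier.Add(GetField(cEpisode, "identifier"))
    aEpisodeNumber.Add(GetField(cEpisode, "episodeNumber"))
    aDescription.Add(GetField(cEpisode, "description"))
    aDatePublished.Add(GetField(cEpisode, "datePublished"))
    aName.Add(GetField(cEpisode, "name"))
    aURL.Add(GetField(cEpisode, "url"))

    ' Nested series and season fields (question 4); a missing section
    ' simply yields "" so all arrays keep the same length
    aSeriesName.Add(GetField(cEpisode["partOfSeries"], "name"))
    aSeriesDescription.Add(GetField(cEpisode["partOfSeries"], "description"))
    aSeasonName.Add(GetField(cEpisode["partOfSeason"], "name"))
  Next

  ' Show one line per episode so the result can be checked by eye
  For iInd = 0 To aName.Max
    Print aName[iInd]; " | "; aSeriesName[iInd]; " | "; aSeasonName[iInd]
  Next

End
```

Because every array gets exactly one entry per TVEpisode, index N in each array refers to the same episode. The arrays are local here only to keep the sketch short; in the real app they would presumably live at the top of the class file, next to hData.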

-------------- next part --------------
A non-text attachment was scrubbed...
Name: httpClientExtra.tar.gz
Type: application/gzip
Size: 13465 bytes
Desc: not available
URL: <https://lists.gambas-basic.org/pipermail/user/attachments/20200819/911dbc3c/attachment-0001.gz>