[Gambas-user] How to Disassemble XML/HTML

T Lee Davidson t.lee.davidson at gmail.com
Wed Aug 19 16:18:30 CEST 2020


On 8/19/20 2:31 AM, John Rose wrote:
> I have some questions:
> 
> 1. Can you recommend a printed book and/or online tutorial to help me understand your coding (in the routines processing the 
> HTML and XML data, such as what the "@graph" element is) and the concepts behind it? Please remember that I'm a newbie to HTML & XML 
> etc.
> 
> 2. Are all of the following Gambas Components required in the attached httpClientExtra app (your modified httpClient app 
> slightly changed by me): gb.net, gb.net.curl, gb.web, gb.xml, gb.xml.html?
> 
> 3. Is there a Gambas component and/or standard coding to extract values from the Episodes information? I'm thinking of the 
> identifier, episodeNumber, description, datePublished, name & url fields. I'd like to extract them into the corresponding 
> aIdentifier, aEpisodeNumber, aDescription, aDatePublished, aName & aURL Gambas string arrays, for each Episode's set of data. 
> Obviously I could code this myself, but it would be nice if there are already routine(s) written to do this kind of thing.
> 
> 4. What coding is required to put the partOfSeries & partOfSeason sections (from the Prettified JSON data) immediately after the 
> episode data for each Episode in the Episodes text & file?
> Similar to 3, I would like to also extract some fields (description & name) in the partOfSeries section and some data (name) in 
> the partOfSeason section for each TVEpisode in the Prettified JSON. For example, the values from the lines:
> description -> Series exploring behind the scenes at Longleat Estate and Safari Park
> name -> Animal Park
> name -> Summer 2020
> in this part of Prettified JSON:
> @type -> TVEpisode
>     identifier -> m000lwqj
>     episodeNumber -> 1
>     description -> Kate and Ben return to Longleat just as the Covid-19 pandemic forces the park to close.
>     datePublished -> 2020-08-17
>     image -> https://ichef.bbci.co.uk/images/ic/480xn/p08n899w.jpg
>     name -> Episode 1
>     url -> https://www.bbc.co.uk/programmes/m000lwqj
>     partOfSeries:
>       @type -> TVSeries
>       image -> https://ichef.bbci.co.uk/images/ic/480xn/p07jtz7g.jpg
>       description -> Series exploring behind the scenes at Longleat Estate and Safari Park
>       identifier -> b006w6ns
>       name -> Animal Park
>       url -> https://www.bbc.co.uk/programmes/b006w6ns
>     partOfSeason:
>       @type -> TVSeason
>       position -> 29
>       identifier -> m000lwk9
>       name -> Summer 2020

To answer your #2 question: gb.net.curl provides HttpClient (which the app uses) and requires gb.net, so both are required. 
gb.xml.html provides HtmlDocument (which the app uses) and requires gb.xml, so both of those are also required. gb.web provides the 
JSON.Decode function (which the app uses), as does gb.util.web, so one or the other is required.

For your question #1: The code Tobi provided loads the web page into an HtmlDocument object, extracts the embedded JSON 
data, and, with JSON.Decode, converts it into a Gambas representation (i.e. Gambas datatypes) of the JSON data. So you are no 
longer working with HTML/XML; you are working with Gambas datatypes representing the JSON data from the web page.
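As a minimal sketch of that decode step (the JSON string here is a made-up stand-in for what gets pulled out of the page):

```gambas
' Decode a JSON string into native Gambas datatypes:
' JSON objects become Collections, JSON arrays become Variant arrays.
Dim sJson As String = "{\"@type\": \"TVEpisode\", \"episodeNumber\": 1, \"name\": \"Episode 1\"}"
Dim cData As Collection

cData = JSON.Decode(sJson)
Print cData["@type"]  ' prints: TVEpisode
Print cData["name"]   ' prints: Episode 1
```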

Therefore, you should focus on understanding JSON.
https://www.json.org/json-en.html
https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Objects/JSON

#3: The app already uses the component(s) necessary to extract the values you want. As for standard coding to do it, that 
depends on exactly what you mean. You need to determine the actual structure of the data so you know how to reference 
whichever element contains the info you want to extract. Then you can use the Gambas representation of that JSON data 
to retrieve the info from a standard Gambas datatype, which in this case is hData, a multidimensional array of collections. Not 
clear? See #4.
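To make that concrete, here is an illustrative (untested) sketch of pulling the episode fields into string arrays. The path into hData is an assumption based on the prettified output above, so verify the actual structure in the debugger first:

```gambas
' Illustrative sketch: assumes each hData entry holds an "@graph"
' collection describing one TVEpisode, as in the prettified output.
Dim aIdentifier As New String[]
Dim aEpisodeNumber As New String[]
Dim aName As New String[]
Dim cEntry As Collection
Dim cEpisode As Collection

For Each cEntry In hData
  cEpisode = cEntry["@graph"]
  If cEpisode["@type"] = "TVEpisode" Then
    aIdentifier.Add(cEpisode["identifier"])
    aEpisodeNumber.Add(Str(cEpisode["episodeNumber"]))
    aName.Add(cEpisode["name"])
  Endif
Next
```

The aDescription, aDatePublished and aURL arrays would be filled the same way from their respective keys.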

#4: The reason I 'prettified' the data with indentation is to show the additional (sub-)dimensions. To directly access, for 
example, the type and description of a partOfSeries we would use:
hData[1]["@graph"]["partOfSeries"]["@type"] , and
hData[1]["@graph"]["partOfSeries"]["description"]

Since we can see that partOfSeries is a single-dimensional collection containing only string values, we can easily enumerate 
over it:
For Each sElement As String In hData[1]["@graph"]["partOfSeries"]
   Print sElement
Next

For one more example, to directly access the broadcaster's legal name, we would use:
hData[1]["@graph"]["publication"]["publishedOn"]["broadcaster"]["legalName"]

Now, since the "publication" element is multidimensional, enumerating over it and printing the value of its elements would cause 
an error when trying to print the value of "publishedOn", which is itself another Collection. This error can be prevented by 
checking the type of each element [with TypeOf(), Object.Type(), or Object.Is()] and not trying to print anything that is not a 
string.
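One way to sketch that check, using TypeOf() (untested against the real data):

```gambas
' Enumerate "publication", printing only simple values and skipping
' nested Collections so that Print does not fail on them.
Dim vElement As Variant

For Each vElement In hData[1]["@graph"]["publication"]
  If TypeOf(vElement) <> gb.Object Then
    Print vElement
  Endif
Next
```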

It may be easier for you to see the distinction of the sub-dimensions if you set iTabWidth at line 101 to 4.


-- 
Lee

