[Gambas-user] How to Disassemble XML/HTML
John Rose
john.aaron.rose at mailbox.org
Fri Aug 21 20:46:45 CEST 2020
On 20/08/2020 11:12, user-request at lists.gambas-basic.org wrote:
> Subject:
> Re: [Gambas-user] How to Disassemble XML/HTML
> From:
> T Lee Davidson <t.lee.davidson at gmail.com>
> Date:
> 19/08/2020, 15:18
>
> To:
> user at lists.gambas-basic.org
>
>
> On 8/19/20 2:31 AM, John Rose wrote:
>> I have some questions:
>>
>> 1. Can you recommend a printed book and/or online tutorial to help me
>> understand your coding (in the routines processing the HTML and XML
>> data, such as what the "@graph" element is) and the concepts behind
>> it? Please remember that I'm a newbie to HTML & XML etc.
>>
>> 2. Are all of the following Gambas Components required in the
>> attached httpClientExtra app (your modified httpClient app slightly
>> changed by me): gb.net, gb.net.curl, gb.web, gb.xml, gb.xml.html?
>>
>> 3. Is there a Gambas component and/or standard coding to extract
>> values from the Episodes information? I'm thinking of the identifier,
>> episodeNumber, description, datePublished, name & url fields. I'd
>> like to extract them into the corresponding aIdentifier,
>> aEpisodeNumber, aDescription, aDatePublished, aName & aURL Gambas
>> string arrays, for each Episode's set of data. Obviously I could code
>> this myself, but it would be nice if there are already routine(s)
>> written to do this kind of thing.
>>
>> 4. What coding is required to put the partOfSeries & partOfSeason
>> sections (from the Prettified JSON data) immediately after the
>> episode data for each Episode in the Episodes text & file?
>> Similar to 3, I would like to also extract some fields (description &
>> name) in the partOfSeries section and some data (name) in the
>> partOfSeason section for each TVEpisode in the Prettified JSON. For
>> example, the values from the lines:
>> description -> Series exploring behind the scenes at Longleat Estate
>> and Safari Park
>> name -> Animal Park
>> name -> Summer 2020
>> in this part of the Prettified JSON:
>> @type -> TVEpisode
>>   identifier -> m000lwqj
>>   episodeNumber -> 1
>>   description -> Kate and Ben return to Longleat just as the
>> Covid-19 pandemic forces the park to close.
>>   datePublished -> 2020-08-17
>>   image -> https://ichef.bbci.co.uk/images/ic/480xn/p08n899w.jpg
>>   name -> Episode 1
>>   url -> https://www.bbc.co.uk/programmes/m000lwqj
>>   partOfSeries:
>>     @type -> TVSeries
>>     image -> https://ichef.bbci.co.uk/images/ic/480xn/p07jtz7g.jpg
>>     description -> Series exploring behind the scenes at Longleat
>> Estate and Safari Park
>>     identifier -> b006w6ns
>>     name -> Animal Park
>>     url -> https://www.bbc.co.uk/programmes/b006w6ns
>>     partOfSeason:
>>       @type -> TVSeason
>>       position -> 29
>>       identifier -> m000lwk9
>>       name -> Summer 2020
>
> To answer your #2 question: gb.net.curl provides httpClient (which the
> app uses) and requires gb.net. So both are required. gb.xml.html
> provides HtmlDocument (which the app uses) and requires gb.xml. So
> both those are also required. gb.web provides the JSON.Decode function
> (which the app uses). gb.util.web also provides the JSON.Decode
> function. So one or the other is required.
>
> For your question #1: The code Tobi provided loads the web page into
> an HtmlDocument object and then extracts the embedded JSON data, and
> with JSON.Decode, converts it into a Gambas representation (ie. Gambas
> datatypes) of the JSON data. So, you're no longer working with
> HTML/XML. You're working with Gambas datatypes representing the JSON
> data from the web page.
>
> Therefore, you should focus on understanding JSON.
> https://www.json.org/json-en.html
> https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Objects/JSON
>
> #3: The app already uses the component(s) necessary to extract the
> values you're wanting. As for standard coding to do that, it depends
> on exactly what you mean by that. You need to determine the actual
> structure of the data so you know how to reference whatever particular
> element contains the info you want to extract. Then you can use the
> Gambas representation of that JSON data to retrieve the info from a
> standard Gambas datatype which in this case is hData as a
> multidimensional array of collections. Not clear? See #4.
>
> #4: The reason I 'prettified' the data with indentation is to show the
> additional (sub-)dimensions. To directly access, for example, the type
> and description of a partOfSeries we would use:
> hData[1]["@graph"]["partOfSeries"]["@type"] , and
> hData[1]["@graph"]["partOfSeries"]["description"]
>
> Since we can see that partOfSeries is a single-dimensional collection
> containing only string values, we can easily enumerate over it:
> For Each sElement As String In hData[1]["@graph"]["partOfSeries"]
>   Print sElement
> Next
>
> For one more example, to directly access the broadcaster's legal name,
> we would use:
> hData[1]["@graph"]["publication"]["publishedOn"]["broadcaster"]["legalName"]
>
>
> Now since the "publication" element is multidimensional, enumerating
> over it and printing the value of its elements would cause an error
> when trying to print the value of "publishedOn" which is itself
> another Collection[]. This error could be prevented if we check the
> type of each element [with TypeOf(), Object.Type(), or Object.Is()]
> and do not try to print anything that is not a string.
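> A minimal sketch of that guarded enumeration (assuming hData is
> structured as shown above; the variable name vElement is mine):
>
> ```
> ' Sketch: walk the "publication" collection and only print elements
> ' that are plain strings, skipping sub-collections such as
> ' "publishedOn" so no error is raised.
> Dim vElement As Variant
>
> For Each vElement In hData[1]["@graph"]["publication"]
>   If TypeOf(vElement) = gb.String Then
>     Print vElement
>   Endif
> Next
> ```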
>
> It may be easier for you to see the distinction of the sub-dimensions
> if you set iTabWidth at line 101 to 4.
I've tried to obtain the various field values for some @type elements
and some partOfSeries elements. However, it now stops, I think, at the
line shown in the code below. I suspect this is because the first
Episode 'extracted' has no partOfSeries section. How do I test for that?
aSeriesName.Add(cEpisode["partOfSeries"]["name"])
All the aEpisode... & aSeries... variables are defined as global string
arrays, e.g. Private aSeriesName As String[]
I still don't fully understand this extraction of JSON field values, but
I will take a look at the above 2 URLs of JSON information.
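One way to guard the failing line (a sketch, not tested against my app:
it assumes each cEpisode is a Collection, where reading an absent key
yields Null, so IsNull() or Collection.Exist() can test for it):

```
' Sketch: only read partOfSeries fields when that section is present.
If Not IsNull(cEpisode["partOfSeries"]) Then
  aSeriesName.Add(cEpisode["partOfSeries"]["name"])
  aSeriesDescription.Add(cEpisode["partOfSeries"]["description"])
Else
  aSeriesName.Add("")          ' keep the arrays aligned per episode
  aSeriesDescription.Add("")
Endif
```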
Private Procedure ExtractEpisodes()

  Dim caEpisodes As Collection[]
  Dim cEpisode As Collection
  Dim sTextContent As String

  sTextContent = ""
  If hData.Count = 0 Then
    QuitAfterError("No Episodes in Week " & sWeekNumber, "for " & sConnectMedium & " " & sConnectChannel)
  End If
  caEpisodes = hData[1]["@graph"]
  For Each cEpisode In caEpisodes
    aEpisodeName.Add(cEpisode["name"])
    aEpisodeDescription.Add(cEpisode["description"])
    aEpisodeDatePublished.Add(cEpisode["datePublished"])
    aEpisodeIdentifier.Add(cEpisode["identifier"])
    aSeriesName.Add(cEpisode["partOfSeries"]["name"])
    aSeriesDescription.Add(cEpisode["partOfSeries"]["description"])
    If Left(UCase(cEpisode["partOfSeries"]["name"]), 12) = "LINE OF DUTY" Then
      Print "Line of Duty"
      Print "EpisodeName=" & cEpisode["name"]
      Print "Episode Description=" & cEpisode["description"]
      Print "DatePublished=" & cEpisode["datePublished"]
      Print "Identifier=" & cEpisode["identifier"]
      Print "SeriesName=" & cEpisode["partOfSeries"]["name"]
      Print "SeriesDescription=" & cEpisode["partOfSeries"]["description"]
    Endif
    For Each sInfo As Variant In cEpisode
      If TypeOf(sInfo) <> gb.String Then Continue
      sTextContent &= cEpisode.Key & " -> " & sInfo & "\n"
    Next
    sTextContent &= "\n"
  Next

End
--
John
0044 1902 331266
0044 7476 041418