[Gambas-user] How to Disassemble XML/HTML
John Rose
john.aaron.rose at mailbox.org
Fri Aug 21 20:46:45 CEST 2020
On 20/08/2020 11:12, user-request at lists.gambas-basic.org wrote:
> Subject:
> Re: [Gambas-user] How to Disassemble XML/HTML
> From:
> T Lee Davidson <t.lee.davidson at gmail.com>
> Date:
> 19/08/2020, 15:18
>
> To:
> user at lists.gambas-basic.org
>
>
> On 8/19/20 2:31 AM, John Rose wrote:
>> I have some questions:
>>
>> 1. Can you recommend a printed book and/or online tutorial to help me
>> understand your coding (in the routines processing the HTML and XML
>> data, such as what the "@graph" element is) and the concepts behind
>> it? Please remember that I'm a newbie to HTML & XML etc.
>>
>> 2. Are all of the following Gambas Components required in the
>> attached httpClientExtra app (your modified httpClient app slightly
>> changed by me): gb.net, gb.net.curl, gb.web, gb.xml, gb.xml.html?
>>
>> 3. Is there a Gambas component and/or standard coding to extract
>> values from the Episodes information? I'm thinking of the identifier,
>> episodeNumber, description, datePublished, name & url fields. I'd
>> like to extract them into the corresponding aIdentifier,
>> aEpisodeNumber, aDescription, aDatePublished, aName & aURL Gambas
>> string arrays, for each Episode's set of data. Obviously I could code
>> this myself, but it would be nice if there are already routine(s)
>> written to do this kind of thing.
>>
>> 4. What coding is required to put the partOfSeries & partOfSeason
>> sections (from the Prettified JSON data) immediately after the
>> episode data for each Episode in the Episodes text & file?
>> Similar to 3, I would like to also extract some fields (description &
>> name) in the partOfSeries section and some data (name) in the
>> partOfSeason section for each TVEpisode in the Prettified JSON. For
>> example, the values from the lines:
>> description -> Series exploring behind the scenes at Longleat Estate
>> and Safari Park
>> name -> Animal Park
>> name -> Summer 2020
>> in this part of the Prettified JSON:
>> @type -> TVEpisode
>>   identifier -> m000lwqj
>>   episodeNumber -> 1
>>   description -> Kate and Ben return to Longleat just as the
>> Covid-19 pandemic forces the park to close.
>>   datePublished -> 2020-08-17
>>   image -> https://ichef.bbci.co.uk/images/ic/480xn/p08n899w.jpg
>>   name -> Episode 1
>>   url -> https://www.bbc.co.uk/programmes/m000lwqj
>>   partOfSeries:
>>     @type -> TVSeries
>>     image -> https://ichef.bbci.co.uk/images/ic/480xn/p07jtz7g.jpg
>>     description -> Series exploring behind the scenes at Longleat
>> Estate and Safari Park
>>     identifier -> b006w6ns
>>     name -> Animal Park
>>     url -> https://www.bbc.co.uk/programmes/b006w6ns
>>     partOfSeason:
>>       @type -> TVSeason
>>       position -> 29
>>       identifier -> m000lwk9
>>       name -> Summer 2020
>
> To answer your #2 question: gb.net.curl provides httpClient (which the
> app uses) and requires gb.net. So both are required. gb.xml.html
> provides HtmlDocument (which the app uses) and requires gb.xml. So
> both those are also required. gb.web provides the JSON.Decode function
> (which the app uses). gb.util.web also provides the JSON.Decode
> function. So one or the other is required.
>
> For your question #1: The code Tobi provided loads the web page into
> an HtmlDocument object and then extracts the embedded JSON data, and
> with JSON.Decode, converts it into a Gambas representation (ie. Gambas
> datatypes) of the JSON data. So, you're no longer working with
> HTML/XML. You're working with Gambas datatypes representing the JSON
> data from the web page.
>
> Therefore, you should focus on understanding JSON.
> https://www.json.org/json-en.html
> https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Objects/JSON
>
> #3: The app already uses the component(s) necessary to extract the
> values you're wanting. As for standard coding to do that, it depends
> on exactly what you mean by that. You need to determine the actual
> structure of the data so you know how to reference whatever particular
> element contains the info you want to extract. Then you can use the
> Gambas representation of that JSON data to retrieve the info from a
> standard Gambas datatype which in this case is hData as a
> multidimensional array of collections. Not clear? See #4.
>
> #4: The reason I 'prettified' the data with indentation is to show the
> additional (sub-)dimensions. To directly access, for example, the type
> and description of a partOfSeries we would use:
> hData[1]["@graph"]["partOfSeries"]["@type"] , and
> hData[1]["@graph"]["partOfSeries"]["description"]
>
> Since we can see that partOfSeries is a single-dimensional collection
> containing only string values, we can easily enumerate over it:
> For Each sElement As String In hData[1]["@graph"]["partOfSeries"]
>   Print sElement
> Next
>
> For one more example, to directly access the broadcaster's legal name,
> we would use:
> hData[1]["@graph"]["publication"]["publishedOn"]["broadcaster"]["legalName"]
>
>
> Now since the "publication" element is multidimensional, enumerating
> over it and printing the value of its elements would cause an error
> when trying to print the value of "publishedOn" which is itself
> another Collection[]. This error could be prevented if we check the
> type of each element [with TypeOf(), Object.Type(), or Object.Is()]
> and do not try to print anything that is not a string.
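> A minimal sketch of that guarded enumeration (assuming hData is
> structured as shown above; the variable name vElement is mine):
>
> ```
> ' Sketch: walk the "publication" collection and only print elements
> ' that are plain strings, skipping sub-collections such as
> ' "publishedOn" so no error is raised.
> Dim vElement As Variant
>
> For Each vElement In hData[1]["@graph"]["publication"]
>   If TypeOf(vElement) = gb.String Then
>     Print vElement
>   Endif
> Next
> ```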
>
> It may be easier for you to see the distinction of the sub-dimensions
> if you set iTabWidth at line 101 to 4.
I've tried to obtain the various field values for some @type elements
and some partOfSeries elements. However, it now stops, I think, at the
line shown in the code below. I suspect this is because the first
Episode 'extracted' has no partOfSeries section. How do I test for that?
aSeriesName.Add(cEpisode["partOfSeries"]["name"])
All the aEpisode... & aSeries... variables are defined as global string
arrays, e.g. Private aSeriesName As String[]
I still don't fully understand this extraction of JSON field values, but
I will take a look at the above 2 URLs of JSON information.
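One way to guard the failing line (a sketch, not tested against my app:
it assumes each cEpisode is a Collection, where reading an absent key
yields Null, so IsNull() or Collection.Exist() can test for it):

```
' Sketch: only read partOfSeries fields when that section is present.
If Not IsNull(cEpisode["partOfSeries"]) Then
  aSeriesName.Add(cEpisode["partOfSeries"]["name"])
  aSeriesDescription.Add(cEpisode["partOfSeries"]["description"])
Else
  aSeriesName.Add("")          ' keep the arrays aligned per episode
  aSeriesDescription.Add("")
Endif
```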
Private Procedure ExtractEpisodes()

  Dim caEpisodes As Collection[]
  Dim cEpisode As Collection
  Dim sTextContent As String

  sTextContent = ""
  If hData.Count = 0 Then
    QuitAfterError("No Episodes in Week " & sWeekNumber, "for " & sConnectMedium & " " & sConnectChannel)
  End If
  caEpisodes = hData[1]["@graph"]
  For Each cEpisode In caEpisodes
    aEpisodeName.Add(cEpisode["name"])
    aEpisodeDescription.Add(cEpisode["description"])
    aEpisodeDatePublished.Add(cEpisode["datePublished"])
    aEpisodeIdentifier.Add(cEpisode["identifier"])
    aSeriesName.Add(cEpisode["partOfSeries"]["name"])
    aSeriesDescription.Add(cEpisode["partOfSeries"]["description"])
    If Left(UCase(cEpisode["partOfSeries"]["name"]), 12) = "LINE OF DUTY" Then
      Print "Line of Duty"
      Print "EpisodeName=" & cEpisode["name"]
      Print "Episode Description=" & cEpisode["description"]
      Print "DatePublished=" & cEpisode["datePublished"]
      Print "Identifier=" & cEpisode["identifier"]
      Print "SeriesName=" & cEpisode["partOfSeries"]["name"]
      Print "SeriesDescription=" & cEpisode["partOfSeries"]["description"]
    Endif
    For Each sInfo As Variant In cEpisode
      If TypeOf(sInfo) <> gb.String Then Continue
      sTextContent &= cEpisode.Key & " -> " & sInfo & "\n"
    Next
    sTextContent &= "\n"
  Next

End
--
John
0044 1902 331266
0044 7476 041418