[Gambas-user] How to Disassemble XML/HTML
John Rose
john.aaron.rose at mailbox.org
Wed Aug 19 08:31:45 CEST 2020
On 10/08/2020 18:09, user-request at lists.gambas-basic.org wrote:
> On 8/10/20 3:48 AM, John Rose wrote:
>> What you have there is JSON embedded in a <script> tag inside HTML.
>>> Thankfully, this data has its own <script> tag which is appropriately
>>> type'd as application/ld+json, so it is not so hard to extract using
>>> gb.xml.html and gb.web:
>>>
>>> Dim h As New HtmlDocument(your file here...)
>>> Dim el As XmlElement
>>> Dim data As New Collection[]
>>> For Each el In h.GetElementsByTagName("script")
>>> If el.Attributes["type"] <> "application/ld+json" Then Continue
>>> data.Add(JSON.Decode(el.TextContent))
>>> Next
>>>
>>
>> Thanks, Tobias, for your code. I tried the above code (with the
>> Gambas Components gb.net, gb.net.curl, gb.web, gb.xml, gb.xml.html
>> included: was that overkill?) and it produced a large file named
>> HTMLandXML.txt. I've extracted some of it to an attached file named
>> PartOfHTMLandXML.txt. However, I have no idea what to do with it.
>> There are largish blocks of 'code' for each TV programme. Some sample
>> lines which show the (emboldened) data that I want to obtain are:
>> line 212: <h3 class="programme__titles"><a
>> href="*https://www.bbc.co.uk/programmes/m00049tf*"
>> lines 214-215: ><span class="programme__title delta"><span>*Have I
>> Got a Bit More News for You*</span></span><span
>> class="hidden">—</span><span class="programme__subtitle
>> centi"><span>Series 57</span>, <span>*Episode 2*</span></span></a></h3>
>> lines 219-220: <abbr title="*Episode 2 of 9*"><span
>> datatype="xsd:int">2</span>/<span
>> class="programme__groupsize">9</span></abbr> <span>*Guest host Alan
>> Johnson joins Ian Hislop and Paul Merton for the topical news
>> quiz.*</span>
>> I have no idea how to extract this data from the file. Which Gambas
>> Component should I use? Is there an example/tutorial/book on how to
>> extract this data in Gambas? I am a real newbie to HTML & XML etc!
>
> What Tobi gave you is a very simple method of extracting the data. All
> that is needed now is to figure out exactly how to retrieve the
> relevant data.
>
> I've attached a modified version of your program. There are now three
> buttons along with their event handlers and two additional
> subroutines. Plus, I have modified the Extract subroutine to show the
> decoded JSON data. For convenience, I moved hData to the top of the
> class file as a global variable.
>
> When you click the Extract button, you will see that there are two
> collections that have been added to the hData Collection[] (array of
> collections). The second one appears to contain the data you want in
> the "@graph" element which is another array of collections.
>
> The episode data can be accessed from hData by specifying the second
> element and then the "@graph" element, ie:
> hData[1]["@graph"]
Thanks, Lee, for the above *invaluable* help. I don't pretend to fully
understand it but it works.
I have some questions:
1. Can you recommend a printed book and/or online tutorial to help me
understand your coding (in the routines processing the HTMLandXML data
such as what the "@graph" element is) and the concepts behind it? Please
remember that I'm a newbie to HTML & XML etc.
2. Are all of the following Gambas Components required in the attached
httpClientExtra app (your modified httpClient app slightly changed by
me): gb.net, gb.net.curl, gb.web, gb.xml, gb.xml.html?
3. Is there a Gambas component and/or standard coding to extract values
from the Episodes information? I'm thinking of the identifier,
episodeNumber, description, datePublished, name & url fields. I'd like
to extract them into the corresponding aIdentifier, aEpisodeNumber,
aDescription, aDatePublished, aName & aURL Gambas string arrays, for
each Episode's set of data. Obviously I could code this myself, but it
would be nice if there are already routine(s) written to do this kind of
thing.
4. What coding is required to put the partOfSeries & partOfSeason
sections (from the Prettified JSON data) immediately after the episode
data for each Episode in the Episodes text & file?
Similar to 3, I would like to also extract some fields (description &
name) in the partOfSeries section and some data (name) in the
partOfSeason section for each TVEpisode in the Prettified JSON. For
example, the values from the lines:
description -> Series exploring behind the scenes at Longleat Estate and Safari Park
name -> Animal Park
name -> Summer 2020
in this part of the Prettified JSON:
@type -> TVEpisode
identifier -> m000lwqj
episodeNumber -> 1
description -> Kate and Ben return to Longleat just as the Covid-19 pandemic forces the park to close.
datePublished -> 2020-08-17
image -> https://ichef.bbci.co.uk/images/ic/480xn/p08n899w.jpg
name -> Episode 1
url -> https://www.bbc.co.uk/programmes/m000lwqj
partOfSeries:
  @type -> TVSeries
  image -> https://ichef.bbci.co.uk/images/ic/480xn/p07jtz7g.jpg
  description -> Series exploring behind the scenes at Longleat Estate and Safari Park
  identifier -> b006w6ns
  name -> Animal Park
  url -> https://www.bbc.co.uk/programmes/b006w6ns
partOfSeason:
  @type -> TVSeason
  position -> 29
  identifier -> m000lwk9
  name -> Summer 2020
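
For illustration, the extraction asked about in questions 3 and 4 might be sketched roughly as follows. This is only a sketch under assumptions, not code from the attached app: it assumes hData is the class-level Collection[] filled by the Extract routine, that hData[1]["@graph"] decodes to a Variant[] of Collections (as Lee's description suggests), and that the fields wanted are plain values rather than nested objects. The names Field, ExtractEpisodeFields, aSeriesName, aSeriesDescription and aSeasonName are hypothetical.

```gambas
' Assumption: hData is the class-level Collection[] filled by Extract.
Private aIdentifier As New String[]
Private aEpisodeNumber As New String[]
Private aDescription As New String[]
Private aDatePublished As New String[]
Private aName As New String[]
Private aURL As New String[]
Private aSeriesName As New String[]         ' hypothetical name
Private aSeriesDescription As New String[]  ' hypothetical name
Private aSeasonName As New String[]         ' hypothetical name

' Hypothetical helper: return one field as a string, or "" when the
' collection or the key is missing.
Private Function Field(cData As Collection, sKey As String) As String

  If IsNull(cData) Then Return ""
  If Not cData.Exist(sKey) Then Return ""
  Return CStr(cData[sKey])

End

Public Sub ExtractEpisodeFields()

  ' Assumption: the JSON array decodes to Variant[] holding Collections.
  Dim aGraph As Variant[] = hData[1]["@graph"]
  Dim cEpisode As Collection

  For Each cEpisode In aGraph
    ' Skip anything in "@graph" that is not an episode.
    If Field(cEpisode, "@type") <> "TVEpisode" Then Continue

    aIdentifier.Add(Field(cEpisode, "identifier"))
    aEpisodeNumber.Add(Field(cEpisode, "episodeNumber"))
    aDescription.Add(Field(cEpisode, "description"))
    aDatePublished.Add(Field(cEpisode, "datePublished"))
    aName.Add(Field(cEpisode, "name"))
    aURL.Add(Field(cEpisode, "url"))

    ' Nested sections are collections themselves; Field returns ""
    ' when an episode has no partOfSeries or partOfSeason.
    aSeriesName.Add(Field(cEpisode["partOfSeries"], "name"))
    aSeriesDescription.Add(Field(cEpisode["partOfSeries"], "description"))
    aSeasonName.Add(Field(cEpisode["partOfSeason"], "name"))
  Next

End
```

Because every array gets exactly one entry per episode, index i in each array refers to the same episode, so aName[i], aSeriesName[i] and aSeasonName[i] could be written out together when building the Episodes text.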
-------------- next part --------------
A non-text attachment was scrubbed...
Name: httpClientExtra.tar.gz
Type: application/gzip
Size: 13465 bytes
Desc: not available
URL: <https://lists.gambas-basic.org/pipermail/user/attachments/20200819/911dbc3c/attachment-0001.gz>