<html>
<head>
<meta http-equiv="Content-Type" content="text/html;
charset=windows-1252">
</head>
<body>
<div class="moz-cite-prefix">On 10/08/2020 18:09,
<a class="moz-txt-link-abbreviated" href="mailto:user-request@lists.gambas-basic.org">user-request@lists.gambas-basic.org</a> wrote:<br>
</div>
<blockquote type="cite"
cite="mid:mailman.0.1597079342.21172.user@lists.gambas-basic.org">On
8/10/20 3:48 AM, John Rose wrote:
<br>
<blockquote type="cite" style="color: #000000;">What you have
there is JSON embedded in a <script> tag inside HTML.
<br>
<blockquote type="cite" style="color: #000000;">Thankfully, this
data has its own <script> tag which is appropriately
<br>
type'd as application/ld+json, so it is not so hard to extract
using
<br>
gb.xml.html and gb.web:
<br>
<br>
Dim h As New HtmlDocument(your file here...)
<br>
Dim el As XmlElement
<br>
Dim data As New Collection[]
<br>
For Each el In h.GetElementsByTagName("script")
<br>
If el.Attributes["type"] <> "application/ld+json"
Then Continue
<br>
data.Add(JSON.Decode(el.TextContent))
<br>
Next
<br>
<br>
</blockquote>
<br>
Thanks, Tobias, for your code. I tried the above code (with the
Gambas Components gb.net, gb.net.curl, gb.web, gb.xml,
gb.xml.html included: was that overkill?) and it produced a
            large file named HTMLandXML.txt. I've extracted some of it to an
            attached file named PartOfHTMLandXML.txt. However, I have no idea
of what to do with it. There are largish blocks of 'code' for
each TV programme. Some sample lines which show the (emboldened)
data that I want to obtain are:
<br>
line 212: <h3 class="programme__titles"><a href="<b
class="moz-txt-star"><span class="moz-txt-tag">*</span><a class="moz-txt-link-freetext" href="https://www.bbc.co.uk/programmes/m00049tf">https://www.bbc.co.uk/programmes/m00049tf</a><span
class="moz-txt-tag">*</span></b>"
<br>
lines 214-215: ><span class="programme__title
delta"><span><b class="moz-txt-star"><span
class="moz-txt-tag">*</span>Have I Got a Bit More News for
You<span class="moz-txt-tag">*</span></b></span></span><span
class="hidden">—</span><span
class="programme__subtitle centi"><span>Series
57</span>, <span>*Episode
2*</span></span></a></h3>
<br>
lines 219-220: <abbr title="*Episode 2 of 9*"><span
datatype="xsd:int">2</span>/<span
class="programme__groupsize">9</span></abbr>
<span>*Guest host Alan Johnson joins Ian Hislop and Paul
Merton for the topical news quiz.*</span>
<br>
I have no idea how to extract this data from the file. Which
Gambas Component should I use? Is there an example/tutorial/book
on how to extract this data in Gambas? I am a real newbie to
HTML & XML etc!
<br>
</blockquote>
<br>
What Tobi gave you is a very simple method of extracting the data.
All that is needed now is to figure out exactly how to retrieve
the relevant data.
<br>
<br>
I've attached a modified version of your program. There are now
three buttons along with their event handlers and two additional
subroutines. Plus, I have modified the Extract subroutine to show
the decoded JSON data. For convenience, I moved hData to the top
of the class file as a global variable.
<br>
<br>
When you click the Extract button, you will see that there are two
collections that have been added to the hData Collection[] (array
of collections). The second one appears to contain the data you
want in the "@graph" element which is another array of
collections.
<br>
<br>
      The episode data can be accessed from hData by specifying the
      second element and then the "@graph" element, i.e.:
<br>
hData[1]["@graph"]
</blockquote>
<p>Thanks, Lee, for the above <b>invaluable</b> help. I don't
pretend to fully understand it but it works.</p>
<p> I have some questions:</p>
    <p>1. Can you recommend a printed book and/or online tutorial to
      help me understand your code (the routines that process the
      HTMLandXML data, and things such as what the "@graph" element is)
      and the concepts behind it? Please remember that I'm a newbie to
      HTML, XML, etc.<br>
</p>
<p>2. Are all of the following Gambas Components required in the
attached httpClientExtra app (your modified httpClient app
      slightly changed by me): gb.net, gb.net.curl, gb.web, gb.xml,
gb.xml.html?<br>
</p>
<p>3. Is there a Gambas component and/or standard coding to extract
values from the Episodes information? I'm thinking of the
identifier, episodeNumber, description, datePublished, name &
url fields. I'd like to extract them into the corresponding
aIdentifier, aEpisodeNumber, aDescription, aDatePublished, aName
& aURL Gambas string arrays, for each Episode's set of data.
Obviously I could code this myself, but it would be nice if there
are already routine(s) written to do this kind of thing.<br>
</p>
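    <p>To show the kind of thing I mean in 3, here is a rough, untested
      sketch of the loop I would otherwise write by hand. It assumes that
      hData is the Collection[] filled by the Extract routine, that
      hData[1]["@graph"] decodes to an array holding one Collection per
      entry, and that each entry carries a "@type" field. The name
      FillEpisodeArrays is made up for illustration.</p>
    <pre>
' Assumption: hData is the class-level Collection[] from the Extract routine.
Private aIdentifier As New String[]
Private aEpisodeNumber As New String[]
Private aDescription As New String[]
Private aDatePublished As New String[]
Private aName As New String[]
Private aURL As New String[]

' Hypothetical helper: copies selected fields of every TVEpisode entry
' into the string arrays above, one array element per episode.
Public Sub FillEpisodeArrays()

  Dim aGraph As Variant[]
  Dim cEpisode As Collection

  ' Assumption: the second decoded block holds the "@graph" array.
  aGraph = hData[1]["@graph"]

  For Each cEpisode In aGraph
    ' Skip anything that is not an episode entry.
    If cEpisode["@type"] = "TVEpisode" Then
      aIdentifier.Add(cEpisode["identifier"])
      ' CStr in case the value was decoded as a number rather than a string.
      aEpisodeNumber.Add(CStr(cEpisode["episodeNumber"]))
      aDescription.Add(cEpisode["description"])
      aDatePublished.Add(cEpisode["datePublished"])
      aName.Add(cEpisode["name"])
      aURL.Add(cEpisode["url"])
    Endif
  Next

End
</pre>
    <p>With this layout, index i in each array would refer to the same
      episode, so aName[i] and aURL[i] belong together. I have not checked
      how missing fields behave, so treat that as an open point.</p>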
<p>4. What coding is required to put the partOfSeries &
partOfSeason sections (from the Prettified JSON data) immediately
after the episode data for each Episode in the Episodes text &
file?<br>
Similar to 3, I would like to also extract some fields
(description & name) in the partOfSeries section and some data
(name) in the partOfSeason section for each TVEpisode in the
Prettified JSON. For example, the values from the lines:<br>
<i>description -> Series exploring behind the scenes at
Longleat Estate and Safari Park</i><i><br>
</i><i>name -> Animal Park</i><i><br>
</i><i>name -> Summer 2020</i><br>
      in this part of the Prettified JSON:<br>
<i>@type -> TVEpisode</i><i><br>
</i><i> identifier -> m000lwqj</i><i><br>
</i><i> episodeNumber -> 1</i><i><br>
</i><i> description -> Kate and Ben return to Longleat just
as the Covid-19 pandemic forces the park to close.</i><i><br>
</i><i> datePublished -> 2020-08-17</i><i><br>
</i><i> image ->
<a class="moz-txt-link-freetext" href="https://ichef.bbci.co.uk/images/ic/480xn/p08n899w.jpg">https://ichef.bbci.co.uk/images/ic/480xn/p08n899w.jpg</a></i><i><br>
</i><i> name -> Episode 1</i><i><br>
</i><i> url -> <a class="moz-txt-link-freetext" href="https://www.bbc.co.uk/programmes/m000lwqj">https://www.bbc.co.uk/programmes/m000lwqj</a></i><i><br>
</i><i> partOfSeries:</i><i><br>
</i><i> @type -> TVSeries</i><i><br>
</i><i> image ->
<a class="moz-txt-link-freetext" href="https://ichef.bbci.co.uk/images/ic/480xn/p07jtz7g.jpg">https://ichef.bbci.co.uk/images/ic/480xn/p07jtz7g.jpg</a></i><i><br>
</i><i> description -> Series exploring behind the scenes
at Longleat Estate and Safari Park</i><i><br>
</i><i> identifier -> b006w6ns</i><i><br>
</i><i> name -> Animal Park</i><i><br>
</i><i> url -> <a class="moz-txt-link-freetext" href="https://www.bbc.co.uk/programmes/b006w6ns">https://www.bbc.co.uk/programmes/b006w6ns</a></i><i><br>
</i><i> partOfSeason:</i><i><br>
</i><i> @type -> TVSeason</i><i><br>
</i><i> position -> 29</i><i><br>
</i><i> identifier -> m000lwk9</i><i><br>
</i><i> name -> Summer 2020</i><i><br>
</i><br>
<br>
</p>
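    <p>For 4, my guess is that the nested sections can be reached in the
      same way as the episode fields, since partOfSeries and partOfSeason
      look like collections inside each episode's collection. Below is a
      minimal, untested sketch under that assumption. The name
      PrintSeriesAndSeason is made up, and cEpisode is assumed to be one
      TVEpisode entry taken from hData[1]["@graph"].</p>
    <pre>
' Hypothetical helper: prints the series and season details of one episode.
' Assumption: cEpisode is a Collection decoded from one "@graph" entry.
Public Sub PrintSeriesAndSeason(cEpisode As Collection)

  Dim cSeries As Collection = cEpisode["partOfSeries"]
  Dim cSeason As Collection = cEpisode["partOfSeason"]

  ' Either section may be absent, so test before reading from it.
  If cSeries Then
    Print "Series description: "; cSeries["description"]
    Print "Series name: "; cSeries["name"]
  Endif

  If cSeason Then
    Print "Season name: "; cSeason["name"]
  Endif

End
</pre>
    <p>Calling this straight after the episode fields are written out
      should keep each episode's series and season details next to its own
      data, which is the ordering I am after.</p>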
</body>
</html>