[Gambas-user] How to Disassemble XML/HTML
John Rose
john.aaron.rose at mailbox.org
Mon Aug 10 09:48:03 CEST 2020
> What you have there is JSON embedded in a <script> tag inside HTML.
> Thankfully, this data has its own <script> tag which is appropriately
> type'd as application/ld+json, so it is not so hard to extract using
> gb.xml.html and gb.web:
>
> Dim h As New HtmlDocument(your file here...)
> Dim el As XmlElement
> Dim data As New Collection[]
> For Each el In h.GetElementsByTagName("script")
> If el.Attributes["type"] <> "application/ld+json" Then Continue
> data.Add(JSON.Decode(el.TextContent))
> Next
>
Thanks, Tobias, for your code. I tried the above code (with the Gambas components gb.net, gb.net.curl, gb.web, gb.xml and gb.xml.html included; was that overkill?) and it produced a large file named HTMLandXML.txt. I've extracted some of it to an attached file named PartOfHTMLandXML.txt. However, I have no idea what to do with it. There are largish blocks of 'code' for each TV programme. Some sample lines showing the data that I want to obtain (emboldened, i.e. between asterisks) are:
line 212: <h3 class="programme__titles"><a href="*https://www.bbc.co.uk/programmes/m00049tf*"
lines 214-215: ><span class="programme__title delta"><span>*Have I Got a Bit More News for You*</span></span><span class="hidden">—</span><span class="programme__subtitle centi"><span>Series 57</span>, <span>*Episode 2*</span></span></a></h3>
lines 219-220: <abbr title="*Episode 2 of 9*"><span datatype="xsd:int">2</span>/<span class="programme__groupsize">9</span></abbr> <span>*Guest host Alan Johnson joins Ian Hislop and Paul Merton for the topical news quiz.*</span>
I have no idea how to extract this data from the file. Which Gambas Component should I use? Is there an example/tutorial/book on how to extract this data in Gambas? I am a real newbie to HTML & XML etc!
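For illustration, here is a minimal and untested sketch of one way the fields marked above might be pulled out. It reuses the calls from Tobias's code (HtmlDocument, GetElementsByTagName, Attributes, TextContent). The file name is only a placeholder, and the sketch assumes two things that are not confirmed in this thread: that XmlElement.GetChildrenByTagName searches all nested elements by default, and that the parser accepts the saved page as it stands. The class and attribute names are taken from the sample lines above.

```gambas
' Untested sketch: needs the gb.xml and gb.xml.html components.
Public Sub Main()

  ' Placeholder path: point this at the full saved schedule page
  Dim h As New HtmlDocument("HTMLandXML.txt")
  Dim prog As XmlElement
  Dim el As XmlElement

  For Each prog In h.GetElementsByTagName("div")
    ' Each programme sits in <div class="programme__body">
    If prog.Attributes["class"] <> "programme__body" Then Continue

    ' The link carries the programme URL; its aria-label holds the
    ' date, time, title and episode in a single string.
    ' Assumption: GetChildrenByTagName looks at all descendants by default.
    For Each el In prog.GetChildrenByTagName("a")
      Print "URL:      "; el.Attributes["href"]
      Print "Label:    "; el.Attributes["aria-label"]
    Next

    ' The synopsis paragraph holds the "2/9" counter and the description
    For Each el In prog.GetChildrenByTagName("p")
      Print "Synopsis: "; Trim(el.TextContent)
    Next

    Print
  Next

End
```

If the aria-label is too coarse, the same loop could instead test the class attribute of each span for "programme__title" or "programme__subtitle" and print its TextContent.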
-------------- next part --------------
</script> </div></div>
<div class="programmes-page text-base programmes-page--flush" role="main">
<div id="programmes-content" class="skip-to-content-destination" tabindex="-1">Main content</div>
<div>
<div class="b-g-p no-margin-vertical islet--vertical br-box-highlight br-highlight-bg-onbg080" id="schedule-header"><div class="grid-wrapper">
<div class="grid 1/2 at bpw2 1/2 at bpe">
<p class="beta">
HD Schedule
</p>
</div>
<div class="grid 1/2 at bpw2 1/2 at bpe">
<div class="text--right at bpw">
<a href="#outlets" class="delta">See other regional BBC One variations</a></div>
</div></div>
<h1>
<span class="visually-hidden">
<span class="outlet">
BBC One HD Schedule
</span>
</span>
<time class="date">
10 - 16 August 2020
</time>
</h1>
</div>
<div class="g-f-l br-box-page">
<div class="week-guide">
<div class="week-guide__table-box hidden visible at bpe">
<table class="week-guide__table br-box-page br-keyline--table">
<thead class="week-guide__table__head g-f-l text-base br-box-highlight">
<tr class="week-guide__table__days-row centi">
<th class="date-list__page week-guide__table__nav br-box-highlight">
<a id="last-week" class="box-link br-page-bg-onbg--hover br-page-linkhover-ontext--hover text--no-ul text--shout" rel="prev" data-href-add-utcoffset="true" href="/schedules/p00fzl6n/2020/w32">
<svg tabindex="-1" aria-hidden="true" focusable="false" class="gelicon gelicon--centi gelicon--leading"><use xlink:href="#gelicon--basics--previous" /></svg>Last
</a>
</th>
<th scope="col" class="date-list__page week-guide__table__day day-0 br-box-page">
<span class="box-link date-list date-list__page--current text--shout">
<span class="date-list__item-line1">Today</span>
<span class="date-list__item-line2">10 Aug</span>
</span>
</th>
<th scope="col" class="date-list__page week-guide__table__day day-1">
<a class="box-link br-page-bg-onbg--hover br-page-linkhover-ontext--hover date-list text--no-ul text--shout"
href="/schedules/p00fzl6n/2020/08/11"
data-href-add-utcoffset="true"
aria-label="Tuesday 11 August"
>
<span class="date-list__item-line1">Tue</span>
<span class="date-list__item-line2">11 Aug</span>
</a>
</th>
<th scope="col" class="date-list__page week-guide__table__day day-2">
<a class="box-link br-page-bg-onbg--hover br-page-linkhover-ontext--hover date-list text--no-ul text--shout"
href="/schedules/p00fzl6n/2020/08/12"
data-href-add-utcoffset="true"
aria-label="Wednesday 12 August"
>
<span class="date-list__item-line1">Wed</span>
<span class="date-list__item-line2">12 Aug</span>
</a>
</th>
<th scope="col" class="date-list__page week-guide__table__day day-3">
<a class="box-link br-page-bg-onbg--hover br-page-linkhover-ontext--hover date-list text--no-ul text--shout"
href="/schedules/p00fzl6n/2020/08/13"
data-href-add-utcoffset="true"
aria-label="Thursday 13 August"
>
<span class="date-list__item-line1">Thu</span>
<span class="date-list__item-line2">13 Aug</span>
</a>
</th>
<th scope="col" class="date-list__page week-guide__table__day day-4">
<a class="box-link br-page-bg-onbg--hover br-page-linkhover-ontext--hover date-list text--no-ul text--shout"
href="/schedules/p00fzl6n/2020/08/14"
data-href-add-utcoffset="true"
aria-label="Friday 14 August"
>
<span class="date-list__item-line1">Fri</span>
<span class="date-list__item-line2">14 Aug</span>
</a>
</th>
<th scope="col" class="date-list__page week-guide__table__day day-5">
<a class="box-link br-page-bg-onbg--hover br-page-linkhover-ontext--hover date-list text--no-ul text--shout"
href="/schedules/p00fzl6n/2020/08/15"
data-href-add-utcoffset="true"
aria-label="Saturday 15 August"
>
<span class="date-list__item-line1">Sat</span>
<span class="date-list__item-line2">15 Aug</span>
</a>
</th>
<th scope="col" class="date-list__page week-guide__table__day day-6">
<a class="box-link br-page-bg-onbg--hover br-page-linkhover-ontext--hover date-list text--no-ul text--shout"
href="/schedules/p00fzl6n/2020/08/16"
data-href-add-utcoffset="true"
aria-label="Sunday 16 August"
>
<span class="date-list__item-line1">Sun</span>
<span class="date-list__item-line2">16 Aug</span>
</a>
</th>
<th class="date-list__page box-link week-guide__table__nav br-box-highlight trail">
<a id="next-week" class="br-page-bg-onbg--hover br-page-linkhover-ontext--hover text--no-ul text--shout" rel="next" data-href-add-utcoffset="true" href="/schedules/p00fzl6n/2020/w34">
Next<svg tabindex="-1" aria-hidden="true" focusable="false" class="gelicon gelicon--centi gelicon--trailing"><use xlink:href="#gelicon--basics--next" /></svg>
</a>
</th>
</tr>
</thead>
<tbody class="week-guide__table__body" data-page-time="2020/08/10">
<tr class="week-guide__table__hour-row">
<th class="hour" scope="row" data-timezone="true" content="2020-08-10T00:00:00+01:00">
<span class='week-guide__table__hour timezone--time'>00:00</span>
</th>
<td class="day-0 br-box-page">
<ol class="list-unstyled">
<li class="week-guide__table__item">
<div class="broadcast broadcast--has-ended block-link block-link--steal broadcast--grid highlight-box--list br-keyline br-blocklink-page br-page-linkhover-onbg015--hover">
<div class="grid-wrapper">
<div class="broadcast__info grid ">
<h2 class="broadcast__time gamma" data-timezone="true" datatype="xsd:dateTime" content="2020-08-10T00:30:00+01:00">
<span class="timezone--time">00:30</span></h2>
</div>
<div class="broadcast__programme grid " rev="publication"><div class="programme programme--tv programme--episode block-link" data-pid="m000lqfy">
<div class="programme__body">
<h3 class="programme__titles"><a href="https://www.bbc.co.uk/programmes/m000lqfy"
class="br-blocklink__link block-link__target"
aria-label="10 Aug 00:30: Weather for the Week Ahead, 10/08/2020"
><span class="programme__title delta"><span>Weather for the Week Ahead</span></span><span class="hidden">—</span><span class="programme__subtitle centi"><span>10/08/2020</span></span></a></h3>
<p class="programme__synopsis text--subtle centi">
<span>Detailed weather forecast.</span>
</p>
</div>
</div>
</div>
</div>
</div>
</li>
<li class="week-guide__table__item">
<div class="broadcast broadcast--has-ended block-link block-link--steal broadcast--grid highlight-box--list br-keyline br-blocklink-page br-page-linkhover-onbg015--hover">
<div class="grid-wrapper">
<div class="broadcast__info grid ">
<h2 class="broadcast__time gamma" data-timezone="true" datatype="xsd:dateTime" content="2020-08-10T00:35:00+01:00">
<span class="timezone--time">00:35</span></h2>
</div>
<div class="broadcast__programme grid " rev="publication"><div class="programme programme--tv programme--episode block-link" data-pid="m000lqg0">
<div class="programme__body">
<h3 class="programme__titles"><a href="https://www.bbc.co.uk/programmes/m000lqg0"
class="br-blocklink__link block-link__target"
aria-label="10 Aug 00:35: Joins BBC News, 10/08/2020"
><span class="programme__title delta"><span>Joins BBC News</span></span><span class="hidden">—</span><span class="programme__subtitle centi"><span>10/08/2020</span></span></a></h3>
<p class="programme__synopsis text--subtle centi">
<span>BBC One joins the BBC's rolling news channel for a night of news.</span>
</p>
</div>
</div>
</div>
</div>
</div>
</li>
</ol>
</td>
<td class="day-1 br-box-page br-box-subtle br-subtle-bg-onbg080">
<ol class="list-unstyled">
<li class="week-guide__table__item">
<div class="broadcast block-link block-link--steal broadcast--grid highlight-box--list br-keyline br-blocklink-page br-page-linkhover-onbg015--hover">
<div class="grid-wrapper">
<div class="broadcast__info grid ">
<h2 class="broadcast__time gamma" data-timezone="true" datatype="xsd:dateTime" content="2020-08-11T00:10:00+01:00">
<span class="timezone--time">00:10</span></h2>
</div>
<div class="broadcast__programme grid " rev="publication"><div class="programme programme--tv programme--episode block-link" data-pid="m00049tf">
<div class="programme__body">
<h3 class="programme__titles"><a href="https://www.bbc.co.uk/programmes/m00049tf"
class="br-blocklink__link block-link__target"
aria-label="11 Aug 00:10: Have I Got a Bit More News for You, Series 57, Episode 2"
><span class="programme__title delta"><span>Have I Got a Bit More News for You</span></span><span class="hidden">—</span><span class="programme__subtitle centi"><span>Series 57</span>, <span>Episode 2</span></span></a></h3>
<p class="programme__synopsis text--subtle centi">
<abbr title="Episode 2 of 9"><span datatype="xsd:int">2</span>/<span class="programme__groupsize">9</span></abbr> <span>Guest host Alan Johnson joins Ian Hislop and Paul Merton for the topical news quiz.</span>
<abbr class="repeat" title="Repeat">(R)</abbr> </p>
</div>
</div>
</div>
</div>
</div>
</li>
<li class="week-guide__table__item">
<div class="broadcast block-link block-link--steal broadcast--grid highlight-box--list br-keyline br-blocklink-page br-page-linkhover-onbg015--hover">
<div class="grid-wrapper">
<div class="broadcast__info grid ">
<h2 class="broadcast__time gamma" data-timezone="true" datatype="xsd:dateTime" content="2020-08-11T00:55:00+01:00">
<span class="timezone--time">00:55</span></h2>
</div>
<div class="broadcast__programme grid " rev="publication"><div class="programme programme--tv programme--episode block-link" data-pid="m000lqf0">
<div class="programme__body">
<h3 class="programme__titles"><a href="https://www.bbc.co.uk/programmes/m000lqf0"
class="br-blocklink__link block-link__target"
aria-label="11 Aug 00:55: Weather for the Week Ahead, 11/08/2020"
><span class="programme__title delta"><span>Weather for the Week Ahead</span></span><span class="hidden">—</span><span class="programme__subtitle centi"><span>11/08/2020</span></span></a></h3>
<p class="programme__synopsis text--subtle centi">
<span>Detailed weather forecast.</span>
</p>
</div>
</div>
</div>
</div>
</div>
</li>
</ol>
</td>
<td class="day-2 br-box-page br-box-subtle br-subtle-bg-onbg080">
<ol class="list-unstyled">
<li class="week-guide__table__item">
<div class="broadcast block-link block-link--steal broadcast--grid highlight-box--list br-keyline br-blocklink-page br-page-linkhover-onbg015--hover">
<div class="grid-wrapper">
<div class="broadcast__info grid ">
<h2 class="broadcast__time gamma" data-timezone="true" datatype="xsd:dateTime" content="2020-08-12T00:20:00+01:00">
<span class="timezone--time">00:20</span></h2>
</div>
<div class="broadcast__programme grid " rev="publication"><div class="programme programme--tv programme--episode block-link" data-pid="m000ljnb">
<div class="programme__body">
<h3 class="programme__titles"><a href="https://www.bbc.co.uk/programmes/m000ljnb"
class="br-blocklink__link block-link__target"
aria-label="12 Aug 00:20: Surviving the Virus: My Brother & Me"
><span class="programme__title delta"><span>Surviving the Virus: My Brother & Me</span></span></a></h3>
<p class="programme__synopsis text--subtle centi">
<span>Chris and Xand van Tulleken, both doctors, share their personal experiences of Covid-19.</span>
<abbr class="repeat" title="Repeat">(R)</abbr> </p>
</div>
</div>
</div>
</div>
</div>
</li>
</ol>
</td>
<td class="day-3 br-box-page br-box-subtle br-subtle-bg-onbg080">
<ol class="list-unstyled">
<li class="week-guide__table__item">
<div class="broadcast block-link block-link--steal broadcast--grid highlight-box--list br-keyline br-blocklink-page br-page-linkhover-onbg015--hover">
<div class="grid-wrapper">
<div class="broadcast__info grid ">
<h2 class="broadcast__time gamma" data-timezone="true" datatype="xsd:dateTime" content="2020-08-13T00:55:00+01:00">
<span class="timezone--time">00:55</span></h2>
</div>
<div class="broadcast__programme grid " rev="publication"><div class="programme programme--tv programme--episode block-link" data-pid="m000lq5x">
<div class="programme__body">
<h3 class="programme__titles"><a href="https://www.bbc.co.uk/programmes/m000lq5x"
class="br-blocklink__link block-link__target"
aria-label="13 Aug 00:55: Weather for the Week Ahead, 13/08/2020"
><span class="programme__title delta"><span>Weather for the Week Ahead</span></span><span class="hidden">—</span><span class="programme__subtitle centi"><span>13/08/2020</span></span></a></h3>
<p class="programme__synopsis text--subtle centi">
<span>Detailed weather forecast.</span>
</p>
</div>
</div>
</div>
</div>
</div>
</li>
</ol>
</td>
<td class="day-4 br-box-page br-box-subtle br-subtle-bg-onbg080">
<ol class="list-unstyled">
<li class="week-guide__table__item">
<div class="broadcast block-link block-link--steal broadcast--grid highlight-box--list br-keyline br-blocklink-page br-page-linkhover-onbg015--hover">
<div class="grid-wrapper">
<div class="broadcast__info grid ">
<h2 class="broadcast__time gamma" data-timezone="true" datatype="xsd:dateTime" content="2020-08-14T00:25:00+01:00">
<span class="timezone--time">00:25</span></h2>
</div>
<div class="broadcast__programme grid " rev="publication"><div class="programme programme--tv programme--episode block-link" data-pid="b06h7lcq">
<div class="programme__body">
<h3 class="programme__titles"><a href="https://www.bbc.co.uk/programmes/b06h7lcq"
class="br-blocklink__link block-link__target"
aria-label="14 Aug 00:25: New Tricks, Series 12, The Crazy Gang"
><span class="programme__title delta"><span>New Tricks</span></span><span class="hidden">—</span><span class="programme__subtitle centi"><span>Series 12</span>, <span>The Crazy Gang</span></span></a></h3>
<p class="programme__synopsis text--subtle centi">
<abbr title="Episode 10 of 10"><span datatype="xsd:int">10</span>/<span class="programme__groupsize">10</span></abbr> <span>The team investigate the bloody murder of a political activist 15 years ago.</span>
<abbr class="repeat" title="Repeat">(R)</abbr> </p>
</div>
</div>
</div>
</div>
</div>
</li>
</ol>
</td>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: httpClient.tar.gz
Type: application/gzip
Size: 12864 bytes
Desc: not available
URL: <https://lists.gambas-basic.org/pipermail/user/attachments/20200810/844f3758/attachment-0001.gz>