[Gambas-user] How to Disassemble XML/HTML

John Rose john.aaron.rose at mailbox.org
Mon Aug 10 09:48:03 CEST 2020


> What you have there is JSON embedded in a <script> tag inside HTML.
> Thankfully, this data has its own <script> tag which is appropriately
> type'd as application/ld+json, so it is not so hard to extract using
> gb.xml.html and gb.web:
>
>    Dim h As New HtmlDocument(your file here...)
>    Dim el As XmlElement
>    Dim data As New Collection[]
>    For Each el In h.GetElementsByTagName("script")
>      If el.Attributes["type"] <> "application/ld+json" Then Continue
>      data.Add(JSON.Decode(el.TextContent))
>    Next
>

Thanks, Tobias, for your code. I tried the above code with the Gambas components gb.net, gb.net.curl, gb.web, gb.xml and gb.xml.html included (was that overkill?), and it produced a large file named HTMLandXML.txt. I've extracted part of it into an attached file named PartOfHTMLandXML.txt. However, I have no idea what to do with it. There is a largish block of 'code' for each TV programme. Some sample lines, with the data that I want to obtain shown in bold (between asterisks), are:
line 212:  <h3 class="programme__titles"><a href="*https://www.bbc.co.uk/programmes/m00049tf*"
lines 214-215:  ><span class="programme__title delta"><span>*Have I Got a Bit More News for You*</span></span><span class="hidden">—</span><span class="programme__subtitle centi"><span>Series 57</span>, <span>*Episode 2*</span></span></a></h3>
lines 219-220:  <abbr title="*Episode 2 of 9*"><span datatype="xsd:int">2</span>/<span class="programme__groupsize">9</span></abbr> <span>*Guest host Alan Johnson joins Ian Hislop and Paul Merton for the topical news quiz.*</span>
I have no idea how to extract this data from the file. Which Gambas Component should I use? Is there an example/tutorial/book on how to extract this data in Gambas? I am a real newbie to HTML & XML etc!
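
One possible starting point is sketched here. It uses the same gb.xml.html calls as the quoted code (GetElementsByTagName, Attributes, TextContent), plus XmlElement.GetChildrenByTagName from gb.xml. It is untested and rests on several assumptions: the file name is simply that of the attachment, HtmlDocument is assumed to accept this partial page, and GetChildrenByTagName is assumed to search all descendants by default and to return them in document order.

```gambas
' Untested sketch. Needs the gb.xml and gb.xml.html components.
Public Sub Main()

  ' Assumption: the attached extract sits in the current directory.
  Dim hDoc As New HtmlDocument("PartOfHTMLandXML.txt")
  Dim hBody As XmlElement
  Dim hEl As XmlElement
  Dim aFound As XmlElement[]
  Dim sClass As String

  ' Each programme sits in its own <div class="programme__body">.
  For Each hBody In hDoc.GetElementsByTagName("div")
    If hBody.Attributes["class"] <> "programme__body" Then Continue

    ' Programme URL: the href of the <a> inside the <h3> heading.
    aFound = hBody.GetChildrenByTagName("a")
    If aFound.Count > 0 Then Print "URL:      "; aFound[0].Attributes["href"]

    ' Title and subtitle: spans told apart by the start of their class attribute.
    For Each hEl In hBody.GetChildrenByTagName("span")
      sClass = hEl.Attributes["class"]
      If InStr(sClass, "programme__title ") = 1 Then Print "Title:    "; hEl.TextContent
      If InStr(sClass, "programme__subtitle ") = 1 Then Print "Subtitle: "; hEl.TextContent
    Next

    ' Episode position: the title of the <abbr> that is not the "(R)" repeat marker.
    For Each hEl In hBody.GetChildrenByTagName("abbr")
      If hEl.Attributes["class"] <> "repeat" Then Print "Episode:  "; hEl.Attributes["title"]
    Next

    ' Synopsis: taken as the last <span> inside the <p> (assumes document order).
    For Each hEl In hBody.GetChildrenByTagName("p")
      aFound = hEl.GetChildrenByTagName("span")
      If aFound.Count > 0 Then Print "Synopsis: "; aFound[aFound.Max].TextContent
    Next

    Print
  Next

End
```

For the sample block above, the intent is to print the programme URL, "Have I Got a Bit More News for You", "Series 57, Episode 2", "Episode 2 of 9" and the synopsis sentence. Programmes without an episode <abbr> would simply have no "Episode:" line.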

-------------- next part --------------
    </script>                    </div></div>
                

        <div class="programmes-page text-base programmes-page--flush" role="main">
            <div id="programmes-content" class="skip-to-content-destination" tabindex="-1">Main content</div>
                        <div>
                <div class="b-g-p no-margin-vertical islet--vertical br-box-highlight br-highlight-bg-onbg080" id="schedule-header"><div class="grid-wrapper">
        <div class="grid 1/2 at bpw2 1/2 at bpe">
            <p class="beta">
                                    HD Schedule
                </p>
        </div>

                    <div class="grid 1/2 at bpw2 1/2 at bpe">
                <div class="text--right at bpw">
                                                                    <a href="#outlets" class="delta">See other regional BBC One variations</a></div>
            </div></div>
    <h1>
            <span class="visually-hidden">
                <span class="outlet">
                    BBC One HD Schedule
                </span>
            </span>
            <time class="date">
                10 - 16 August 2020
            </time>
        </h1>
    </div>
    
    <div class="g-f-l br-box-page">
            <div class="week-guide">
    <div class="week-guide__table-box hidden visible at bpe">
        <table class="week-guide__table br-box-page br-keyline--table">
            <thead class="week-guide__table__head g-f-l text-base br-box-highlight">
            <tr class="week-guide__table__days-row centi">
                <th class="date-list__page week-guide__table__nav br-box-highlight">
                    <a id="last-week" class="box-link br-page-bg-onbg--hover br-page-linkhover-ontext--hover text--no-ul text--shout" rel="prev" data-href-add-utcoffset="true" href="/schedules/p00fzl6n/2020/w32">
                        <svg tabindex="-1" aria-hidden="true" focusable="false" class="gelicon gelicon--centi gelicon--leading"><use xlink:href="#gelicon--basics--previous" /></svg>Last
                    </a>
                </th>
                                        <th scope="col" class="date-list__page week-guide__table__day day-0 br-box-page">
        <span class="box-link date-list date-list__page--current text--shout">
            <span class="date-list__item-line1">Today</span>
            <span class="date-list__item-line2">10 Aug</span>
        </span>
    </th>

                                        <th scope="col" class="date-list__page week-guide__table__day day-1">
        <a class="box-link br-page-bg-onbg--hover br-page-linkhover-ontext--hover date-list text--no-ul text--shout"
           href="/schedules/p00fzl6n/2020/08/11"
           data-href-add-utcoffset="true"
           aria-label="Tuesday 11 August"
        >
            <span class="date-list__item-line1">Tue</span>
            <span class="date-list__item-line2">11 Aug</span>
        </a>
    </th>

                                        <th scope="col" class="date-list__page week-guide__table__day day-2">
        <a class="box-link br-page-bg-onbg--hover br-page-linkhover-ontext--hover date-list text--no-ul text--shout"
           href="/schedules/p00fzl6n/2020/08/12"
           data-href-add-utcoffset="true"
           aria-label="Wednesday 12 August"
        >
            <span class="date-list__item-line1">Wed</span>
            <span class="date-list__item-line2">12 Aug</span>
        </a>
    </th>

                                        <th scope="col" class="date-list__page week-guide__table__day day-3">
        <a class="box-link br-page-bg-onbg--hover br-page-linkhover-ontext--hover date-list text--no-ul text--shout"
           href="/schedules/p00fzl6n/2020/08/13"
           data-href-add-utcoffset="true"
           aria-label="Thursday 13 August"
        >
            <span class="date-list__item-line1">Thu</span>
            <span class="date-list__item-line2">13 Aug</span>
        </a>
    </th>

                                        <th scope="col" class="date-list__page week-guide__table__day day-4">
        <a class="box-link br-page-bg-onbg--hover br-page-linkhover-ontext--hover date-list text--no-ul text--shout"
           href="/schedules/p00fzl6n/2020/08/14"
           data-href-add-utcoffset="true"
           aria-label="Friday 14 August"
        >
            <span class="date-list__item-line1">Fri</span>
            <span class="date-list__item-line2">14 Aug</span>
        </a>
    </th>

                                        <th scope="col" class="date-list__page week-guide__table__day day-5">
        <a class="box-link br-page-bg-onbg--hover br-page-linkhover-ontext--hover date-list text--no-ul text--shout"
           href="/schedules/p00fzl6n/2020/08/15"
           data-href-add-utcoffset="true"
           aria-label="Saturday 15 August"
        >
            <span class="date-list__item-line1">Sat</span>
            <span class="date-list__item-line2">15 Aug</span>
        </a>
    </th>

                                        <th scope="col" class="date-list__page week-guide__table__day day-6">
        <a class="box-link br-page-bg-onbg--hover br-page-linkhover-ontext--hover date-list text--no-ul text--shout"
           href="/schedules/p00fzl6n/2020/08/16"
           data-href-add-utcoffset="true"
           aria-label="Sunday 16 August"
        >
            <span class="date-list__item-line1">Sun</span>
            <span class="date-list__item-line2">16 Aug</span>
        </a>
    </th>

                                <th class="date-list__page box-link week-guide__table__nav br-box-highlight trail">
                    <a id="next-week" class="br-page-bg-onbg--hover br-page-linkhover-ontext--hover text--no-ul text--shout" rel="next" data-href-add-utcoffset="true" href="/schedules/p00fzl6n/2020/w34">
                        Next<svg tabindex="-1" aria-hidden="true" focusable="false" class="gelicon gelicon--centi gelicon--trailing"><use xlink:href="#gelicon--basics--next" /></svg>
                    </a>
                </th>
            </tr>
            </thead>
            <tbody class="week-guide__table__body" data-page-time="2020/08/10">
                            <tr class="week-guide__table__hour-row">
                    <th class="hour" scope="row" data-timezone="true" content="2020-08-10T00:00:00+01:00">
                        <span class='week-guide__table__hour timezone--time'>00:00</span>
                    </th>
                                                <td class="day-0 br-box-page">
            <ol class="list-unstyled">
                            <li class="week-guide__table__item">
                    <div class="broadcast broadcast--has-ended block-link block-link--steal broadcast--grid highlight-box--list br-keyline br-blocklink-page br-page-linkhover-onbg015--hover">
    <div class="grid-wrapper">
        <div class="broadcast__info grid ">
                        <h2 class="broadcast__time gamma" data-timezone="true" datatype="xsd:dateTime" content="2020-08-10T00:30:00+01:00">
                <span class="timezone--time">00:30</span></h2>

            </div>
        <div class="broadcast__programme grid " rev="publication"><div class="programme programme--tv programme--episode block-link" data-pid="m000lqfy">
        


    <div class="programme__body">
                <h3 class="programme__titles"><a href="https://www.bbc.co.uk/programmes/m000lqfy"
           class="br-blocklink__link block-link__target"
           aria-label="10 Aug 00:30: Weather for the Week Ahead, 10/08/2020"           
        ><span class="programme__title delta"><span>Weather for the Week Ahead</span></span><span class="hidden">—</span><span class="programme__subtitle centi"><span>10/08/2020</span></span></a></h3>

                
    <p class="programme__synopsis text--subtle centi">
                <span>Detailed weather forecast.</span>
            </p>



    </div>
</div>
</div>
    </div>
</div>


                </li>
                            <li class="week-guide__table__item">
                    <div class="broadcast broadcast--has-ended block-link block-link--steal broadcast--grid highlight-box--list br-keyline br-blocklink-page br-page-linkhover-onbg015--hover">
    <div class="grid-wrapper">
        <div class="broadcast__info grid ">
                        <h2 class="broadcast__time gamma" data-timezone="true" datatype="xsd:dateTime" content="2020-08-10T00:35:00+01:00">
                <span class="timezone--time">00:35</span></h2>

            </div>
        <div class="broadcast__programme grid " rev="publication"><div class="programme programme--tv programme--episode block-link" data-pid="m000lqg0">
        


    <div class="programme__body">
                <h3 class="programme__titles"><a href="https://www.bbc.co.uk/programmes/m000lqg0"
           class="br-blocklink__link block-link__target"
           aria-label="10 Aug 00:35: Joins BBC News, 10/08/2020"           
        ><span class="programme__title delta"><span>Joins BBC News</span></span><span class="hidden">—</span><span class="programme__subtitle centi"><span>10/08/2020</span></span></a></h3>

                
    <p class="programme__synopsis text--subtle centi">
                <span>BBC One joins the BBC's rolling news channel for a night of news.</span>
            </p>



    </div>
</div>
</div>
    </div>
</div>


                </li>
                    </ol>
    </td>

                                                <td class="day-1 br-box-page br-box-subtle br-subtle-bg-onbg080">
            <ol class="list-unstyled">
                            <li class="week-guide__table__item">
                    <div class="broadcast block-link block-link--steal broadcast--grid highlight-box--list br-keyline br-blocklink-page br-page-linkhover-onbg015--hover">
    <div class="grid-wrapper">
        <div class="broadcast__info grid ">
                        <h2 class="broadcast__time gamma" data-timezone="true" datatype="xsd:dateTime" content="2020-08-11T00:10:00+01:00">
                <span class="timezone--time">00:10</span></h2>

            </div>
        <div class="broadcast__programme grid " rev="publication"><div class="programme programme--tv programme--episode block-link" data-pid="m00049tf">
        


    <div class="programme__body">
                <h3 class="programme__titles"><a href="https://www.bbc.co.uk/programmes/m00049tf"
           class="br-blocklink__link block-link__target"
           aria-label="11 Aug 00:10: Have I Got a Bit More News for You, Series 57, Episode 2"           
        ><span class="programme__title delta"><span>Have I Got a Bit More News for You</span></span><span class="hidden">—</span><span class="programme__subtitle centi"><span>Series 57</span>, <span>Episode 2</span></span></a></h3>

                
    <p class="programme__synopsis text--subtle centi">
                                <abbr title="Episode 2 of 9"><span datatype="xsd:int">2</span>/<span class="programme__groupsize">9</span></abbr>                <span>Guest host Alan Johnson joins Ian Hislop and Paul Merton for the topical news quiz.</span>
        <abbr class="repeat" title="Repeat">(R)</abbr>    </p>



    </div>
</div>
</div>
    </div>
</div>


                </li>
                            <li class="week-guide__table__item">
                    <div class="broadcast block-link block-link--steal broadcast--grid highlight-box--list br-keyline br-blocklink-page br-page-linkhover-onbg015--hover">
    <div class="grid-wrapper">
        <div class="broadcast__info grid ">
                        <h2 class="broadcast__time gamma" data-timezone="true" datatype="xsd:dateTime" content="2020-08-11T00:55:00+01:00">
                <span class="timezone--time">00:55</span></h2>

            </div>
        <div class="broadcast__programme grid " rev="publication"><div class="programme programme--tv programme--episode block-link" data-pid="m000lqf0">
        


    <div class="programme__body">
                <h3 class="programme__titles"><a href="https://www.bbc.co.uk/programmes/m000lqf0"
           class="br-blocklink__link block-link__target"
           aria-label="11 Aug 00:55: Weather for the Week Ahead, 11/08/2020"           
        ><span class="programme__title delta"><span>Weather for the Week Ahead</span></span><span class="hidden">—</span><span class="programme__subtitle centi"><span>11/08/2020</span></span></a></h3>

                
    <p class="programme__synopsis text--subtle centi">
                <span>Detailed weather forecast.</span>
            </p>



    </div>
</div>
</div>
    </div>
</div>


                </li>
                    </ol>
    </td>

                                                <td class="day-2 br-box-page br-box-subtle br-subtle-bg-onbg080">
            <ol class="list-unstyled">
                            <li class="week-guide__table__item">
                    <div class="broadcast block-link block-link--steal broadcast--grid highlight-box--list br-keyline br-blocklink-page br-page-linkhover-onbg015--hover">
    <div class="grid-wrapper">
        <div class="broadcast__info grid ">
                        <h2 class="broadcast__time gamma" data-timezone="true" datatype="xsd:dateTime" content="2020-08-12T00:20:00+01:00">
                <span class="timezone--time">00:20</span></h2>

            </div>
        <div class="broadcast__programme grid " rev="publication"><div class="programme programme--tv programme--episode block-link" data-pid="m000ljnb">
        


    <div class="programme__body">
                <h3 class="programme__titles"><a href="https://www.bbc.co.uk/programmes/m000ljnb"
           class="br-blocklink__link block-link__target"
           aria-label="12 Aug 00:20: Surviving the Virus: My Brother & Me"           
        ><span class="programme__title delta"><span>Surviving the Virus: My Brother & Me</span></span></a></h3>

                
    <p class="programme__synopsis text--subtle centi">
                <span>Chris and Xand van Tulleken, both doctors, share their personal experiences of Covid-19.</span>
        <abbr class="repeat" title="Repeat">(R)</abbr>    </p>



    </div>
</div>
</div>
    </div>
</div>


                </li>
                    </ol>
    </td>

                                                <td class="day-3 br-box-page br-box-subtle br-subtle-bg-onbg080">
            <ol class="list-unstyled">
                            <li class="week-guide__table__item">
                    <div class="broadcast block-link block-link--steal broadcast--grid highlight-box--list br-keyline br-blocklink-page br-page-linkhover-onbg015--hover">
    <div class="grid-wrapper">
        <div class="broadcast__info grid ">
                        <h2 class="broadcast__time gamma" data-timezone="true" datatype="xsd:dateTime" content="2020-08-13T00:55:00+01:00">
                <span class="timezone--time">00:55</span></h2>

            </div>
        <div class="broadcast__programme grid " rev="publication"><div class="programme programme--tv programme--episode block-link" data-pid="m000lq5x">
        


    <div class="programme__body">
                <h3 class="programme__titles"><a href="https://www.bbc.co.uk/programmes/m000lq5x"
           class="br-blocklink__link block-link__target"
           aria-label="13 Aug 00:55: Weather for the Week Ahead, 13/08/2020"           
        ><span class="programme__title delta"><span>Weather for the Week Ahead</span></span><span class="hidden">—</span><span class="programme__subtitle centi"><span>13/08/2020</span></span></a></h3>

                
    <p class="programme__synopsis text--subtle centi">
                <span>Detailed weather forecast.</span>
            </p>



    </div>
</div>
</div>
    </div>
</div>


                </li>
                    </ol>
    </td>

                                                <td class="day-4 br-box-page br-box-subtle br-subtle-bg-onbg080">
            <ol class="list-unstyled">
                            <li class="week-guide__table__item">
                    <div class="broadcast block-link block-link--steal broadcast--grid highlight-box--list br-keyline br-blocklink-page br-page-linkhover-onbg015--hover">
    <div class="grid-wrapper">
        <div class="broadcast__info grid ">
                        <h2 class="broadcast__time gamma" data-timezone="true" datatype="xsd:dateTime" content="2020-08-14T00:25:00+01:00">
                <span class="timezone--time">00:25</span></h2>

            </div>
        <div class="broadcast__programme grid " rev="publication"><div class="programme programme--tv programme--episode block-link" data-pid="b06h7lcq">
        


    <div class="programme__body">
                <h3 class="programme__titles"><a href="https://www.bbc.co.uk/programmes/b06h7lcq"
           class="br-blocklink__link block-link__target"
           aria-label="14 Aug 00:25: New Tricks, Series 12, The Crazy Gang"           
        ><span class="programme__title delta"><span>New Tricks</span></span><span class="hidden">—</span><span class="programme__subtitle centi"><span>Series 12</span>, <span>The Crazy Gang</span></span></a></h3>

                
    <p class="programme__synopsis text--subtle centi">
                                <abbr title="Episode 10 of 10"><span datatype="xsd:int">10</span>/<span class="programme__groupsize">10</span></abbr>                <span>The team investigate the bloody murder of a political activist 15 years ago.</span>
        <abbr class="repeat" title="Repeat">(R)</abbr>    </p>



    </div>
</div>
</div>
    </div>
</div>


                </li>
                    </ol>
    </td>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: httpClient.tar.gz
Type: application/gzip
Size: 12864 bytes
Desc: not available
URL: <https://lists.gambas-basic.org/pipermail/user/attachments/20200810/844f3758/attachment-0001.gz>

