[Gambas-user] How to Disassemble XML/HTML

Tobias Boege taboege at gmail.com
Thu Aug 6 12:26:26 CEST 2020


On Thu, 06 Aug 2020, John Rose wrote:
> I've adapted the attached httpClient app: I forget where I found it or
> perhaps someone pointed me to it: just call it a senior moment! It now gets
> information from a BBC iPlayer server
> (https://www.bbc.co.uk/bbcfour/programschedules/p00fzl6b/2020/w30) on BBC
> programmes available on BBC Four or that are shortly to be broadcast. The
> app's output is 'printed' to the Console and displayed in a Text Area. I've
> attached part of the output (copied to httpClient.Programmes.Output.txt
> file) as well as the app's compressed source (renamed from httpClient.tar.gz
> to httpClient.txt as gmail won't let me send tar.gz files).
> 
> Here is an extract (lines 56-66) of the large 'block' of data (i.e. the
> second part of the above output file) that I want to extract into
> appropriately named Gambas string variables (for the type, identifier,
> episodeNumber, description, datePublished, name, url of the first type's
> data; type, description, name, url of the second type's data; type & name of
> the third type's data):
> {"@type":"TVEpisode","identifier":"b01hpfhz","episodeNumber":2,
> "description":"Francesco da Mosto looks at Italy as the land of adventure and ambition.",
> "datePublished":"2012-05-10",
> "image":"https:\/\/ichef.bbci.co.uk\/images\/ic\/480xn\/p01hfnsw.jpg",
> "name":"Land of Fortune","url":"https:\/\/www.bbc.co.uk\/programmes\/b01hpfhz",
> "partOfSeries":{"@type":"TVSeries","image":"https:\/\/ichef.bbci.co.uk\/images\/ic\/480xn\/p01l8mwd.jpg",
> "description":"Francesco da Mosto takes to the Italian road again in search of Shakespeare in Italy",
> "identifier":"b01hl468","name":"Shakespeare in Italy","url":"https:\/\/www.bbc.co.uk\/programmes\/b01hl468"},
> "publication":{"@type":"BroadcastEvent","startDate":"2020-07-19T23:10:00+00:00","endDate":"2020-07-20T00:10:00+00:00",
> "publishedOn":{"@type":"BroadcastService",
> "broadcaster":{"@type":"Organization","legalName":"British Broadcasting Corporation",
> "logo":{"@type":"ImageObject","url":"https:\/\/ichef.bbci.co.uk\/images\/ic\/1200x675\/p01tqv8z.png"},
> "name":"BBC","url":"https:\/\/www.bbc.co.uk\/"},"name":"BBC Four"}}},
> 
> Does anyone know of a tutorial and/or example(s) to do similar tasks?
> 
> PS a long shot request: I would have liked to attach the Ruby(?) code for
> the get-iplayer command-line app. I don't understand it at all. It's too big
> to attach for the Gambas User Lists: it's almost 10,000 lines. I can email
> it for perusal if anyone would like to look at it. There are only a few
> things I want to know about it:
> I want to find out (e.g. for BBC Four) what the BBC server id is for each
> BBC TV & Radio channel's (e.g. BBC Four) iPlayer programmes. I think that
> get-iplayer gets info from a BBC server (what is its url?) the channel name
> & url (for the iPlayer schedule data e.g. the url quoted in my first
> paragraph).
> 

What you have there is JSON embedded in a <script> tag inside HTML.
Thankfully, this data has its own <script> tag which is appropriately
type'd as application/ld+json, so it is not so hard to extract using
gb.xml.html and gb.web:

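  ' Needs the gb.xml.html and gb.web components
  Dim h As New HtmlDocument(your file here...)
  Dim el As XmlElement
  Dim data As New Collection[]
  For Each el In h.GetElementsByTagName("script")
    ' Skip every <script> tag that is not JSON-LD
    If el.Attributes["type"] <> "application/ld+json" Then Continue
    ' Decode the tag's text into nested Collections and arrays
    data.Add(JSON.Decode(el.TextContent))
  Next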

Apparently the block you quoted above is only the second ld+json block,
so you will find it in data[1] after the above loop. The data is structured
with many nested objects and arrays and the data[1] Collection will
mimic this structure: it contains Collections and Variant[] which contain
Collections and eventually you get strings and numbers. I suggest you set
a breakpoint after this loop and use the IDE debugger to inspect this
structure to see if it contains the data you are looking for and which
path through these nested objects leads to it.
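
If you prefer to see the structure on the Console instead of in the
debugger, a recursive dump along these lines may help. This is only a
minimal sketch: DumpJson and its variable names are made up for
illustration, and it assumes that JSON.Decode returns objects as
Collection and arrays as Variant[], as described above.

```gambas
' Hypothetical helper: print a decoded JSON value with indentation.
Public Sub DumpJson(vValue As Variant, Optional sIndent As String)

  Dim hObj As Object
  Dim cObj As Collection
  Dim aArr As Variant[]
  Dim vItem As Variant
  Dim iInd As Integer

  ' Strings, numbers, booleans and Null are printed as they are
  If TypeOf(vValue) <> gb.Object Then
    Print sIndent; vValue
    Return
  Endif

  hObj = vValue

  If hObj Is Collection Then
    ' JSON object: print each key, then its value one level deeper
    cObj = hObj
    For Each vItem In cObj
      Print sIndent; cObj.Key; ":"
      DumpJson(vItem, sIndent & "  ")
    Next
  Else If hObj Is Variant[] Then
    ' JSON array: print each index, then its element one level deeper
    aArr = hObj
    For iInd = 0 To aArr.Max
      Print sIndent; "["; iInd; "]"
      DumpJson(aArr[iInd], sIndent & "  ")
    Next
  Endif

End
```

Calling DumpJson(data[1]) after the loop should print the keys level by
level. Once you know where the TVEpisode object sits and have it in a
Collection (say cEpisode, again a made-up name), its fields would be
read with the keys from your extract, for example cEpisode["name"] or
cEpisode["partOfSeries"]["name"].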

Regards,
Tobias

-- 
"There's an old saying: Don't change anything... ever!" -- Mr. Monk

