[Gambas-user] How to Disassemble XML/HTML

Manu mtitouinfo at yahoo.fr
Mon Aug 10 15:14:50 CEST 2020


On 10/08/2020 at 09:48, John Rose wrote:
>> What you have there is JSON embedded in a <script> tag inside HTML.
>> Thankfully, this data has its own <script> tag which is appropriately
>> type'd as application/ld+json, so it is not so hard to extract using
>> gb.xml.html and gb.web:
>>
>>    Dim h As New HtmlDocument(your file here...)
>>    Dim el As XmlElement
>>    Dim data As New Collection[]
>>    For Each el In h.GetElementsByTagName("script")
>>      If el.Attributes["type"] <> "application/ld+json" Then Continue
>>      data.Add(JSON.Decode(el.TextContent))
>>    Next
>>
> 
> Thanks, Tobias, for your code. I tried the above code (with the Gambas components gb.net, gb.net.curl, gb.web, gb.xml and gb.xml.html included; was that overkill?) and it produced a large file named HTMLandXML.txt. I've extracted some of it to an attached file named PartOfHTMLandXML.txt. However, I have no idea what to do with it. There are largish blocks of 'code' for each TV programme. Some sample lines showing the data that I want to obtain (marked with asterisks) are:
> line 212:  <h3 class="programme__titles"><a href="*https://www.bbc.co.uk/programmes/m00049tf*"
> lines 214-215:   ><span class="programme__title delta"><span>*Have I Got a Bit More News for You*</span></span><span class="hidden">—</span><span class="programme__subtitle centi"><span>Series 57</span>, <span>*Episode 2*</span></span></a></h3>
> lines 219-220:  <abbr title="*Episode 2 of 9*"><span datatype="xsd:int">2</span>/<span class="programme__groupsize">9</span></abbr>                <span>*Guest host Alan Johnson joins Ian Hislop and Paul Merton for the 
> topical news quiz.*</span>
> I have no idea how to extract this data from the file. Which Gambas Component should I use? Is there an example/tutorial/book on how to extract this data in Gambas? I am a real newbie to HTML & XML etc!
> 
> 
> 
> ----[ http://gambaswiki.org/wiki/doc/netiquette ]----
> 
Hello, here is an example:

Public Function GetBBCProgram(Optional solution As Byte, Optional sdate As Date) As Collection

  Dim hc As New HttpClient, d As New HtmlDocument, hl As New Collection, hkey As String
  Dim res As String, f As XmlElement, n As XmlNode, vdate As String
  If Not sdate Then vdate = Format$(Now, "yyyymmdd") Else vdate = Format$(sdate, "yyyymmdd")
  With hc
   .URL = "https://www.bbc.co.uk/iplayer/guide/bbcfour/" & vdate
   .Timeout = 2
   .targetfile = "/tmp/foo.html"
  End With
  res = hc.Download(hc.url)
  'File.Save(hc.TargetFile, res)
  d.HtmlFromString(res)

Select Case solution
Case 1
   shunt:
   Print "1---"
   For Each f In d.Root.GetChildrenByClassName("schedule-item")
   'Print f.GetAttribute("class"), f.type, f.element, f.name, f.value
     Print f.value
     hkey = Left$(f.value, 5)
     hl.Add(Replace(f.value, hkey, ""), hkey)
   Next
   Return hl
Case 2
   Print "2----"
   For Each f In d.Root.GetChildrenByFilter("li[class*=sched]")
     Print f.GetAttribute("class"), f.type, f.element, f.name, f.value
   Next
Case 3
   Print "3--debug---elementnode"
   For Each n In d.Root.AllChildNodes
     If n.Type = 1 Then Print n.Attributes, "!!", n.Name, ":", n.Value
   Next
Case 4
   Print "4--debug---textnode"
   For Each n In d.Root.AllChildNodes
    If n.Type = 2 Then Print n.Attributes, "!!", n.Name, ":", n.Value
   Next
Case Else
   Goto shunt
End Select
End

Public Sub Main()
' solution is the first parameter, so pass it explicitly before the date
Dim foo As Collection = GetBBCProgram(1, DateAdd(Now, gb.Day, 1)) 'tomorrow
For Each item As Variant In foo
   Print foo.key, "!!", item
Next
End
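
For the titles John quoted, a minimal sketch along the same lines could look like the following. It assumes the HTML string contains the <span class="programme__title delta"> markup shown in his sample lines, and that GetChildrenByClassName matches an element carrying more than one class; neither is checked here, and PrintProgrammeTitles is just a hypothetical name. It needs gb.xml and gb.xml.html.

```gambas
' Sketch only: print the text of every element with class "programme__title".
' sHtml is assumed to hold the page source (e.g. the res string above).
Public Sub PrintProgrammeTitles(sHtml As String)

  Dim hDoc As New HtmlDocument
  Dim hEl As XmlElement

  ' Same loading call as in GetBBCProgram above
  hDoc.HtmlFromString(sHtml)

  ' Class name taken from John's sample lines; adjust it if the page differs
  For Each hEl In hDoc.Root.GetChildrenByClassName("programme__title")
    Print Trim(hEl.TextContent)
  Next

End
```

If nothing is printed, the Case 2 filter style above (GetChildrenByFilter with a "[class*=...]" selector) may be worth trying instead.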

You can also use "lynx -dump" or "w3m -dump". Bye.
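
A rough sketch of the lynx route from inside Gambas, assuming lynx is installed and on the PATH. The URL is the same one used in GetBBCProgram; -nolist is the lynx option that leaves the link list out of the dump.

```gambas
' Sketch only: dump the schedule page as plain text and print non-empty lines.
Public Sub Main()

  Dim sUrl As String = "https://www.bbc.co.uk/iplayer/guide/bbcfour/" & Format$(Now, "yyyymmdd")
  Dim sText As String
  Dim sLine As String

  ' Exec takes the command as a string array, so no shell quoting is needed
  Exec ["lynx", "-dump", "-nolist", sUrl] To sText

  For Each sLine In Split(sText, "\n")
    If Trim(sLine) <> "" Then Print sLine
  Next

End
```

The text dump loses the HTML structure, so this is mainly useful for a quick look at what the page contains.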

