Code to read and parse XML data

Good Morning,

I have several thousand XML files that I need to parse and process. To that end, I was reviewing a possible solution: How to get an XML element text in a LibreOffice Basic macro? - Stack Overflow

Rather than use OLE Automation, I would prefer to use the native libraries provided by LO. Do you have any hints that I can follow up on? Thank you yet again.

Several methods; one is using an XFastParser or an XParser (there are respective FastParser and Parser services) - you would need to implement an XFastDocumentHandler (or an XDocumentHandler for XParser); another would be creating a custom XSLT filter… It really depends on what “process” actually means in your question :wink:


So, obviously, I want to load the XML document into a variable, then parse the document to find a node with a specific name. For that node, I check whether it has an attribute with a specific name and, if that attribute exists, whether its value (which will be non-numeric) is one of several possible values. If so, I store the XML file name in a Firebird embedded database, along with the value I am searching for, to be further analyzed by a different group of people.
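For readers who want to see the check described above in isolation, here is a minimal Python sketch using the standard library's ElementTree. The element name, attribute name, and allowed values are hypothetical placeholders, and the function name `find_matches` is made up for illustration. Storing the file name and value in the Firebird database is deliberately left out.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical names: replace with the real element, attribute, and values.
NODE_NAME = "Something"
ATTR_NAME = "Bar"
ALLOWED_VALUES = {"Some Value", "Another Value"}


def find_matches(source):
    """Return the allowed attribute values found in one XML document.

    `source` may be a file path or an open binary file object.
    """
    matches = []
    root = ET.parse(source).getroot()
    # iter() visits the root and all of its descendants with the given tag.
    for elem in root.iter(NODE_NAME):
        value = elem.get(ATTR_NAME)  # None when the attribute is absent
        if value is not None and value in ALLOWED_VALUES:
            matches.append(value)
    return matches


# In-memory sample so the sketch runs without a file on disk.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<xml>
    <Foo/>
    <Something Bar="Some Value"/>
    <SomethingElse>Baz</SomethingElse>
</xml>
"""

print(find_matches(io.BytesIO(SAMPLE)))
```

For the real job, one would presumably call `find_matches` once per file path and record the path together with each returned value.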

Showing an approach based on XFastParser:

' A handler, needs to be visible to allow returning from createUnknownChildContext
Dim oDocumentHandler
' End results will be stored here
Dim bFoundFoo As Boolean
Dim bFoundBar As Boolean
Dim sFoundBarValue As String
Dim bFoundBaz As Boolean

' Implementation of (some) methods of XFastDocumentHandler

Function DocHandler_setDocumentLocator(xLocator)
End Function

Function DocHandler_startDocument()
End Function

Function DocHandler_endDocument()
End Function

Function DocHandler_createUnknownChildContext(Namespace, Name, Attribs)
  DocHandler_createUnknownChildContext = oDocumentHandler
End Function

Function DocHandler_startUnknownElement(Namespace, Name, Attribs)
  If Name = "Foo" Then bFoundFoo = True
  unkAttribs = Attribs.getUnknownAttributes()
  For Each a In unkAttribs
    If a.Name = "Bar" Then
      bFoundBar = True
      sFoundBarValue = a.Value
    End If
  Next a
End Function

Function DocHandler_endUnknownElement(Namespace, Name)
End Function

Function DocHandler_characters(aChars)
  If aChars = "Baz" Then bFoundBaz = True
End Function

' Sample function processing a file
Sub Parse
  source = CreateUnoStruct("com.sun.star.xml.sax.InputSource")
  source.aInputStream = CreateUnoService("com.sun.star.ucb.SimpleFileAccess").openFileRead(ConvertToURL("path/to/file.xml"))
  parser = CreateUnoService("com.sun.star.xml.sax.FastParser")
  oDocumentHandler = createUnoListener("DocHandler_", "com.sun.star.xml.sax.XFastDocumentHandler")
  parser.setFastDocumentHandler(oDocumentHandler)
  parser.ParseStream(source)
  MsgBox bFoundFoo & " " & bFoundBar & " " & bFoundBaz & "; Bar was " & sFoundBarValue
End Sub

Given the code, and this path/to/file.xml:

<?xml version="1.0" encoding="UTF-8"?>
<xml>
    <Foo/>
    <Something Bar="Some Value"/>
    <SomethingElse>Baz</SomethingElse>
</xml>

the result of running Parse will be a message box with “True True True; Bar was Some Value”.

Indeed, this all may be also implemented in another scripting language (in anticipation of “use the real programming language!!!” :wink:) - but the technique of implementing a given UNO interface in Basic may be useful for some by itself.


This looks like a good starting point. Thank you for your informative code!

Well, I have run into a problem:

source = CreateUnoStruct("com.sun.star.xml.sax.InputSource")

gives me a null value

Please provide more info.

  1. Full Help|About information.
  2. Are you using a Basic macro, or are you maybe trying to use some external code (like VBS) to manipulate a secondary LibreOffice process?

In addition to @mikekaganski's answers:

  1. A meaningful example of using the sax.Parser service is given in section “5.38. Parsing XML” of A. Pitonyak's book AndrewMacro.odt.
  2. An alternative approach using the DOM model is described in another book by A. Pitonyak, OOME_4_0.odt, in sections “15.12.1 Read an XML file” and “15.12.2. Write XML File”.

I get a struct with 4 empty elements

(Name)        (Value Type)      (Value) (AccessMode)
aInputStream  .io.XInputStream  -void-  
sEncoding     string            ""      
sPublicId     string            ""      
sSystemId     string            ""

OK, I exited LO and started it again. In my previous instance, I had also loaded the ScriptForge library for other work I was doing. I guess that interfered with the XML effort.

Ah hah. I forgot about that AndrewMacro book. It’s on page 122.

For some reason I had the ScriptForge library loaded; I do not now, and it is working.

import xml.etree.ElementTree as etree

tree = etree.parse("test.xml")

def handle(tree):
    # Work on the root element; only its direct children are checked
    root = tree.getroot()
    b_foo = root.find("Foo") is not None
    b_bar = b_baz = False
    bar_value = ""
    for elem in root:
        # Remember the value of a "Bar" attribute, if the element has one
        if "Bar" in elem.attrib:
            bar_value = elem.attrib["Bar"]
            b_bar = True
        if elem.text == "Baz":
            b_baz = True
    return b_foo, b_bar, b_baz, bar_value


print(handle(tree))

or did I miss something?


@karolus no, you didn’t.

A general note: the code by @karolus uses a DOM-based parser, which loads the whole document into memory. If your XMLs are large, I’d suggest preferring SAX-based parsers. (Indeed, that’s also easy with Python.)
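As an illustration of the SAX route in Python, here is a minimal sketch using the standard library's `xml.sax` module. It collects the same four results as the examples in this thread. The class name `FooBarBazHandler` and the in-memory `SAMPLE` are made up for this sketch; a real run would pass a file path to `xml.sax.parse` instead.

```python
import io
import xml.sax


class FooBarBazHandler(xml.sax.ContentHandler):
    """Collects the Foo/Bar/Baz results while the document streams by."""

    def __init__(self):
        super().__init__()
        self.found_foo = False
        self.found_bar = False
        self.bar_value = ""
        self.found_baz = False
        self._text = []  # text chunks seen since the last start/end tag

    def startElement(self, name, attrs):
        self._text = []
        if name == "Foo":
            self.found_foo = True
        if "Bar" in attrs:
            self.found_bar = True
            self.bar_value = attrs["Bar"]

    def characters(self, content):
        # SAX may deliver a single text node in several chunks.
        self._text.append(content)

    def endElement(self, name):
        if "".join(self._text) == "Baz":
            self.found_baz = True
        self._text = []


# In-memory sample so the sketch runs without a file on disk.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<xml>
    <Foo/>
    <Something Bar="Some Value"/>
    <SomethingElse>Baz</SomethingElse>
</xml>
"""

handler = FooBarBazHandler()
xml.sax.parse(io.BytesIO(SAMPLE), handler)
print(handler.found_foo, handler.found_bar, handler.found_baz, handler.bar_value)
```

Joining the text chunks in `endElement` rather than comparing inside `characters` avoids missing a match when the parser splits the text.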

Hey, that’s a good solution as well.

Hello,
This sample is what I will start using for my work, so this will be my solution. Thank you.

Maybe https://lxml.de/ should also be mentioned here.

Update: there is an example of how to use lxml (among other things) to manipulate the settings for PDF export.