Is it possible to view and modify the OpenDocument XML of the current Writer document with StarBasic?

Is there a way to view, from a StarBasic macro, the OpenDocument XML encoding of any text content, e.g. a paragraph or a table, of the currently displayed Writer document?

For small documents you can save in .fodt format and examine the file with an editor, but for very large documents this is almost impossible, unless you copy and paste small fragments into a new document and save that as .fodt.

If it were possible to do this from an .odt (and not from a .fodt), could changes be made and saved as well?
Thanks for any suggestions.

(LO 7.3.5 - Ubuntu 16.04)

edit: I have some very large documents (> 300 pages) with a mix of direct formatting and styles. They are increasingly slow to manage, probably because of that mix and the uncontrolled growth of automatic styles.
I am going to transfer these files, one segment at a time, to a new “clean” .odt that uses only styles.
This is a very long job; if I could see the source XML encoding, I could limit the interventions to where they are really needed.

Edit your question to explain why you want to play with the XML. There may be better and simpler ways, like modifying some style, paragraph, character or page, to achieve the result. There is usually a one-to-one mapping between XML properties and styles.

There’s no XML in memory while working with a document model. Despite being aligned with ODF, the model itself is not XML-based. Only on export does the respective filter generate it.

To allow close inspection, the Style Inspector was introduced in 7.1. Although imperfect (it only shows paragraph and character properties, not frame, numbering, or other properties), it might be interesting to you.

Thanks @mikekaganski. I had not found any service or API with access to the underlying XML; your answer confirms it.
The Style Inspector is of great help. I use it together with my StarBasic macros that access the text portions in the paragraphs, if any, to limit their number and therefore the growth of the XML and of the automatic styles while transferring the content into the clean file, which has a stable style sheet and no direct formatting.
Direct access to the underlying XML would have been something more.
Thanks

Hi,

if I understand your situation correctly, it is a one-time task to streamline your 300-page .odt files. In this case I would recommend going through the normal API functions. The code below just shows a very simple sample of paragraph style modification.

The code also demonstrates how the internal XML representation of an .odt can be accessed through the AO API. In the example, it shows the sizes of content.xml and styles.xml before and after the modification through the AO API. Again, I recommend using the API and not direct XML modification. The latter has the potential to be faster, but for a one-time effort I would not take on the extra complexity.

Good luck,

ms777

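Sub Main
	oDoc = ThisComponent

	' Show the exported XML before touching anything
	msgbox "before"
	ShowContentAndStyles(oDoc)
	
	' Walk the body text; the enumeration yields paragraphs and tables
	oParaEnum = oDoc.Text.createEnumeration()
	while oParaEnum.hasMoreElements
		oPara = oParaEnum.NextElement
		' Tables have no ParaStyleName, so skip anything that is not a paragraph
		if oPara.supportsService("com.sun.star.text.Paragraph") then
			msgbox oPara.ParaStyleName
			' "StyleCourierBlue" must exist as a paragraph style in the document
			oPara.ParaStyleName = "StyleCourierBlue"
		endif
	wend

	' Show the exported XML again after the style change
	msgbox "after"
	ShowContentAndStyles(oDoc)
End Sub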

sub ShowContentAndStyles(oDoc)
	abContentXml = getZipContentAsByteArray(oDoc, "content.xml")
	abStylesXml = getZipContentAsByteArray(oDoc, "styles.xml")

	msgbox "content.xml len: " & (UBound(abContentXml) + 1) & ", styles.xml len: " & (UBound(abStylesXml) + 1)
'	sContentXml =  getTextInputStreamFromByteArray(abContentXml).readString(array(), true)
'	msgbox sContentXml
	
	ox = createUnoService("com.sun.star.xml.dom.DocumentBuilder")
	domContent = ox.parse(getTextInputStreamFromByteArray(abContentXml))
	domStyles = ox.parse(getTextInputStreamFromByteArray(abStylesXml))

	oAutomaticStyles = domContent.getElementsByTagNameNS("urn:oasis:names:tc:opendocument:xmlns:office:1.0", "automatic-styles").item(0)
	msgbox "Automated Styles in Content.xml:"+Chr(13)+Chr(10) + getXMLString(oAutomaticStyles)
	
	oCustomStyles = domStyles.getElementsByTagNameNS("urn:oasis:names:tc:opendocument:xmlns:office:1.0", "styles").item(0)
	msgbox "Styles in Styles.xml:"+Chr(13)+Chr(10) + getXMLString(oCustomStyles)
end sub

function getZipContentAsByteArray(oDoc as Object, sFile as String) as Object
	ox = createUnoService("com.sun.star.packages.zip.ZipFileAccess")

	' Temporary storage with one stream that will receive the exported package
	oStorageFac = createUnoService("com.sun.star.embed.StorageFactory") 
	oStorage    = oStorageFac.createInstance
	oStream     = oStorage.openStreamElement("ms777", com.sun.star.embed.ElementModes.READWRITE) 

	' Export the current document (as an ODF zip package) into that stream, not to disk
	Dim storeProps(0) as new com.sun.star.beans.PropertyValue
	storeProps(0).Name = "OutputStream"
	storeProps(0).Value = oStream
	oDoc.storeToUrl("private:stream", storeProps())
	
	' Open the exported package as a zip archive
	Dim args(1) as Object
	args(0) = oStream
	ox.initialize(args)
	
	' Read the requested member (e.g. content.xml) completely
	oContent = ox.getByname(sFile)
	lLength = oContent.available()
	Dim abyte(lLength) as byte
	lSuccess = oContent.readBytes(abyte, lLength)
	if lSuccess <> lLength then
		msgbox "Could not read all of " & sFile
		exit function
	endif
	getZipContentAsByteArray = abyte
End function

function getTextInputStreamFromByteArray(aByte) as Object
	oPipe = createUnoService("com.sun.star.io.Pipe")
	oTextInputStream = createUnoService("com.sun.star.io.TextInputStream")
  	oTextInputStream.setInputStream(oPipe)
	oPipe.writeBytes(aByte)
	oPipe.closeOutput()	
	getTextInputStreamFromByteArray = oTextInputStream
end function

rem thanks to Axel Richter http://de.openoffice.info/viewtopic.php?t=66828 

function getXMLString(oXMLElement as object) as string
  oDocumentBuilder = createUnoService("com.sun.star.xml.dom.DocumentBuilder")
  oDOMDocumentNew = oDocumentBuilder.newDocument()
  oXMLElementNew = oDOMDocumentNew.importNode(oXMLElement, true)
  oDOMDocumentNew.appendChild(oXMLElementNew)

  oPipe = createUnoService("com.sun.star.io.Pipe")
  oTextInputStream = createUnoService("com.sun.star.io.TextInputStream")
  oTextInputStream.setInputStream(oPipe)
      
  oDOMDocumentNew.setOutputStream(oPipe) 
  oDOMDocumentNew.start()
  oPipe.closeOutput()
  
  sXML =  oTextInputStream.readString(array(), true)
   
  getXMLString = sXML
end function

Styles.odt (15.3 KB)

Thank you so much!! Fantastic, I had never seen such an application or that category of LibreOffice services.
I would say that it solves my problem abundantly, and probably similar problems of others too.
You can post it as a solution. For me, the problem I posed is solved.
Thank you so much again!!

As already mentioned, there is no XML representation of the current state of the open document. However, you can view, search, and edit the written .odt file (a zipped archive; work on a copy) using an unzipper like WinZip or 7zip (the ones I use): pick and open the wrapped-in files, edit them, and write them back. I prefer Notepad++ with the XMLtools add-in in the rare cases I do such things.
The usability of the result will, of course, depend on what was changed.
If you know useful features of StarBasic(??) or the LibreOffice API(?) for the purpose, you can copy XML from the archive into a LibO document, work on it there (as a plain character string) using your own code, then copy/paste the result back, close the intermediary editor saving the content, and close the archive updating it.
(Personally, I only use the process now and then to repair corrupted files, specifically content.xml.)
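
For read-only inspection, the archive does not have to be unpacked at all. Here is a minimal sketch in Python (not StarBasic), using only the standard-library zipfile module. The function names list_members and read_member are made up for illustration, and the sketch assumes you point it at a copy of the .odt that is not currently open in LibreOffice.

```python
import zipfile
from typing import List, Tuple


def list_members(odt) -> List[Tuple[str, int, int]]:
    """Return (name, uncompressed size, compressed size) for each archive member.

    `odt` may be a path or a binary file-like object holding an .odt package.
    """
    with zipfile.ZipFile(odt) as zf:
        return [(i.filename, i.file_size, i.compress_size) for i in zf.infolist()]


def read_member(odt, name: str = "content.xml") -> str:
    """Return one member of the package (hypothetical helper), decoded as UTF-8."""
    with zipfile.ZipFile(odt) as zf:
        return zf.read(name).decode("utf-8")
```

Calling list_members on a document shows at a glance how large content.xml and styles.xml are once expanded, which is a quick way to judge whether opening them in an editor is realistic.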

Thanks Lupp, but I already worked with this procedure years ago: I unzip the whole .odt file into a folder, edit the content.xml file, and recompress the folder to get the .odt back.
From my tests at the time, however, the mimetype file had to be protected from compression (with the -x mimetype option); otherwise the .odt file could not be opened.
I don’t know if the procedure has changed today.
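
For reference, the ODF package specification asks for mimetype to be the first entry in the archive and to be stored without compression. A minimal sketch of the recompression step, written in Python rather than with the zip command: repack_odt is a hypothetical helper, and it assumes the unpacked files have already been read into a mapping of member name to bytes. It only repacks what it is given; it does not check that the result is a complete document (manifest, styles, and so on).

```python
import io
import zipfile
from typing import Mapping

ODT_MIMETYPE = b"application/vnd.oasis.opendocument.text"


def repack_odt(members: Mapping[str, bytes]) -> bytes:
    """Build an .odt package from a {member name: bytes} mapping."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        # mimetype is written first and stored, not deflated.
        zf.writestr("mimetype", members.get("mimetype", ODT_MIMETYPE),
                    compress_type=zipfile.ZIP_STORED)
        for name, data in members.items():
            if name != "mimetype":
                zf.writestr(name, data)  # deflated like every other member
    return buf.getvalue()
```

The returned bytes can be written to a new .odt file; keeping the original untouched makes it easy to go back if the edited content.xml turns out to be invalid.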
However, this procedure is essential, as you observed, for damaged files.
For cleaning the direct formatting out of a 1 MB .odt file (measured while still compressed) it is impractical, even when extracting small segments at a time.
Although laborious for files of hundreds of pages, it is faster to copy and paste the text without formatting into a new “clean” .odt file and reformat everything with styles.
The Style Inspector is helpful, though.
It would be a different matter to be able to see the XML encoding next to each paragraph, to judge whether and how much there is to clean.
But you are confirming that this is not possible, because the XML is not held in memory, nor is it reconstructed on request for individual paragraphs (which seemed theoretically possible to me, but only with services and APIs of which I have seen no trace).
Thanks again to everyone

Strange idea. These few bytes never were and never will be packed. They give information to software ordered to try to open the archive before it can know that there is an archive at all.

The procedure I described wouldn’t touch that. I obviously don’t understand your intentions and the issues with them sufficiently.
If you want to work on the styles.xml in Notepad++ opening it with F4 from 7z e.g, you won’t need any explicit extraction. Only the pathname for the editor (Np++ in my case) must be correctly given in the 7z options…
You may also use the layout features of the XMLtools extension without doing any harm.

I think it’s just a difference in procedure. You probably open the .odt file with 7z and work only on content.xml without touching the rest.
I used a longer (and less efficient) method: unpacking all the contents of the .odt file into a folder, editing content.xml, and recompressing the entire folder.
During recompression, of course, mimetype was compressed along with all the other files (originally it was not compressed).
That is why I had to protect mimetype from compression, unlike all the other files, which you didn’t need to do.
More importantly, you gave me two ideas I hadn’t thought of: working selectively on styles.xml alone to see the state of the automatic styles, and installing Notepad++ under Ubuntu, which is probably better suited to displaying large unpacked content.xml files (2-5 MB); opening such files with gedit (the Ubuntu system editor) crashes it (at least this happens to me).
Thanks Lupp

Saving as .fodt gives flat XML text, which could make the work easier.
https://wiki.documentfoundation.org/Libreoffice_and_subversion

Thanks Mariosv. I mentioned the .fodt format at the beginning.
Documentation aside, I learned the little I know about OpenDocument just by saving an .odt as .fodt and opening it with an editor.
That is how I realized how much the underlying XML grows because of direct formatting applied several times, the uncontrolled increase of automatic styles, and the growth of the “portions” in the paragraphs.
All this is irrelevant for small files, but devastating for .odt files of 1 MB or more, the files I work on.
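
The growth described above can be measured without reading the XML by eye. What follows is a rough sketch in Python (standard library only), not an established tool: it parses a .fodt file or an unpacked content.xml and reports the paragraphs that either use an automatic paragraph style or contain text:span portions. The function names are made up for illustration. The span count is only a heuristic, since a span can also carry a proper character style, and headings (text:h) are not examined.

```python
import xml.etree.ElementTree as ET
from typing import List, Set, Tuple

NS = {
    "office": "urn:oasis:names:tc:opendocument:xmlns:office:1.0",
    "style": "urn:oasis:names:tc:opendocument:xmlns:style:1.0",
    "text": "urn:oasis:names:tc:opendocument:xmlns:text:1.0",
}
TEXT = "{%s}" % NS["text"]  # ElementTree spells qualified names as {uri}local


def automatic_style_names(root: ET.Element) -> Set[str]:
    """Names of all style:style entries under office:automatic-styles."""
    auto = root.find("office:automatic-styles", NS)
    if auto is None:
        return set()
    return {s.get("{%s}name" % NS["style"]) for s in auto.findall("style:style", NS)}


def paragraphs_to_clean(xml_source: bytes) -> List[Tuple[int, str, int]]:
    """Return (index, style name, span count) for paragraphs worth a look.

    A paragraph is reported when it uses an automatic paragraph style
    or contains at least one text:span.
    """
    root = ET.fromstring(xml_source)
    auto = automatic_style_names(root)
    report = []
    for index, para in enumerate(root.iter(TEXT + "p")):
        style = para.get(TEXT + "style-name", "")
        spans = sum(1 for _ in para.iter(TEXT + "span"))
        if style in auto or spans:
            report.append((index, style, spans))
    return report
```

The index counts every text:p element in document order, so it is a guide to where the heavy paragraphs sit rather than an exact paragraph number as shown in Writer.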

Opening in an editor a .fodt that exceeds 5 MB when expanded, or unpacking a content.xml of similar size, is not practical.
LO uses rules to convert from OpenDocument to the UNO hierarchy on loading, and from the UNO hierarchy to OpenDocument on saving.
It is strange (but perhaps it was considered superfluous) that the Style Inspector offers no way to obtain the XML encoding of the selected text content (paragraph, portion of a paragraph, table, etc.) or of the current page only, since this conversion is done on saving.
However, the Style Inspector is already an important step forward.
I am more and more fascinated by what you can do with LibreOffice by combining its functionality with StarBasic programming, especially now that I use the latest version (I used 6.2 before).