How to get the number of pages in a document (doc, docx, xlsx, etc.) before converting to PDF

Hi,
Is there an efficient way to find the total number of pages in a document without rendering or converting it to PDF?
Background: I have a DOCX file with 7500 pages. Converting it to PDF takes around 26 minutes, but I would like to know the number of pages up front, quickly. Is that possible?
I tried the UNO API, but that also takes around 15 minutes.
Can the total number of pages be found quickly for large documents?
Another issue: when I convert only 3 pages to PDF, it still takes 15 minutes (converting all 7500 pages takes 26 minutes). Is there any way to optimize this?

I am open to any solution (using the UNO SDK, running soffice directly in headless mode, etc.).
Regards,
Satya

  1. Page count.
    You have a DOCX; for that file format, you may hope to find some information in the package’s docProps/app.xml (e.g., use an XPath lookup for "/Properties/Pages"). Note that neither this value nor the docProps/app.xml stream itself is required to exist or to be up to date.
    For ODT, the same can be said about the meta.xml stream, which may have "/office:document-meta/office:meta/meta:document-statistic/@meta:page-count".
    Trying to get the page count by actually opening a file in LibreOffice necessarily involves the whole process of parsing it and laying it out, so for a heavy document a long wait is expected. However, you are welcome to file performance bug reports with sample documents attached, which may be the only chance to actually find and improve some bottlenecks.

  2. “When I convert only 3 pages to pdf again it is taking 15 minutes” is also expected, based on the above: the whole document still has to be parsed and laid out before any page can be exported. See the “file perf bug reports” advice.
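
For the ODT case in point 1, here is a minimal Python sketch that reads the stored statistic straight from the package, without opening the document in LibreOffice. The function name odt_stored_page_count is hypothetical; the sketch assumes the file is an ordinary (unencrypted) ODT zip package, and it returns whatever value the last save recorded, which may be stale or missing.

```python
import zipfile
import xml.etree.ElementTree as ET
from typing import Optional

# ODF namespaces used by meta.xml
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"
META_NS = "urn:oasis:names:tc:opendocument:xmlns:meta:1.0"


def odt_stored_page_count(path: str) -> Optional[int]:
    """Return the page count recorded in meta.xml, or None if it is absent.

    Hypothetical helper: the value is only what the producing application
    wrote at save time, so it is a hint rather than a guarantee.
    """
    try:
        with zipfile.ZipFile(path) as package:
            root = ET.fromstring(package.read("meta.xml"))
    except (KeyError, zipfile.BadZipFile, ET.ParseError):
        # No meta.xml stream, not a zip package, or malformed XML
        return None

    # /office:document-meta/office:meta/meta:document-statistic/@meta:page-count
    stat = root.find(f"{{{OFFICE_NS}}}meta/{{{META_NS}}}document-statistic")
    if stat is None:
        return None
    value = stat.get(f"{{{META_NS}}}page-count")
    return int(value) if value and value.isdigit() else None
```

Calling odt_stored_page_count("MyDoc.odt") only unzips one small stream, so the cost should not depend on how many pages the document has.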

Code for @mikekaganski's answer.

' Returns the xml.dom.XDocument for an XML file inside a zip archive, or Nothing on failure
Function GetDocXML(ByVal filePath As String, ByVal xmlPath As String) As Object
  Dim oZip As Object
  GetDocXML = Nothing
  On Error GoTo ErrLabel
  ' Open the package as a zip archive; the document itself is not loaded
  oZip = com.sun.star.packages.zip.ZipFileAccess.createWithURL(ConvertToURL(filePath))
  ' Parse the requested stream into a DOM document
  GetDocXML = createUnoService("com.sun.star.xml.dom.DocumentBuilder").parse(oZip.getByName(xmlPath))
ErrLabel:
End Function

Sub TestGetDocxProps()
  Dim oDom As Object, colNodes As Object, tagName As String
  oDom = GetDocXML("C:\Temp\MyDoc.docx", "docProps/app.xml")
  ' GetDocXML returns Nothing if the file or the stream could not be read
  If IsNull(oDom) Then
    Msgbox "docProps/app.xml could not be read"
    Exit Sub
  End If
  tagName = "Pages"
  colNodes = oDom.getElementsByTagName(tagName)
  If colNodes.Length > 0 Then
    ' The stored value may be out of date, hence "expected"
    Msgbox "Expected pages: " & colNodes.item(0).FirstChild.NodeValue
  Else
    Msgbox tagName & " node not found"
  End If
End Sub

Is there similar code in C++?

I have provided code that uses UNO services. This code can be ported to any of the languages supported by LibreOffice.
Of course, Python and other high-level languages have their own tools for efficient implementation of the specified algorithm.
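
As an illustration of the "own tools" route, here is a minimal sketch of the same lookup in plain Python, using only the standard library and no UNO. The function name docx_stored_page_count is hypothetical. It matches the Pages element by local name, an assumption made so that the namespace used by the producing application does not matter, and it returns None when the stream or the element is missing.

```python
import zipfile
import xml.etree.ElementTree as ET
from typing import Optional


def docx_stored_page_count(path: str) -> Optional[int]:
    """Return the <Pages> value from docProps/app.xml, or None if unavailable.

    Hypothetical helper: the value is whatever the last save recorded,
    so it may be missing or out of date.
    """
    try:
        with zipfile.ZipFile(path) as package:
            root = ET.fromstring(package.read("docProps/app.xml"))
    except (KeyError, zipfile.BadZipFile, ET.ParseError):
        # Stream not present, file is not a zip package, or XML is malformed
        return None

    # Look at the direct children of <Properties> and compare local names,
    # ignoring the "{namespace}" prefix that ElementTree puts on each tag
    for child in root:
        local_name = child.tag.rsplit("}", 1)[-1]
        text = (child.text or "").strip()
        if local_name == "Pages" and text.isdigit():
            return int(text)
    return None
```

For the file from the Basic example, the call would be docx_stored_page_count(r"C:\Temp\MyDoc.docx"). Like the Basic version, it reads a single small stream from the archive and never lays out the document.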

The CLI is good enough:
unzip -p MyDoc.docx docProps/app.xml | perl -p0e 's/.+Pages>(\d+)<.+/$1/s'
