Get page number for all headings/titles in a docx file

I have a .docx document and would like to do the following: list all the headings (titles, subtitles, etc.) and their page numbers, programmatically.

I experimented with the UNO SDK and Python, but I find it difficult to understand because there is no auto-completion in my IDE (the uno library being compiled…). It also seems that the cursor used to get the text’s style (to know whether it is a heading) is not the same as the one used to get page numbers.

If anyone knows how to link those two cursors, or knows another way to get page numbers, please comment below!

Do you want to build a TOC? There is a more straightforward and easy way to do it. Make sure your headings are formatted with paragraph styles Heading n (n=1 for chapters, 2 for sub-chapters, …).

Place the cursor where you want the TOC, then choose Insert > Table of Contents and Index > Table of Contents, Index or Bibliography.

The thing is that I have thousands of documents, so it has to be done programmatically: I need an array holding the information, not “simply” a TOC in the document that can only be processed manually.

Perhaps you might then consider editing your question to reflect the real situation; others will then know what suggestions (not) to make.

It doesn’t look like I can edit it, unfortunately.

An instance of the TextLayoutCursor service implements both XParagraphCursor and XPageCursor, so it should give you a way to travel by paragraphs (getting their properties) and to obtain the page of the corresponding position.

See the “14.5. Cursors” section of the wonderful book by Andrew Pitonyak OOME_4_1.odt.

Try this:

Option Explicit

' Shows document headings and page numbers.
Sub ShowHeadingPageNumber()
  Dim oDoc as Object, oPar As Object, oViewCursor As Object, oCalcDoc as Object
  Dim arr(9999), ind As Long, nPage As Long
  arr(0)=Array("OutlineNum", "Page", "Text")
  oViewCursor=ThisComponent.CurrentController.getViewCursor()
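  ' Walk the top-level text content; tables are filtered out by the service check.
  For Each oPar In ThisComponent.Text
    If oPar.supportsService("com.sun.star.text.Paragraph") Then
      If oPar.outlineLevel>0 Then  ' 0 = body text, 1 and up = heading levels
        oViewCursor.gotoRange oPar.Start, False  ' move the view cursor to the heading
        nPage=oViewCursor.Page
        ind=ind+1
        arr(ind)=Array(oPar.outlineLevel, nPage, oPar.String)
      End If
    End If
  Next oPar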
  
  If ind>0 Then  ' Show array in new Calc document
    ReDim Preserve arr(ind)
    oCalcDoc=StarDesktop.LoadComponentFromUrl("private:factory/scalc","_default",0,Array())
    oCalcDoc.Sheets(0).GetCellRangeByPosition(0, 0, 2, ind).setDataArray arr
  End If
End Sub
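
Since the question mentions Python, here is a rough port of the macro above, offered as a minimal sketch rather than a tested solution. The function name collect_headings is made up for illustration. It assumes doc is an already opened Writer document that has a controller (and therefore a view cursor), and it returns a plain list instead of filling a Calc sheet.

```python
def collect_headings(doc):
    """Return a list of (outline_level, page, text) tuples for a Writer document.

    `doc` is assumed to be a Writer document obtained through UNO.
    """
    view_cursor = doc.getCurrentController().getViewCursor()
    headings = []
    paragraphs = doc.getText().createEnumeration()
    while paragraphs.hasMoreElements():
        par = paragraphs.nextElement()
        # Tables are enumerated at this level too; skip anything that is not a paragraph.
        if not par.supportsService("com.sun.star.text.Paragraph"):
            continue
        level = par.getPropertyValue("OutlineLevel")
        if level > 0:  # 0 means body text
            # Move the view cursor to the heading and ask it for its page.
            view_cursor.gotoRange(par.getStart(), False)
            headings.append((level, view_cursor.getPage(), par.getString()))
    return headings
```

Inside a Python macro, doc could come from XSCRIPTCONTEXT.getDocument(). For batch processing, each file would have to be loaded first (for example with loadComponentFromURL) and closed afterwards.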

I don’t quite understand your task and its purpose. You can actually navigate through the entire document using your cursor, just as you originally intended. You can use the technique suggested by @mikekaganski and kindly demonstrated by @sokol92: iterate through all the paragraphs in the text and use the cursor only to get the page number. But I would not discard @ajlittoz’s proposal regarding the Table of Contents. In fact, why not use a ready-made tool? Open the document, create a Table of Contents, read it, and close the document without saving. For example, like this: Collect Headers From docx.odt (12.7 KB)
I don’t know, maybe this can be written shorter in Python and maybe it will work faster. Just try this.
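
The "create a TOC, read it, close without saving" idea might look roughly like this in Python. This is a sketch under assumptions, not the content of the attached file. The names parse_toc_text and read_headings_via_toc are hypothetical. It assumes the generated index text has one entry per line, with a tab before an Arabic page number (the default entry layout).

```python
def parse_toc_text(toc_text):
    """Split TOC text into (heading, page) pairs.

    Lines that do not end in a tab followed by digits (such as the
    index title) are skipped.
    """
    entries = []
    for line in toc_text.splitlines():
        heading, sep, page = line.rpartition("\t")
        if sep and page.strip().isdigit():
            entries.append((heading.strip(), int(page)))
    return entries


def read_headings_via_toc(doc):
    """Insert a temporary TOC at the end of `doc` and return its parsed entries.

    `doc` is assumed to be a Writer document obtained through UNO; the caller
    is expected to close it without saving.
    """
    text = doc.getText()
    cursor = text.createTextCursor()
    cursor.gotoEnd(False)
    toc = doc.createInstance("com.sun.star.text.ContentIndex")
    toc.setPropertyValue("CreateFromOutline", True)
    text.insertTextContent(cursor, toc, False)
    toc.update()
    return parse_toc_text(toc.getAnchor().getString())
```

The index is placed at the end of the document so that inserting it does not shift the pages of the headings it lists.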

I tried to detect the page number using only a TextCursor, because the visible (view) cursor did not follow the TextCursor quickly enough when I opened the document hidden, set .lockControllers, and then put a lot of formatted text and images into the document via macro. The view cursor was sometimes too slow and returned a wrong page number (for example, it returned the number of the previous page twice, instead of the previous page and then the current page).

The “hack” is to insert at the TextCursor some object that has a page-number property, anchor it to the page, read its page number, and then delete it.

Sub pageForTextCursor
	dim oDoc as object, oCur as object, oShape as object, oSize as new com.sun.star.awt.Size, iPage&
	oDoc=ThisComponent
	oCur=oDoc.Text.createTextCursor
	oCur.goToEnd(false) 'move the cursor to the position of interest
	rem insert a zero-size Shape at the TextCursor, to be anchored to the page
	with oSize 'for testing you can set the size to e.g. 10x10
		.Width=0
		.Height=0
	end with
	oShape=oDoc.createInstance("com.sun.star.drawing.PluginShape") 'a lightweight object
	with oShape
		.TextWrap=1
		.AnchorType=4 'first anchor to character (because .insertTextContent ignores anchor to page)
		.Size=oSize
		.AllowOverlap=true
		.Opaque=true
	end with
	oDoc.Text.insertTextContent(oCur, oShape, false) 'insert to TextCursor
	oShape.AnchorType=2 'Anchor to Page
	iPage=oShape.AnchorPageNo 'get the Page number
	oDoc.DrawPage.remove(oShape) 'delete Shape
	msgbox iPage
End Sub