How do I delete a page from a docx through the LibreOffice CLI?

amancapy · April 8, 2023, 10:30am

I am tasked with deleting all but the first page of over 10k .docx files. But I have learned that the docx format fundamentally has no notion of a “page”: pages are calculated at render time from the header, font size, and other parameters. What are my options, given that I can’t disturb any of the styling?

Villeroy · April 8, 2023, 11:13am

Export to PDF and manipulate the PDF.

amancapy · April 8, 2023, 11:15am

I have tried this approach, and while it works visually, contiguous text is no longer preserved as such, since PDFs are visual data structures. Extracting the text through libraries returns mostly gibberish.

Villeroy · April 8, 2023, 11:17am

Write a macro or Python script to open the files one by one, analyse the page layout and delete the corresponding text ranges.

Villeroy · April 8, 2023, 11:18am

Use a virtual PDF printer or Writer’s export options and specify only the wanted pages.
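For the export-options route, here is a minimal sketch of the command line, built in Python. It assumes LibreOffice 7.4 or later, where filter options can be passed as JSON after the filter name. The function name `pdf_first_page_command` and the output folder are hypothetical, chosen only for illustration.

```python
import json


def pdf_first_page_command(src: str, out_dir: str) -> list[str]:
    """Build a soffice command that exports only page 1 of `src` to PDF."""
    # Assumption: the JSON form of filter options (LibreOffice 7.4+).
    options = {"PageRange": {"type": "string", "value": "1"}}
    target = "pdf:writer_pdf_Export:" + json.dumps(options, separators=(",", ":"))
    return ["soffice", "--headless", "--convert-to", target,
            "--outdir", out_dir, src]
```

The resulting list can be handed to `subprocess.run(..., check=True)`. Passing the arguments as a list avoids shell quoting problems with the JSON.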

amancapy · April 8, 2023, 11:21am

Sorry for the misunderstanding. What I mean is that since I am not working with English text, the PDFs preserve the text only visually, so Python libraries are unable to retrieve the original docx text correctly.

Villeroy · April 8, 2023, 11:23am

Use any software to export the files to PDF. Then use a dedicated PDF tool to get the first page.
https://qpdf.readthedocs.io/en/stable/
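As a rough sketch of the qpdf step, the page selection could look like this when built in Python. The function name and file names are hypothetical; only the `--pages . 1 --` selection syntax comes from qpdf itself.

```python
def qpdf_first_page_command(src_pdf: str, dst_pdf: str) -> list[str]:
    """Build a qpdf command that copies only page 1 of `src_pdf`."""
    # "." refers to the primary input file; "1" is the page range to keep.
    return ["qpdf", src_pdf, "--pages", ".", "1", "--", dst_pdf]
```

Run it with `subprocess.run(qpdf_first_page_command("doc0.pdf", "doc0_p1.pdf"), check=True)`, assuming qpdf is installed and on the PATH.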

Villeroy · April 8, 2023, 11:26am

And why not use MS Word + VBA? Docx is the native file format of that software, and it seems to be very easy with a bit of VBA: Pages object (Word) | Microsoft Learn

amancapy · April 8, 2023, 11:27am

Thank you for your answer. I am on Ubuntu, so getting Word to run will be a journey and a half, but I guess it is my only way.

Villeroy · April 8, 2023, 11:37am

Find some PC running Windows and Word. There are plenty of them.

mikekaganski · April 8, 2023, 12:42pm

I doubt that it’s that simple, even when you know the position. If the word happens to be in the middle of a table or of a section … it will require some work to keep the XML valid after the removal.

amancapy · April 8, 2023, 1:10pm

You are correct. To my great dismay, I have found that the breaks are only generated once I open and close the .odt files manually. For such a large number of documents this is impossible. I guess I need to keep looking.

mikekaganski · April 8, 2023, 1:49pm

Create a macro that would go to the next page, select to the end, delete, save and close.
There would still be problems. E.g., it’s impossible to select from the middle of a table to the end of document…

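' Works on the active document (ThisComponent).
doc = ThisComponent
cursor = doc.CurrentController.getViewCursor()
' jumpToNextPage() returns False when there is no further page.
if cursor.jumpToNextPage() then
  cursor.gotoEnd(true)   ' extend the selection to the end of the document
  cursor.setString("")   ' delete the selected text
end if
' aNewURL is not defined in this snippet; see the note below.
doc.storeToURL(aNewURL, array())
doc.close(true)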

The macro would need to build the aNewURL.
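One way to build such a URL, sketched in Python rather than Basic since the thread also suggests scripting from Python. The helper name `build_output_url` and the idea of writing into a separate output folder are assumptions for illustration, not part of the macro above.

```python
from pathlib import Path


def build_output_url(src: Path, out_dir: Path) -> str:
    """Return a file:// URL for a copy of `src` placed in `out_dir`.

    Hypothetical helper: it keeps the original file name and only changes
    the folder, so the source documents are never overwritten.
    """
    target = (out_dir / src.name).absolute()
    # as_uri() percent-encodes spaces and non-ASCII characters.
    return target.as_uri()


if __name__ == "__main__":
    print(build_output_url(Path("doc0.docx"), Path("/tmp/first_pages")))
```

Percent-encoding matters here because the file names are probably not plain ASCII.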

Wanderer · April 9, 2023, 10:39am

You deleted your additional question, but the answer is yes, you can execute macros from the command line:

amancapy · April 9, 2023, 10:45am

In the meantime I did a little googling and arrived at an answer that almost works: I saved this macro and tried to run it through the soffice CLI. For some reason --headless doesn’t seem to be working: the GUI still opens for a split second before closing, which is extremely slow for my use case. What do you think I am doing wrong?

soffice --headless --nologo --invisible --nofirststartwizard macro:///Standard.Module1.del doc0.docx

mikekaganski · April 9, 2023, 11:14am

You noticed that right: you cannot use macros with headless like that.

In your case, it would be easier to use normal mode, and modify the macro to open, process, and close files in a loop. This way, LibreOffice will not need to open and close repeatedly.

amancapy · April 9, 2023, 12:33pm

Thanks for the insight. That’s quite unfortunate, since I need to process at the very least 100k documents. Without headless it will take an eternity. As an aside, can you think of any faster way to detect the last word on the first page, OR simply delete all but the first page? I’ve been losing my mind over this problem for three days now to no avail. Sorry for the bother.

mikekaganski · April 9, 2023, 12:39pm

I don’t see why a macro like the one I described above couldn’t handle 100k documents just as efficiently. Or you can control a headless LibreOffice from an external script, e.g. in Python.
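A minimal sketch of the external-script route, assuming a single LibreOffice instance was started beforehand with `soffice --headless --accept="socket,host=localhost,port=2002;urp;"` and that the script runs under a Python that has LibreOffice's `uno` module available. The port number, the function names, and the choice to save back as docx with the "MS Word 2007 XML" filter are assumptions. The table and section caveats mentioned earlier still apply.

```python
def delete_after_first_page(doc) -> bool:
    """Delete everything from the start of page 2 to the end of `doc`.

    Returns False (and changes nothing) when there is no second page.
    """
    cursor = doc.getCurrentController().getViewCursor()
    cursor.jumpToFirstPage()
    if not cursor.jumpToNextPage():
        return False
    cursor.gotoEnd(True)   # extend the selection to the end of the text
    cursor.setString("")   # replace the selection with nothing
    return True


def connect(port: int = 2002):
    """Return the Desktop of a soffice process listening on `port`."""
    import uno  # provided by the LibreOffice installation

    local_ctx = uno.getComponentContext()
    resolver = local_ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.bridge.UnoUrlResolver", local_ctx)
    ctx = resolver.resolve(
        f"uno:socket,host=localhost,port={port};urp;StarOffice.ComponentContext")
    return ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.frame.Desktop", ctx)


def process(desktop, src_url: str, dst_url: str) -> None:
    """Open one file, trim it to its first page, and store the result."""
    import uno  # noqa: F401  (enables the com.sun.star imports)
    from com.sun.star.beans import PropertyValue

    doc = desktop.loadComponentFromURL(src_url, "_blank", 0, ())
    try:
        delete_after_first_page(doc)
        docx = PropertyValue()
        docx.Name = "FilterName"
        docx.Value = "MS Word 2007 XML"  # assumption: keep the docx format
        doc.storeToURL(dst_url, (docx,))
    finally:
        doc.close(True)
```

Calling `connect()` once and then `process(...)` in a loop over the file URLs keeps one LibreOffice process alive for the whole batch, which is the point of the suggestion above.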

amancapy · April 9, 2023, 12:47pm

I see. Since I’ve only seen the macro run on one file per launch, I assumed looping inside a single instance would take about the same time. I will try this approach and let you know how it goes. Thanks for the help.

sokol92 · April 9, 2023, 12:51pm

In addition, you can open Word documents in hidden mode, which will not create new windows.
You also need to consider that documents may be in a change-tracking state. In this case, you must end tracking before deleting pages.
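Both points can be sketched in Python with UNO. The function names are hypothetical. `Hidden` is a load argument passed as the fourth parameter of `loadComponentFromURL`, and `RecordChanges` is the document property that switches recording on and off. Note the assumption: this only stops recording; changes already recorded in the file remain and would have to be accepted or rejected separately.

```python
def hidden_load_args():
    """Load arguments that open a document without creating a window."""
    import uno  # noqa: F401  (provided by the LibreOffice installation)
    from com.sun.star.beans import PropertyValue

    hidden = PropertyValue()
    hidden.Name = "Hidden"
    hidden.Value = True
    return (hidden,)


def stop_change_tracking(doc) -> bool:
    """Turn off change recording; return whether it had been on."""
    was_recording = bool(doc.RecordChanges)
    doc.RecordChanges = False
    return was_recording
```

A call such as `desktop.loadComponentFromURL(url, "_blank", 0, hidden_load_args())` followed by `stop_change_tracking(doc)` would then precede any deletion.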