I am tasked with deleting all but the first page of over 10k .docx files. I have learned, however, that the docx format fundamentally has no notion of a “page”: pages are computed at render time from the headers, font sizes, and other layout parameters. What are my options, given that I can’t disturb any of the styling?
Export to PDF and manipulate the PDF.
I have tried this approach. It works visually, but contiguous text is no longer preserved as such, since PDF is a visual format. Extracting the text through libraries returns mostly gibberish.
Write a macro or Python script to open the files one by one, analyse the page layout and delete the corresponding text ranges.
Use a virtual PDF printer or Writer’s export options and specify only the wanted pages.
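For what it’s worth, the “export only the wanted pages” idea can be sketched from the command line. This is a minimal sketch, not a tested recipe: it assumes a recent LibreOffice whose `--convert-to` accepts JSON filter options for `writer_pdf_Export` (including `PageRange`), and the helper name `first_page_pdf_command` is made up for illustration.

```python
import json


def first_page_pdf_command(doc_path, out_dir, soffice="soffice"):
    """Build (but do not run) a soffice command exporting only page 1 to PDF.

    Assumption: the installed LibreOffice understands JSON filter options
    after the filter name; older releases may ignore or reject them.
    """
    options = json.dumps(
        {"PageRange": {"type": "string", "value": "1"}},
        separators=(",", ":"),  # compact JSON, no spaces inside the argument
    )
    return [
        soffice,
        "--headless",
        "--convert-to", f"pdf:writer_pdf_Export:{options}",
        "--outdir", str(out_dir),
        str(doc_path),
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. Note that this only produces a one-page PDF; it does not give you back a one-page docx.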
Sorry for the misunderstanding. What I mean is that, since I am not working with English text, the PDFs preserve the text only visually, so Python libraries are unable to retrieve the original docx text correctly.
Use any software to export the files to PDF. Then use a dedicated PDF tool to get the first page.
And why not use MS Word + VBA? Docx is the native file format of that software, and it seems to be very easy with a bit of VBA: see “Pages object (Word)” on Microsoft Learn.
Thank you for your answer. I am on Ubuntu so it will be a journey and a half to configure being able to use Word, but I guess it is my only way.
Find some PC running Windows and Word. There are plenty of them.
I doubt that it’s that simple, even when you know the position. If the word falls in the middle of a table or of a section, it will require some work to keep the XML valid after the removal.
You are correct. To my great dismay, I have found that the breaks are only generated once I open and close the odts manually. For such a large number of documents this is impossible. I guess I need to keep looking.
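If it helps to find out which files already carry saved page-break hints without opening them, something like the following could be a starting point. It is a rough sketch under assumptions: that the hints of interest are `w:lastRenderedPageBreak` in a docx’s `word/document.xml` and `text:soft-page-break` in an odt’s `content.xml`, and that whatever application last saved the file actually wrote them. The function name is hypothetical.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
ODF_TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Archive member and fully qualified element name to look for, per file type.
MARKERS = {
    ".docx": ("word/document.xml", f"{{{W_NS}}}lastRenderedPageBreak"),
    ".odt": ("content.xml", f"{{{ODF_TEXT_NS}}}soft-page-break"),
}


def count_saved_page_breaks(source, suffix):
    """Count page-break hints stored in a document archive.

    `source` is a path or a binary file-like object; `suffix` selects the
    format (".docx" or ".odt"). A result of 0 means no hints were saved,
    not that the document has a single page.
    """
    member, tag = MARKERS[suffix]
    with zipfile.ZipFile(source) as archive:
        root = ET.fromstring(archive.read(member))
    return sum(1 for _ in root.iter(tag))
```

Files that report 0 would still need a real layout pass (i.e. being opened in Writer or Word) before anyone can say where page 1 ends.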
Create a macro that would go to the next page, select to the end, delete, save and close.
There would still be problems. E.g., it’s impossible to select from the middle of a table to the end of document…
doc = ThisComponent
cursor = doc.CurrentController.getViewCursor()
if cursor.jumpToNextPage() then
    cursor.gotoEnd(true)
    cursor.setString("")
end if
doc.storeToURL(aNewURL, array())
doc.close(true)
The macro would need to build the
You deleted your additional question, but the answer is yes you can execute macros from command-line:
In the meantime I did a little googling and arrived at an answer that almost works: I saved this macro and tried to run it through the soffice CLI, but for some reason --headless doesn’t seem to be working. The GUI still opens for a split second before closing, which is extremely slow for my use case. What do you think I am doing wrong?
soffice --headless --nologo --invisible --nofirststartwizard macro:///Standard.Module1.del doc0.docx
You noticed correctly: you cannot use macros with headless like that.
In your case, it would be easier to use normal mode, and modify the macro to open, process, and close files in a loop. This way, LibreOffice will not need to open and close repeatedly.
Thanks for the insight. That’s quite unfortunate, since I need to process at the very least 100k documents. Without headless it will take an eternity. As an aside, can you think of any faster way to detect the last word on the first page, OR simply delete all but the first page? I’ve been losing my mind over this problem for three days now, to no avail. Sorry for the bother.
I don’t see why a macro like the one I described above couldn’t handle 100k documents just as efficiently. Or you can control LibreOffice headless from an external script, e.g. in Python.
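A rough sketch of the external-script route, in case it is useful. It assumes the Python that runs it has the PyUNO bridge available (e.g. LibreOffice’s bundled Python), that port 2002 is free, and that the view cursor behaves in a hidden document the same way it does in the macro above; none of that is verified here. The function names, and the choice of port, are made up for illustration.

```python
import os

PORT = 2002  # arbitrary choice; any free local port should do


def listener_command(port=PORT, soffice="soffice"):
    """Command that starts one long-lived headless LibreOffice listening on a socket."""
    return [soffice, "--headless", "--norestore",
            f"--accept=socket,host=localhost,port={port};urp;"]


def connect(port=PORT):
    """Connect to the running instance and return its Desktop object."""
    import uno  # imported lazily: only available where PyUNO is installed
    local_ctx = uno.getComponentContext()
    resolver = local_ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.bridge.UnoUrlResolver", local_ctx)
    ctx = resolver.resolve(
        f"uno:socket,host=localhost,port={port};urp;StarOffice.ComponentContext")
    return ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.frame.Desktop", ctx)


def _prop(name, value):
    """Small helper to build a UNO PropertyValue."""
    from com.sun.star.beans import PropertyValue
    p = PropertyValue()
    p.Name, p.Value = name, value
    return p


def keep_first_page(desktop, src_path, dst_path):
    """Same steps as the Basic macro: jump to page 2, select to end, delete, save."""
    import uno
    src_url = uno.systemPathToFileUrl(os.path.abspath(src_path))
    dst_url = uno.systemPathToFileUrl(os.path.abspath(dst_path))
    doc = desktop.loadComponentFromURL(src_url, "_blank", 0, (_prop("Hidden", True),))
    try:
        cursor = doc.getCurrentController().getViewCursor()
        if cursor.jumpToNextPage():  # False when the document has a single page
            cursor.gotoEnd(True)
            cursor.setString("")
        # Save a copy as docx instead of overwriting the original.
        doc.storeToURL(dst_url, (_prop("FilterName", "MS Word 2007 XML"),))
    finally:
        doc.close(True)
```

The idea would be to start the listener once with `subprocess.Popen(listener_command())`, call `connect()`, then loop `keep_first_page` over all files so the office process is not restarted per document. The caveat raised earlier still applies: selecting from the middle of a table to the end of the document may not work.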
I see. Since I had only seen the macro run on one file at a time, I assumed a single loop would take about as long. I will try this approach and let you know how it goes. Thanks for the help.
In addition, you can open Word documents in hidden mode, which will not create new windows.
You also need to consider that documents may be in a change-tracking state. In this case, you must end tracking before deleting pages.