Use a virtual PDF printer or Writer’s export options and specify only the wanted pages.
Sorry for the misunderstanding. What I mean is that, since I am not working with English text, the PDFs preserve the text only visually, so Python libraries are unable to retrieve the text correctly from the .docx files.
Use any software to export the files to PDF. Then use a dedicated PDF tool to get the first page.
https://qpdf.readthedocs.io/en/stable/
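As a rough sketch of the two-step route suggested above (export to PDF, then cut out page 1 with qpdf), the commands could be assembled from Python as shown below. This assumes that both `soffice` and `qpdf` are on the PATH; the function names are made up for illustration, and the whole thing is untested against real documents.

```python
import subprocess
from pathlib import Path


def pdf_export_command(docx: Path, out_dir: Path) -> list[str]:
    """Command line asking LibreOffice to convert one document to PDF."""
    return ["soffice", "--headless", "--convert-to", "pdf",
            "--outdir", str(out_dir), str(docx)]


def exported_pdf_path(docx: Path, out_dir: Path) -> Path:
    """Where the converted file is expected: same base name, .pdf extension."""
    return out_dir / (docx.stem + ".pdf")


def first_page_command(pdf: Path, first_page_pdf: Path) -> list[str]:
    """qpdf command that copies only page 1 of `pdf` into a new file."""
    return ["qpdf", "--empty", "--pages", str(pdf), "1", "--",
            str(first_page_pdf)]


def convert_first_page(docx: Path, out_dir: Path) -> Path:
    """Run both steps for one document (hypothetical helper)."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(pdf_export_command(docx, out_dir), check=True)
    pdf = exported_pdf_path(docx, out_dir)
    result = out_dir / (docx.stem + "_page1.pdf")
    subprocess.run(first_page_command(pdf, result), check=True)
    return result
```

Calling `convert_first_page(Path("doc0.docx"), Path("out"))` would then leave `out/doc0_page1.pdf` behind, provided both tools succeed.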
And why not use MS Word + VBA? Docx is the native file format of that software, and it seems to be very easy with a bit of VBA; see "Pages object (Word)" on Microsoft Learn.
Thank you for your answer. I am on Ubuntu, so it will be a journey and a half to get Word set up, but I guess it is my only way.
Find some PC running Windows and Word. There are plenty of them.
I doubt that it’s that simple, even when you know the position. If the word happens to be in the middle of a table, or of a section … it will require some work to keep the XML valid after the removal.
You are correct. To my great dismay, I have found that the breaks are only generated once I open and close the ODTs manually. For such a large number of documents this is impossible. I guess I need to keep looking.
Create a macro that would go to the next page, select to the end, delete, save and close.
There would still be problems. E.g., it’s impossible to select from the middle of a table to the end of document…
doc = ThisComponent
cursor = doc.CurrentController.getViewCursor()
If cursor.jumpToNextPage() Then
    cursor.gotoEnd(True)
    cursor.setString("")
End If
doc.storeToURL(aNewURL, Array())
doc.close(True)
The macro would need to build aNewURL.
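To illustrate what "building the URL" involves: LibreOffice expects a file URL rather than a plain path, and each output needs its own name. Inside Basic, ConvertToURL does the path-to-URL part. If the batch is driven from a Python script instead (an assumption, not something the macro above does), a minimal sketch could look like this; `build_new_url` and the `_page1` suffix are invented for the example.

```python
from pathlib import Path


def build_new_url(source: Path, out_dir: Path, suffix: str = "_page1") -> str:
    """Return a file:// URL for the trimmed copy of `source` inside `out_dir`.

    The output keeps the original extension and gets `suffix` appended to
    the base name, so the source file is never overwritten.
    """
    target = out_dir.absolute() / f"{source.stem}{suffix}{source.suffix}"
    return target.as_uri()
```

The resulting string can be passed wherever a URL such as aNewURL is expected.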
You deleted your additional question, but the answer is yes you can execute macros from command-line:
In the meantime I did a little googling and arrived at an answer that almost works: I saved this macro and tried to run it through the soffice CLI, but for some reason --headless doesn’t seem to be working. The GUI still opens for a split second before closing, which is extremely slow for my use case. What do you think I am doing wrong?
soffice --headless --nologo --invisible --nofirststartwizard macro:///Standard.Module1.del doc0.docx
You noticed that correctly: you cannot use macros with --headless like that.
In your case, it would be easier to use normal mode, and modify the macro to open, process, and close files in a loop. This way, LibreOffice will not need to open and close repeatedly.
Thanks for the insight. That’s quite unfortunate, since I need to process at the very least 100k documents. Without headless it will take an eternity. As an aside, can you think of any faster way to detect the last word on the first page, OR simply delete all but the first page? I’ve been losing my mind over this problem for three days now, to no avail. Sorry for the bother.
I don’t see why a macro like the one I described above couldn’t handle 100k documents just as efficiently. Or you can control headless LibreOffice from an external script, e.g. in Python.
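For the external-script option, here is a hedged sketch using the Python-UNO bridge. It assumes LibreOffice was started separately with something like `soffice --headless --accept="socket,host=localhost,port=2002;urp;"` and that the script runs with the Python that ships with LibreOffice (so that `import uno` works). The port, the helper names and the folder layout are assumptions. Whether the view cursor behaves the same way in headless mode has not been verified here, and the caveat about documents where the page boundary falls inside a table still applies.

```python
from pathlib import Path

# Assumed connection string; must match the --accept option used for soffice.
UNO_URL = "uno:socket,host=localhost,port=2002;urp;StarOffice.ComponentContext"


def plan_jobs(sources, out_dir: Path):
    """Pair every source file with a target of the same name in out_dir."""
    return [(src, out_dir / src.name) for src in sources]


def connect(uno_url: str = UNO_URL):
    """Return the Desktop of an already running LibreOffice instance."""
    import uno  # only available in a LibreOffice-enabled Python
    local_ctx = uno.getComponentContext()
    resolver = local_ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.bridge.UnoUrlResolver", local_ctx)
    ctx = resolver.resolve(uno_url)
    return ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.frame.Desktop", ctx)


def prop(name, value):
    """Build a com.sun.star.beans.PropertyValue."""
    import uno  # noqa: F401  (enables the com.sun.star imports)
    from com.sun.star.beans import PropertyValue
    p = PropertyValue()
    p.Name = name
    p.Value = value
    return p


def keep_first_page(desktop, source: Path, target: Path) -> None:
    """Same steps as the Basic macro: delete from page 2 to the end, save a copy."""
    doc = desktop.loadComponentFromURL(
        source.absolute().as_uri(), "_blank", 0, (prop("Hidden", True),))
    try:
        cursor = doc.CurrentController.getViewCursor()
        if cursor.jumpToNextPage():
            cursor.gotoEnd(True)   # extend the selection to the document end
            cursor.setString("")   # replace the selection with nothing
        doc.storeToURL(target.absolute().as_uri(),
                       (prop("FilterName", "MS Word 2007 XML"),))
    finally:
        doc.close(True)


def main(in_dir: Path, out_dir: Path) -> None:
    """Process every .docx in in_dir using one long-lived connection."""
    out_dir.mkdir(parents=True, exist_ok=True)
    desktop = connect()
    for source, target in plan_jobs(sorted(in_dir.glob("*.docx")), out_dir):
        keep_first_page(desktop, source, target)
```

Because a single LibreOffice process serves the whole loop, the per-file startup cost that made the one-command-per-file approach slow should not recur.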
I see. Since I’ve only seen the macro run on one file at a time, I assumed a loop would take about as long per file. I will try this approach and let you know how it goes. Thanks for the help.
In addition, you can open Word documents in hidden mode, which will not create new windows.
You also need to consider that documents may be in a change-tracking state. In this case, you must end tracking before deleting pages.
For some reason the macro isn’t running at all in --hidden mode. Maybe the cursor commands only work in full GUI mode? This command seems to have no effect:
soffice --hidden macro:///Standard.Module1.del doc0.docx
For example, let’s transform oldDoc.docx → newDoc.docx
Sub ToOnePage
    Dim doc, cursor, pageCount As Long
    Dim props2(0) As New com.sun.star.beans.PropertyValue
    Dim props(0) As New com.sun.star.beans.PropertyValue
    ' Load the document hidden, so that no window is created
    props(0).Name = "Hidden"
    props(0).Value = True
    doc = StarDesktop.LoadComponentFromUrl(ConvertToUrl("C:\Temp\oldDoc.docx"), "_default", 0, props)
    ' Page count of the loaded document (not used further in this example)
    pageCount = doc.CurrentController.PageCount
    cursor = doc.CurrentController.getViewCursor()
    ' If there is a second page, select from its start to the end of the document and delete
    If cursor.jumpToNextPage() Then
        cursor.gotoEnd(True)
        cursor.setString("")
    End If
    ' Save a copy in DOCX format, then close the document
    props2(0).Name = "FilterName"
    props2(0).Value = "MS Word 2007 XML"
    doc.storeToURL(ConvertToUrl("C:\Temp\newDoc.docx"), props2)
    doc.close(True)
End Sub
I cannot even begin to explain how much agony you have saved me. Thank you so much for your help.