Macro to convert html to pdf - while clearing beginning and end of html doc

sherab · January 10, 2021, 3:07pm

I have 100+ HTML docs downloaded with Firefox and saved as complete web pages.

I want to save them as PDF. (Using the Firefox add-on "WE edit", the page can be saved as PDF, but the result is blank.)
Saving as a plain HTML file saves the page without the pictures.
Saving as a complete web page, opening it in Writer and exporting to PDF works well.
The HTML has three parts:

Beginning: starts at the first line of the doc and ends with a specific text line (this repeats on all 100+ web pages)
Content
End: starts at a specific word and ends at the last line of the doc

Here’s the macro I’d like to have:
The HTML opens in Writer
The Content is extracted
The doc is exported as pdf with the name of the original html
The new html is saved as docx

I am not a programmer, so I do not know how complicated this is, but any advice, even to automate just part of the process, will be greatly appreciated.
Thanx

Lupp · January 10, 2021, 4:54pm

-1- Use a SearchDescriptor for the loaded document to find the “specific text” ending the intro.
-2- Select upwards to the very beginning.
-3- Delete selection.
-4- Use the SearchDescriptor to find the keyword for the lead-out.
-5- Select to end of text.
-6- Delete selection.
-7- Export to .pdf by .storeToURL(). (FilterName = “writer_pdf_Export”)
-8- Export to .docx by .storeToURL(). (FilterName = “writer_OOXML” I suppose. I never use it.)
-9- Close the document.
-10- Delete the old file if wanted, unless it has the same URL as the new one.
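
For steps 7, 8 and 10, the export targets can be derived from the original file name. The following is a minimal Python sketch, not part of any macro in this thread; the function names export_targets and overwrites_original are made up for illustration. It only computes paths and does not call LibreOffice.

```python
from pathlib import Path


def export_targets(html_file):
    """Return the .pdf and .docx paths that sit next to the original HTML file."""
    src = Path(html_file)
    # with_suffix() swaps only the last extension, so "page.html" -> "page.pdf".
    return src.with_suffix(".pdf"), src.with_suffix(".docx")


def overwrites_original(html_file):
    """True if one of the export paths is identical to the source path."""
    src = Path(html_file)
    return src in export_targets(src)


pdf_path, docx_path = export_targets("pages/article-01.html")
print(pdf_path.name, docx_path.name)
```

Because the exports carry a different extension than the .html source, overwrites_original returns False for such input, so deleting the old file in step 10 stays a separate, optional action.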

(Generally I dissuade from using hostile file-formats.)

You may try to record a macro for the sequence of actions. You will need to rework it manually, I’m afraid.

sherab · January 10, 2021, 8:15pm

Hi @Lupp
Many thanks for your detailed answer. I didn’t really expect such a rapid reply from anyone.

As I am not a coder, I am not sure about your answer. Where do I find the SearchDescriptor?
Do I need to write a script??
Is it too difficult for a beginner??
Thank you again

Lupp · January 10, 2021, 9:53pm

Is it too difficult for a beginner?

It depends. Some may try and never succeed. Somebody may study (say) the texts by Andrew Pitonyak, and solve the thing the next day. (Kidding a bit.)
Without a certain amount of experience in programming and in using an API, not to speak of the very special one available for LibreOffice, I would advise against trying. You might end up wasting a lot of time.
You wouldn’t only need to do the specific kind of internal “document automation”, but also to organize the selection of the files to work on, and this should be the part where the macro-recorder won’t help you much.

Everything isn’t exactly difficult, but …
In particular, the “FilterData” for PDF export may be problematic. I lack experience there myself.
If you can hire a schoolboy feeling somehow bored … but being steady enough …

sherab · January 10, 2021, 10:09pm

It’ll probably be easier if I just do it manually over a number of days.
Thank you for the advice.
Happy new year

Lupp · January 10, 2021, 11:30pm

If you think you would actually need a “number of days”, it might pay to find somebody who completes and “certifies” the code contained in the attached file for you, if you don’t feel capable of doing it yourself.
Parts of that code are tested; parts are slightly reworked versions of recorded macros.

A relevant function is missing because a solution workable in an unknown environment seemed too complicated to do it “just so”.
clipAndExportTwice.odt
Basically your problem is highly specialized. What I could provide as an “answer” you find in my first comment. Beyond that it’s “development”.

KamilLanda · July 3, 2021, 6:47pm

Maybe you can do it with the LibreOffice command-line options. I suppose your OS is Windows.

To extract the Content of the files, you can use Regex Batch Replacer 1.0.
But make a backup before you use this program.

Two steps in Regex Batch Replacer:

Step 1
find what: .+?YourStartText
replace with: (leave empty, or enter YourStartText to keep the marker)

Step 2
find what: YourEndText.+
replace with: (leave empty, or enter YourEndText to keep the marker)

Double-check the directory the program will work on! Regex Batch Replacer is a very quick program and does not forgive a bad choice of directory: it can rewrite all files in the directory in a few seconds, without mercy.
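
The two find/replace steps above can also be tried out in plain Python on a copy of one file before touching the whole directory. This is only a sketch: clip_html, START and END are placeholder names, and it assumes each marker occurs literally in the file. Unlike the patterns above, it anchors the first pattern at the start of the text and escapes the markers, so characters such as "." or "(" inside them are matched literally.

```python
import re


def clip_html(html, start_text, end_text):
    """Keep only what lies between start_text and end_text (markers removed)."""
    # Step 1: everything from the start of the text through the first start marker.
    head = re.compile(r"\A.*?" + re.escape(start_text), re.DOTALL)
    # Step 2: everything from the first remaining end marker to the end of the text.
    tail = re.compile(re.escape(end_text) + r".*\Z", re.DOTALL)
    html = head.sub("", html, count=1)
    html = tail.sub("", html, count=1)
    return html


sample = "<html><nav>menu</nav>START<p>Body</p>END<footer>x</footer></html>"
print(clip_html(sample, "START", "END"))
```

With the sample string this prints <p>Body</p>. If a marker is missing, the corresponding step leaves the text unchanged.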

Then open the Command line: press Win+R and type cmd.
Switch to the directory with the files you want to convert. For example, if the directory is d:\myfiles, type the following lines, pressing Enter after each one:

d:

cd myfiles

Then type the next lines at the Command line, again pressing Enter after each one:

for %f in (*.odt) do (

start /wait "" "C:\Program Files\LibreOffice\program\soffice.exe" --headless --convert-to pdf "%f"

)

This converts your ODT files to PDF. It is slow, but it seems to work.
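
The same loop could be driven from Python instead of the Command line. A hedged sketch: the soffice path is copied from the command above and may differ on your machine, build_command and convert_all are hypothetical helper names, and only the --headless and --convert-to options quoted above are used.

```python
import subprocess
from pathlib import Path

# Assumed install location, copied from the command above; adjust for your system.
SOFFICE = r"C:\Program Files\LibreOffice\program\soffice.exe"


def build_command(source, target_format="pdf", soffice=SOFFICE):
    """Build the argument list for one headless conversion."""
    return [soffice, "--headless", "--convert-to", target_format, str(source)]


def convert_all(folder, pattern="*.odt", target_format="pdf"):
    """Convert every matching file in folder, one soffice run per file."""
    for source in sorted(Path(folder).glob(pattern)):
        # Run inside the folder and pass only the file name, as the cmd loop does.
        subprocess.run(build_command(source.name, target_format), cwd=folder, check=True)
```

Calling convert_all(r"d:\myfiles") would mirror the loop above. subprocess.run waits for each launched process to exit before the next one starts, which plays the role of start /wait. Passing "*.html" as the pattern or "docx" as the target format would be the place to experiment with the HTML and DOCX cases, which are untested here.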

I believe you can work out the conversion from HTML (or to DOCX) this way on your own. I hope it is possible.