Can't open a docx in MS Office that was converted in LibreOffice

dklee · August 5, 2023, 5:40pm

Could you help me?

I can’t open the docx in MS Office (2016), but I can open it in LibreOffice.
The docx file was converted from a PDF with LibreOffice (7.5.5), and I used the Java org.libreoffice libraries, version 7.5.3.

I used the program DocumentConverter.java to convert it (from /api.libreoffice.org/examples/java/DocumentHandling/).

I changed only two points in DocumentConverter.java:
(1) Line 126:
    propertyValues[1] = new com.sun.star.beans.PropertyValue();
    propertyValues[1].Name = "FilterName";
    propertyValues[1].Value = "writer_pdf_Export";
(2) Line 137:
    xStorable.storeAsURL() → xStorable.storeToURL()

I am attaching two data files (conversion input: pdf, output: docx).
Thanks in advance.

TEST_DOC.pdf (53.2 KB)
TEST_DOC.docx (31.6 KB)

Zizi64 · August 6, 2023, 6:05am

This makes no sense.

Use the source document, not a PDF “printed” version.
Use the native ODF file format for your important documents.
The .docx compatibility will never be 100%.

LO is not a PDF editor. PDF files open in the Draw application (not in Writer). The text appears as graphical text boxes, not as a normal Writer text flow, and you can make only cosmetic changes to the text boxes of the PDF file.

EarnestAl · August 6, 2023, 10:59pm

You have Word 2016 which can convert pdf to docx, see How to Convert a PDF to Word in Microsoft Word (for Free - No Third Party Programs Needed)

If you are on Windows, you can use Windows built-in OCR to read images of text, see Optical Character Recognition (OCR) for Windows 10 - Windows Developer Blog

Or even a screen OCR tool, PowerToys Text Extractor utility for Windows | Microsoft Learn

Involving LibreOffice would be an unnecessary extra conversion in a process of .docx → pdf → docx.

Note that LibreOffice can create a hybrid PDF containing the original LibreOffice document inside the PDF, avoiding the entire process above.

dklee · August 7, 2023, 3:18pm

I didn’t post this just because I couldn’t process this one test file.
I intended to get this feature working along with other features of LibreOffice.

Will LibreOffice remove the ability to convert PDF to docx in the future?
(As you know, the test file was super simple: it has only text and nothing complex.)

anon87010807 · August 7, 2023, 6:17pm

Did you read the comments? You ordered LibreOffice to export a PDF file as PDF. Now you complain that it didn’t convert it into the docx format. Remember that a computer will always do exactly what you tell it to do, which is not always what you want it to do. I have been there, writing computer code that didn’t do what I wanted it to do on many occasions.
LibreOffice doesn’t even save PDF files to its native odt format, let alone to docx, and it probably never will.

dklee · August 10, 2023, 7:37am

Thanks for your reply.

Looking at the comments alone, one can’t help but wonder.

I was puzzled because many answers said to use writer_pdf_Export to do the PDF->DOCX conversion. (I’m sure you can easily find this in other answers as well; many of them do it.)

Using other methods (such as writer_pdf_import, etc.) for the conversion resulted in an error, and writer_pdf_Export was the only one that created the file.

If you know how to do this, I would appreciate the exact information.

P.S.
“LibreOffice doesn’t even save PDF files to its native odt format, let alone to docx, and it probably never will.” ==> I disagree with you on this one. It’s a weird phenomenon, which is why I made the post, isn’t it?
Are you responding because you tried it yourself?

mikekaganski · August 10, 2023, 8:13am

I do not find them. If you have some references, please mention them.

Indeed. You use the filter in the storeToURL call, which implies that only export filters would work here. If you want to import a PDF into Writer, you have to use an import filter (which is writer_pdf_import in this case), e.g. in the loadComponentFromURL call.

Generally: there is no “PDF->DOCX conversion” in LibreOffice. There are two operations:

1. Import from some file into some component (e.g., into Writer, or Draw, or Calc);
2. Export from that component into some file.

The first operation uses some import filter. The filter may be auto-detected (e.g., when you do not define it explicitly); in case of PDF, the default is draw_pdf_import, and will load the graphical objects to Draw; or you may define the import filter explicitly, and then you can control which component your file data arrives into.

The second operation uses some export filter. The repertoire of filters depends on the component. You may get an idea of what is available if you create a new document in the respective component and check the filter lists of the Save As and Export dialogs.
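To make the two operations concrete, here is a small sketch in plain Java (no UNO calls) that only picks the filter name for each step. The class `FilterPlan` and its methods are hypothetical names for illustration; `writer8` is assumed here to be the ODT export filter name, and the other filter names are the ones mentioned in this thread.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper: models a "conversion" as an import step plus an export step.
final class FilterPlan {
    // Export filter names for a document that lives in Writer.
    // "writer8" (ODT) is an assumption; the others appear in this thread.
    static final Map<String, String> WRITER_EXPORT_FILTERS = Map.of(
        "docx", "MS Word 2007 XML",
        "odt", "writer8",
        "pdf", "writer_pdf_Export");

    // Step 1: FilterName for the load call.
    // An empty result means "do not set FilterName, rely on auto-detection".
    static Optional<String> importFilterFor(String sourceExtension) {
        if ("pdf".equalsIgnoreCase(sourceExtension)) {
            // Auto-detection would send a PDF to Draw, so name the Writer import filter.
            return Optional.of("writer_pdf_import");
        }
        return Optional.empty();
    }

    // Step 2: FilterName for the store call.
    static String exportFilterFor(String targetExtension) {
        String filter = WRITER_EXPORT_FILTERS.get(targetExtension.toLowerCase(Locale.ROOT));
        if (filter == null) {
            throw new IllegalArgumentException("No known Writer export filter for: " + targetExtension);
        }
        return filter;
    }

    public static void main(String[] args) {
        System.out.println("load  with FilterName = " + importFilterFor("pdf").orElse("(auto-detect)"));
        System.out.println("store with FilterName = " + exportFilterFor("docx"));
    }
}
```

Passing `writer_pdf_Export` to the store call is therefore only the right choice when the target file is meant to be a PDF.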

dklee · August 10, 2023, 3:15pm

Thanks for your reply.
I will understand what you said as your final state.

Here is my answer to your question.
Answers that said the conversion was possible (there are more, I’ll list only 2):
=> docker - How to convert pdf to docx in libreoffice 6.4? - Stack Overflow
=> command line - Convert PDF to Word Using Libreoffice in terminal - Ask Ubuntu
=> Some replies use …import, the others use …Export.

I also found this now.
Libreoffice convert-to not working.
LibreOffice used to convert Pdf to various Word Formats (doc, docx, rtf),
but it no longer seems to work!
=> Can I convert a PDF file to a Word File? - #11 by Shineona
(See ebbandari’s thread)

I wish I had gotten a more direct answer.
There might be people who wander like me without knowing the history of LibreOffice.

mikekaganski · August 10, 2023, 3:32pm

@dklee:

I assume that you answered the “If you have some references, please mention them” “question”.

In all three references you provided, there is not a single mention of the “writer_pdf_Export”. In all of them, the import filter was used together with the --infilter command line switch, which is used to define how to import the file.

I do not understand your “I will understand what you said as your final state”. But I suggest you re-read my comment and try to comprehend the process. You may open a PDF and save it as a Writer document (like DOCX or ODT), and you are free to call that “conversion”; but I tried to explain to you which two processes are in play. Unless you understand them, you will use a trial-and-error approach, misinterpreting what you read in some correct answers here and there.

dklee · August 10, 2023, 4:41pm

Let’s focus on what I asked.
It was about the conversion of PDF->DOCX, not writer_pdf_Export.
~import resulted in an error (no file produced), but ~Export produced the TEST_DOC.docx I posted above.
That’s it. Anything would be good.

At this point, I think we need to talk about the above ebbandari’s thread.
Isn’t that a conclusion based on whether it’s true or not?

@mikekaganski I do not understand your “I will understand what you said as your final state”
This means LibreOffice used to provide the conversion, but does not any more.
I understand this as the current final state.

I was suddenly asked for an example of …Export, so I couldn’t find one easily, though this one is a case on Mac.
=> Shift to PDF as preview file format : CafeTran

anon87010807 · August 10, 2023, 4:50pm

@dklee: Your link to Shift to PDF as preview file format : CafeTran is a prime example of the opposite of what you want. It’s about exporting in PDF format, not about importing PDF and exporting it as odt or docx.

dklee · August 10, 2023, 4:53pm

Sorry, I was looking for something else in a hurry.
(Now that I think about it, I think I might not have seen it as well then as I do now.)

The final state of PDF->DOCX conversion is known as a function that does not work as described above.

You did a really good job.
Thank you.

mikekaganski · August 10, 2023, 4:57pm

Sigh.

LibreOffice can import PDF to Writer. Then it can export to DOCX. You don’t try to understand what you are told.

In command line:

    soffice --infilter=writer_pdf_import --convert-to docx --outdir path/to/output_dir path/to/source.pdf

In Java code:

    // Loading the PDF into Writer: the import filter is named explicitly,
    // because auto-detection would open the PDF in Draw.
    com.sun.star.beans.PropertyValue propertyValues[] =
        new com.sun.star.beans.PropertyValue[2];
    // Loading the document without a visible window
    propertyValues[0] = new com.sun.star.beans.PropertyValue();
    propertyValues[0].Name = "Hidden";
    propertyValues[0].Value = Boolean.TRUE;
    // Setting the import filter name
    propertyValues[1] = new com.sun.star.beans.PropertyValue();
    propertyValues[1].Name = "FilterName";
    propertyValues[1].Value = "writer_pdf_import";
    Object oDocToStore =
        DocumentConverter.xCompLoader.loadComponentFromURL(
            sUrl, "_blank", 0, propertyValues);
    // Getting an object that will offer a simple way to store
    // a document to a URL.
    com.sun.star.frame.XStorable xStorable =
        UnoRuntime.queryInterface(
            com.sun.star.frame.XStorable.class, oDocToStore );
    // Preparing properties for converting the document
    propertyValues = new com.sun.star.beans.PropertyValue[2];
    // Setting the flag for overwriting
    propertyValues[0] = new com.sun.star.beans.PropertyValue();
    propertyValues[0].Name = "Overwrite";
    propertyValues[0].Value = Boolean.TRUE;
    // Setting the filter name
    propertyValues[1] = new com.sun.star.beans.PropertyValue();
    propertyValues[1].Name = "FilterName";
    propertyValues[1].Value = "MS Word 2007 XML";
    // Appending the favoured extension to the origin document name
    int index1 = sUrl.lastIndexOf('/');
    int index2 = sUrl.lastIndexOf('.');
    String sStoreUrl = sOutUrl + sUrl.substring(index1, index2 + 1)
        + DocumentConverter.sExtension;
    // Storing and converting the document
    xStorable.storeToURL(sStoreUrl, propertyValues);
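
The command-line variant can also be driven from Java without any UNO code. This is only a rough sketch, assuming Java 11+ and that `soffice` is on the PATH; the class `SofficeConvert` and its methods are hypothetical names, and the switches are exactly those of the command line above.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical wrapper around the soffice command line shown above.
final class SofficeConvert {
    // Builds the argument list: import the PDF into Writer, then export as docx.
    static List<String> buildCommand(String sofficeBinary, String sourcePdf, String outputDir) {
        return List.of(
            sofficeBinary,
            "--infilter=writer_pdf_import", // import step: load the PDF into Writer
            "--convert-to", "docx",         // export step: target format
            "--outdir", outputDir,
            sourcePdf);
    }

    // Runs the command and returns the exit code of the soffice process.
    static int run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command).inheritIO().start();
        return process.waitFor();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: SofficeConvert <source.pdf> <output_dir>");
            return;
        }
        List<String> command = buildCommand("soffice", args[0], args[1]);
        System.out.println(String.join(" ", command));
        System.exit(run(command));
    }
}
```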

dklee · August 12, 2023, 6:42am

Thank you so much. It runs perfectly.

I’m hesitant to ask about this, but since the quality of the result is like this (attached), I’d like to ask whether there is a way to improve it, or a good guide.

TEST_DOC.docx (8.0 KB)

Thank you again.

ajlittoz · August 12, 2023, 7:03am

Your TEST_DOC.docx is still far from being a text document. It contains one empty paragraph and many drawing shapes (text boxes) independent from each other (and some overlap). Since the font you used is not installed on my computer, text is displayed in a substitute font and overflows the limits of the boxes. As is frequent in PDF “conversion”, every line ended up as a text box. Such a result is unusable for contents editing.

The only reliable method is to open your original PDF in a PDF reader. Copy the contents as text (there are two copy modes in usual PDF readers: graphical and text; be sure to select text mode), then paste into Writer or any other editor. I’d recommend pasting as unformatted text. After that, rebuild the paragraph structure by eliminating the irrelevant paragraph marks at the end of nearly all lines.

I know you’d like to do this with a command line but this is not possible if you really intend to get a reliable result. Otherwise, find an OCR program.

ajlittoz · August 5, 2023, 6:20pm

Although you named your document with a .docx extension, it is in no way a “text” document. When I launch it, it opens in Draw, meaning it is a graphical file.

Inspecting it with a binary editor, I see that it starts with %PDF 1.6.

Conclusion: the data has not been converted. It may just have been compressed a bit because there are huge differences in contents. Your original .pdf is %PDF 1.5 with many more sub-elements. Your “converter” probably took the original pages to merge them in a single (graphical?) block though I see one text block per line.
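
The same check can be done without a binary editor. Here is a minimal sketch, assuming Java 11+; the class `SniffFormat` is a hypothetical name. It relies on two facts: a PDF file starts with the bytes `%PDF`, and a real .docx is a ZIP container, which starts with `PK` followed by the bytes 3 and 4.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: guesses the real container format from the first bytes.
final class SniffFormat {
    // Classifies the leading bytes of a file as "pdf", "zip" or "unknown".
    static String sniff(byte[] head) {
        // ISO-8859-1 maps every byte to exactly one char, so no byte is lost.
        String start = new String(head, StandardCharsets.ISO_8859_1);
        if (start.startsWith("%PDF")) {
            return "pdf";
        }
        if (start.startsWith("PK\003\004")) {
            return "zip"; // DOCX and ODT files are ZIP containers
        }
        return "unknown";
    }

    // Reads only the first few bytes of the file and classifies them.
    static String sniff(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            return sniff(in.readNBytes(8));
        }
    }

    public static void main(String[] args) throws IOException {
        for (String name : args) {
            System.out.println(name + ": " + sniff(Path.of(name)));
        }
    }
}
```

A file named TEST_DOC.docx that is reported as `pdf` was exported with a PDF filter, whatever its extension says.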

Wanderer · August 6, 2023, 12:17am

as the selected filter is writer_pdf_Export I’d say this “conversion” is done as ordered…

ajlittoz · August 6, 2023, 7:41am

@Zizi64: good point. As I don’t practice macros, I didn’t read the code and relied exclusively on the title (assuming the requester had programmed it correctly). I supposed the goal was PDF→ DOCX, which was not achieved.

Anyway, it demonstrates once again that changing a file extension can lure the OS into launching some application but has no effect on the contents. And when the contents are not what is expected, the application rejects them.

The correct answer would be: use an OCR program for an automated conversion, or copy from a PDF reader and paste into Writer for a manual transformation.

anon87010807 · August 6, 2023, 10:36pm

Kind of off-topic, unless you can do a lot more programming: it was easy to open the original PDF in Firefox, select and copy all text to the clipboard and paste it into a blank Writer document, then run Tools - AutoCorrect - Apply to get rid of most wrong paragraph breaks. Of course, if the PDF has the select all and copy functionality disabled, it won’t work.
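
For those who do want to script the clean-up of pasted text, the removal of wrong paragraph breaks can be approximated in a few lines. This is only a heuristic sketch, assuming Java 11+ and not a reimplementation of AutoCorrect: it assumes that a line ending in `.`, `!`, `?` or `:` closes a paragraph and that every other line break is a wrap. The class `Unwrap` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: merges hard-wrapped lines from pasted PDF text into paragraphs.
final class Unwrap {
    static List<String> joinWrappedLines(List<String> lines) {
        List<String> paragraphs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String raw : lines) {
            String line = raw.strip();
            if (line.isEmpty()) {
                // A blank line always ends the current paragraph.
                flush(current, paragraphs);
                continue;
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(line);
            // Assumption: sentence-ending punctuation marks a real paragraph end.
            if (endsSentence(line)) {
                flush(current, paragraphs);
            }
        }
        flush(current, paragraphs);
        return paragraphs;
    }

    private static void flush(StringBuilder current, List<String> out) {
        if (current.length() > 0) {
            out.add(current.toString());
            current.setLength(0);
        }
    }

    private static boolean endsSentence(String line) {
        char last = line.charAt(line.length() - 1);
        return last == '.' || last == '!' || last == '?' || last == ':';
    }

    public static void main(String[] args) {
        List<String> pasted = List.of(
            "The docx file is converted", "from pdf on LibreOffice.",
            "Thanks in", "advance.");
        joinWrappedLines(pasted).forEach(System.out::println);
    }
}
```

Headings and list items without final punctuation get merged into the following text under this assumption, so the result still needs a manual check.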