PDF to Word puts boxes around each line of text

ebbandari · March 17, 2018, 12:08am

Hi everyone,
We tried converting a PDF to Word format using LibreOffice 5.4.5. It worked, but every line of text, running from the left-hand side of the page to the right-hand side, has a box around it. Each box spans the full width of the page. We cannot have that, because we are trying to process the text and OpenOffice does not like the boxes.
Here is the command that we used:

/opt/libreoffice5.4/program/soffice --headless --infilter="writer_pdf_import" --convert-to doc someFile.pdf

The doc, docx, and rtf outputs all had the same problem.
Questions:

Are we missing a filter?
Is there a program that can post-process the output and get rid of the boxes?
We are going to try 6.0.2, but was there ever a version that did not produce the boxes?
Thank you for your help, Esfandiar

Kruno · March 17, 2018, 8:54am

If you are going to process the file with OpenOffice, why convert from PDF to DOC? Have you tried converting from PDF to ODT? You can save to DOC once you finish editing. It is also possible to select all the text and remove the lines with formatting. Do you have access to Word? Maybe you should try the conversion with Word. You are converting from PDF, which is a bad base for editing, to DOC, which is not LO’s native format, and you plan to edit the result with OpenOffice… Stick to LO and ODT while you edit.
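
A minimal sketch of the PDF to ODT conversion suggested here, reusing the command from the question with only the target format changed. The helper names `run` and `pdf_to_odt` and the `DRY_RUN` switch are hypothetical, added so the command can be inspected before it is executed. Whether the ODT output avoids the boxes has not been confirmed in this thread.

```shell
# Path taken from the command in the question; adjust for your installation.
SOFFICE="${SOFFICE:-/opt/libreoffice5.4/program/soffice}"

# Hypothetical helper: with DRY_RUN=1 print the command, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# pdf_to_odt is a hypothetical name, not a LibreOffice command.
# Same options as the original command; only the target format changes.
pdf_to_odt() {
    run "$SOFFICE" --headless --infilter="writer_pdf_import" --convert-to odt "$1"
}

# Usage: pdf_to_odt someFile.pdf
```

Running it with `DRY_RUN=1` set prints the full command line without starting LibreOffice.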

Grantler · March 17, 2018, 10:52am

LibreOffice cannot convert PDF text into plain body text, only into separate lines. If you use the clipboard, every line is inserted as a paragraph. You need a professional/proprietary OCR program such as ABBYY FineReader or IRIS. ABBYY FineReader has a Linux version, but you have to pay for it.

None of the free OCR programs I have come across was efficient.

AdmFubar · March 19, 2018, 7:24pm

If you are just after the text, in most PDF readers you can use Ctrl-A to select all the text, then copy it and paste it into LO.
You may also want to try the Zamzar file conversion site:
https://www.zamzar.com/

ebbandari · March 21, 2018, 12:36am

I am trying to do this automatically on a server, i.e., I have PDFs that need to be converted to docx.
I did not try converting it to odt or LO. Thanks @kruno. The odt server is very slow (about 4 seconds), so it is not a good option. Also, converting to LO will probably put boxes around each line again.
I think it is a bug that it does not use a ruler even though all the boxes are the same size!?!
Converting to RTF gives editable text, but finding the boxes to remove and then ending up with one ruler is the challenge.
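
Since the conversion is meant to run unattended on a server, one thing to try is passing several PDFs to a single soffice call instead of starting it once per file. This is only a sketch: it assumes that part of the time per file is start-up cost, which nobody in this thread has measured. The names `run`, `convert_batch`, `DRY_RUN` and the directory names in the usage line are hypothetical. It does nothing about the boxes themselves.

```shell
# Path taken from the command in the question; adjust for your installation.
SOFFICE="${SOFFICE:-/opt/libreoffice5.4/program/soffice}"

# Hypothetical helper: with DRY_RUN=1 print the command, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# convert_batch is a hypothetical name: convert_batch OUTDIR FILE...
# All input files go to one soffice process; results are written to OUTDIR.
convert_batch() {
    outdir="$1"
    shift
    run "$SOFFICE" --headless --infilter="writer_pdf_import" \
        --convert-to docx --outdir "$outdir" "$@"
}

# Usage: convert_batch converted incoming/*.pdf
```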

Xoristzatziki1 · May 26, 2018, 4:35am

You must understand that PDF files primarily contain positions and fonts for every chunk of text. PDF is not a text processing format but a graphical design format.

This is exactly why exporting to PDF from LibreOffice has an option for Hybrid PDF (embed the LO document), in case you want the PDF to be capable of being re-processed.

erik · February 19, 2020, 12:11pm

The best result I got was from running pdftohtml and then opening the resulting file in oowriter (using the File > Open dialog). Then I saved it as odt, doc or docx.

pdftohtml is part of the poppler-utils package on Fedora Linux.
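
A rough sketch of how this two-step route might be scripted. Several things here are assumptions rather than what was actually done above: the second step uses headless soffice instead of the oowriter File > Open dialog, the `HTML (StarWriter)` import filter name may need adjusting for your version, and the helper names `run` and `pdf_via_html` and the `DRY_RUN` switch are hypothetical.

```shell
# Assumes soffice is on PATH; set SOFFICE to a full path otherwise.
SOFFICE="${SOFFICE:-soffice}"

# Hypothetical helper: with DRY_RUN=1 print the command, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# pdf_via_html is a hypothetical name: pdf_via_html INPUT.pdf
pdf_via_html() {
    base="${1%.pdf}"
    # -noframes asks pdftohtml for a single HTML file instead of a frameset.
    run pdftohtml -noframes "$1" "$base.html" || return 1
    # The import filter name is an assumption; it is meant to make
    # LibreOffice load the HTML as a Writer text document.
    run "$SOFFICE" --headless --infilter="HTML (StarWriter)" \
        --convert-to odt "$base.html"
}

# Usage: pdf_via_html someFile.pdf
```

With `DRY_RUN=1` the function only prints the two commands, which is a cheap way to check the file names before running anything.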

You can also try Calibre, which is a GUI for managing and converting ebooks. It is also capable of converting to docx.

mariosv · February 19, 2020, 1:28pm

I don’t know if this is useful, but 6.4 has an option in Draw to consolidate the text of the selected PDF boxes into one text box:
Menu Shape > Consolidate Text.
Mainly, it avoids the need to copy the text from every text box.