PDF to Word puts boxes around each line of text
Hi everyone, we tried converting PDF to Word using LibreOffice 5.4.5. It worked, but every line (that is, a line of text running from the left-hand side of the page to the right-hand side) has a box around it! Each box spans the page from one side to the other. We cannot have that, because we are trying to process the text and OpenOffice does not like the boxes. Here is the command that we used:
/opt/libreoffice5.4/program/soffice --headless --infilter="writer_pdf_import" --convert-to doc someFile.pdf
DOC, DOCX, and RTF all had the same problem. Questions: 1) Are we missing a filter? 2) Is there a program that can post-process the output and get rid of the boxes? 3) We are going to try 6.0.2, but was there ever a version that did not produce the boxes? Thank you for your help, Esfandiar
If you are going to process the file with OpenOffice, why convert PDF to DOC? Have you tried converting from PDF to ODT? You can save to DOC once you finish editing. It is also possible to select all the text and remove the lines with formatting. Do you have access to Word? Maybe you should try the conversion with Word. You are converting from PDF, which is a bad base for editing, to DOC, which is not LO's native format, and you plan to edit the result with OpenOffice... Stick to LO and ODT while you edit.
LibreOffice cannot convert PDF text into plain body text, only into separate lines. If you use the clipboard, every line is inserted as its own paragraph. You need a professional/proprietary OCR program like ABBYY FineReader or IRIS. ABBYY FineReader has a Linux version, but you have to pay for it.
None of the free OCR software I have tried worked well.
If you are just after the text of the document, in most PDF readers you can press Ctrl+A to select all the text, then copy it and paste it into LO. You may also want to try the Zamzar file conversion site: https://www.zamzar.com/
I am trying to do this automatically on a server, i.e., I have PDFs that I need to convert to DOCX. I did not try converting to ODT or LO. Thanks @Kruno. The odt server is very slow (about 4 seconds), so it is not a good option. Also, if converting to LO, it will probably put boxes around each line again. I think it is a bug not to use a ruler even though all the boxes are the same size!?!
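For the automated server case, here is a minimal shell sketch that runs the same conversion over a whole directory. It assumes the soffice path and import filter from the original command. The helper name convert_pdfs, its directory arguments, and the SOFFICE override variable are made up for illustration. It only automates the conversion; it does nothing about the boxes.

```shell
#!/bin/sh
# Sketch: batch-convert every PDF in a directory with headless LibreOffice.
# SOFFICE defaults to the path from the original post; override it if needed.
SOFFICE="${SOFFICE:-/opt/libreoffice5.4/program/soffice}"

# convert_pdfs IN_DIR OUT_DIR [FORMAT]
# Hypothetical helper; FORMAT defaults to docx (doc, rtf, odt also possible).
convert_pdfs() {
    in_dir=$1
    out_dir=$2
    format=${3:-docx}
    mkdir -p "$out_dir" || return 1
    for pdf in "$in_dir"/*.pdf; do
        # Skip the unexpanded pattern when the directory has no PDFs.
        [ -e "$pdf" ] || continue
        "$SOFFICE" --headless --infilter="writer_pdf_import" \
            --convert-to "$format" --outdir "$out_dir" "$pdf" \
            || echo "conversion failed: $pdf" >&2
    done
}
```

Usage would be something like `convert_pdfs /path/to/pdfs /path/to/out docx` (both paths are placeholders). soffice also accepts several input files in one call, which may help if the per-file startup time is what makes it slow.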
Converting to RTF gives editable text, but finding the boxes to remove and then having one ruler is the challenge.
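A possible first step toward finding the boxes is to check how they are encoded in the RTF output. This is a rough sketch: it assumes the boxes show up as paragraph borders (\box, \brdr...), as drawing shapes (\shp), or as positioned frames (\absw). I have not verified which of these LibreOffice actually writes. The function name is hypothetical, and it relies on grep -o, which GNU and BSD grep both support.

```shell
#!/bin/sh
# Sketch: count RTF control words that could be drawing the boxes.
#   \box, \brdr*  paragraph borders
#   \shp          drawing shapes (text boxes)
#   \absw         width of a positioned frame
# count_rtf_box_markers FILE   (hypothetical helper name)
count_rtf_box_markers() {
    grep -oE '\\(box|brdr[a-z]+|shp|absw)' "$1" | sort | uniq -c | sort -rn
}
```

If one of the counts roughly matches the number of text lines in the document, that control word is a likely candidate for whatever post-processing removes the boxes.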