
PDF to Word conversion puts boxes around each line of text

asked 2018-03-17 01:08:57 +0100

ebbandari

Hi everyone, we tried converting PDF to Word using LibreOffice 5.4.5. The conversion worked, but every line of text (that is, each line running from the left-hand side of the page to the right-hand side) has a box around it, and each box spans the full width of the page. We cannot have that, because we are trying to process the text and OpenOffice does not like the boxes. Here is the command that we used:

/opt/libreoffice5.4/program/soffice --headless --infilter="writer_pdf_import" --convert-to doc someFile.pdf

DOC, DOCX, and RTF all had the same problem. Questions: 1) Are we missing a filter? 2) Is there a program that can post-process the output and get rid of the boxes? 3) We are going to try 6.0.2, but was there ever a version that did not produce the boxes? Thank you for your help, Esfandiar


Comments

If you are going to process the file with OpenOffice, why convert PDF to DOC? Have you tried going from PDF to ODT? You can save to DOC once you finish editing. It is also possible to select all the text and remove the lines through formatting. Do you have access to Word? Maybe you should try the conversion with Word. You are converting from PDF, which is a bad base for editing, to DOC, which is not LO's native format, and you plan to edit the result with OpenOffice... Stick to LO and ODT while you edit.

Kruno (2018-03-17 09:54:26 +0100)
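
The PDF-to-ODT route suggested above could be scripted for a server roughly as follows. This is only a sketch: the `soffice` path and the `writer_pdf_import` filter come from the question, while the helper names `build_convert_command` and `convert` are made up for illustration, and whether ODT output avoids the boxes is untested here.

```python
import subprocess
from pathlib import Path

# Path taken from the question; adjust it for your installation.
SOFFICE = "/opt/libreoffice5.4/program/soffice"


def build_convert_command(pdf_path, target="odt", outdir=".", soffice=SOFFICE):
    """Return the soffice argument list for a headless PDF conversion."""
    return [
        soffice,
        "--headless",
        "--infilter=writer_pdf_import",  # same import filter as in the question
        "--convert-to", target,          # "odt" here instead of "doc"
        "--outdir", str(outdir),
        str(pdf_path),
    ]


def convert(pdf_path, target="odt", outdir="."):
    """Run the conversion and return the path where the output is expected."""
    subprocess.run(build_convert_command(pdf_path, target, outdir), check=True)
    return Path(outdir) / (Path(pdf_path).stem + "." + target)
```

Calling `convert("someFile.pdf")` would be expected to leave `someFile.odt` in the current directory. Building the argument list separately makes it easy to log or inspect the exact command before running it.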

LibreOffice cannot convert PDF text into flowing body text, only into separate lines. If you use the clipboard, every line is inserted as its own paragraph. You need a professional/proprietary OCR program like ABBYY FineReader or IRIS. ABBYY FineReader has a Linux version, but you have to pay for it.

None of the free OCR software I have tried was efficient.

Grantler (2018-03-17 11:52:35 +0100)
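
If the text does arrive with every line as its own paragraph, as described above, one possible post-processing step is to merge the lines back into body paragraphs. The sketch below is a heuristic, not anything LibreOffice provides: the function name `reflow` is hypothetical, and it assumes that an empty line marks a real paragraph break and that a trailing hyphen means a word continues on the next line.

```python
def reflow(lines):
    """Merge one-line paragraphs into body paragraphs.

    Assumptions (not guaranteed for every PDF): an empty line marks a real
    paragraph break, and a line ending in "-" continues a hyphenated word.
    """
    paragraphs = []
    current = ""
    for raw in lines:
        line = raw.strip()
        if not line:
            # Blank line: close the paragraph collected so far, if any.
            if current:
                paragraphs.append(current)
                current = ""
            continue
        if current.endswith("-"):
            # Rejoin a word that was split across two lines.
            current = current[:-1] + line
        elif current:
            current += " " + line
        else:
            current = line
    if current:
        paragraphs.append(current)
    return paragraphs
```

Note that the hyphen rule also joins genuine compounds that happen to break at a line end, so the result may need a manual check.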

If you are just after the text of the document, in most PDF readers you can use Ctrl-A to select all the text, then copy it and paste it into LO. You may also want to try the Zamzar file conversion site: https://www.zamzar.com/

AdmFubar (2018-03-19 20:24:04 +0100)

I am trying to do this automatically on a server, i.e., I have PDFs that need to be converted to DOCX. I did not try converting to ODT or LO. Thanks @Kruno. The ODT server is very slow (about 4 seconds), so it is not a good option. Also, if converting to LO, it will probably put boxes around each line again. I think it is a bug not to use a single ruler even though all the boxes are the same size!?!
Converting to RTF gives editable text, but finding the boxes to remove and then ending up with one ruler is the challenge.

ebbandari (2018-03-21 01:36:51 +0100)

1 Answer


answered 2018-05-26 06:35:11 +0100

Xoristzatziki

You must understand that PDF files primarily store positions and fonts for every chunk of text. PDF is not a text processing format but a graphical layout format.

This is exactly why exporting to PDF from LibreOffice offers a Hybrid PDF option (embed the LO document), in case you want to be able to re-process the PDF later.
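
To make the point about positions and fonts concrete, here is a deliberately simplified model. It is not a PDF parser and not how LibreOffice works internally: the `Chunk` class, the `group_into_lines` function, and the tolerance value are all assumptions for illustration. It shows that an importer only sees positioned pieces of text and has to guess which ones belong to the same line, which fits with the output being one box per line rather than flowing paragraphs.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """Simplified stand-in for one positioned piece of text on a PDF page."""
    x: float   # horizontal position
    y: float   # vertical position (PDF origin is bottom-left, so larger is higher)
    font: str
    text: str


def group_into_lines(chunks, tolerance=2.0):
    """Group chunks with nearly equal y into lines, ordered top to bottom."""
    lines = []
    for chunk in sorted(chunks, key=lambda c: (-c.y, c.x)):
        if lines and abs(lines[-1][0].y - chunk.y) <= tolerance:
            lines[-1].append(chunk)
        else:
            lines.append([chunk])
    # Within each line, order the chunks left to right before joining.
    return [
        " ".join(c.text for c in sorted(line, key=lambda c: c.x))
        for line in lines
    ]


chunks = [
    Chunk(72, 700, "Times", "Hello"),
    Chunk(110, 700.5, "Times", "world"),
    Chunk(72, 686, "Times", "Second line"),
]
print(group_into_lines(chunks))  # ['Hello world', 'Second line']
```

Nothing in this data says whether "Second line" continues the same paragraph or starts a new one. A Hybrid PDF avoids the guesswork because the original document is embedded alongside the page layout.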

