Turn PDF into editable/searchable document/ebook

asked 2017-08-03

appreciatethehelp

updated 2017-08-03

Hey guys n' gals. I recently started researching how to convert paperback books into electronic books so that I can search them for key terms, or even annotate and edit them. Along the way I came across a claim that Microsoft Word can do this, which got me wondering: can LibreOffice do it too? If so, can anyone recommend a tutorial or walkthrough of the process (or, for a bazillion brownie points, write one up)?

If LibreOffice can do it, but other software gives better results (more accurate OCR, higher-fidelity text), could someone in the know please recommend it?

If not, I'll probably just use Google Drive, but it would be great to compare the Google Drive process with the LibreOffice one to see which is quicker and easier, and what the pros and cons of each are.

What is your starting point: an actual paper book, or a PDF? If it is a PDF, is it a plain "scan to PDF", which is effectively just an image, or has the scanned book already been converted to text via OCR?

robleyd ( 2017-08-03 ):

If your PDF file contains a text layer, you can import it to LibreOffice. But it will be handled by Draw, not by Writer. See my answer below.

gabix ( 2017-08-03 ):

@robleyd My starting point is a paperback book. I haven't scanned it to PDF yet, nor have I converted said images to text via OCR software. I have read that scanning the pages to PDF and then converting them to text with OCR software is the best way to do it (it may well be the only way, as far as I know).

So, to be clear, I am not searching for information to help fill in a process I have already started, rather I am searching for recommendations from those who have...

appreciatethehelp ( 2017-08-03 ):

…had experience doing this as to the method (from start to finish) that produces the most accurate, quality results (which I prioritize over the fastest way).

appreciatethehelp ( 2017-08-03 ):

Well, I know of no method other than carefully reading and correcting the text. I've outlined the three major steps in my answer, but the details will vary depending on the tools available to you. The simplest route is:

  1. Scan and recognize the text.

  2. Export to a word processor format (recent versions of ABBYY FineReader can produce ODT files).

  3. Load the file into LibreOffice, check the spelling and correct OCR errors (OOoFBTools will help with this).

  4. Export to FB2.

Anyway, it will be quite time-consuming.
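Part of the correction work in step 3 is mechanical and can be done before the manual proofreading pass. The following is a minimal Python sketch, assuming the OCR output has first been saved as plain text; the function name `clean_ocr_text` is hypothetical, and this is not what OOoFBTools does internally. It rejoins words that were hyphenated across line breaks and unwraps hard-wrapped lines into paragraphs.

```python
import re


def clean_ocr_text(raw: str) -> str:
    """Tidy plain-text OCR output before manual proofreading (illustrative only)."""
    # Normalise Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")

    # Rejoin words split by a hyphen at the end of a line: "recog-\nnition".
    # Letters only, so numeric ranges such as "1990-\n1995" keep their hyphen.
    # Caveat: a genuine compound broken at a line end ("well-\nknown") is also
    # joined, which is one reason a human read-through is still needed.
    text = re.sub(r"([^\W\d_])-\n([^\W\d_])", r"\1\2", text)

    # Treat one or more blank lines as a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text)

    cleaned = []
    for paragraph in paragraphs:
        # Unwrap the hard line breaks inside a paragraph.
        joined = " ".join(line.strip() for line in paragraph.split("\n"))
        # Collapse runs of whitespace left over from the scan layout.
        joined = re.sub(r"\s{2,}", " ", joined).strip()
        if joined:
            cleaned.append(joined)

    return "\n\n".join(cleaned)
```

For example, `clean_ocr_text("The recog-\nnition step is\nnever perfect.")` returns `"The recognition step is never perfect."`. Misrecognized characters still have to be caught by spell checking and careful reading.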

gabix ( 2017-08-04 ):

answered 2017-08-03

gabix

What you are asking about is actually several different tasks.

First, scanning. Refer to your scanner manual and accompanying software.

Second, running OCR on the scans to produce editable text. Here, I second Mike Kaganski: ABBYY FineReader seems to be the best solution.

Third, converting to e-books. Of course, you can stay with word processor formats such as ODT or DOCX, but a dedicated e-book format is a better choice. I strongly recommend using OOoFBTools. This extension has a load of features for correcting OCR errors and brushing up formatting in order to produce a nice e-book in the FictionBook2 format. You can also use Calibre, an excellent program for e-library management and format conversion.
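If you go the Calibre route, the conversion can also be scripted, because Calibre ships a command-line tool called `ebook-convert` that takes an input file and an output file and picks the formats from the extensions. Below is a small Python sketch, assuming Calibre is installed and `ebook-convert` is on your PATH; the helper names and the file names `book.odt` and `book.epub` are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(source: Path, target: Path) -> list[str]:
    """Return the ebook-convert command line for one conversion."""
    # ebook-convert infers the input and output formats from the extensions.
    return ["ebook-convert", str(source), str(target)]


def convert_ebook(source: Path, target: Path) -> None:
    """Run the conversion and raise if Calibre is missing or the run fails."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH; is Calibre installed?")
    subprocess.run(build_convert_command(source, target), check=True)


# Example call (hypothetical file names):
# convert_ebook(Path("book.odt"), Path("book.epub"))
```

Keeping the command construction in its own function makes it easy to check what will be executed before running it over a whole folder of books.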


answered 2017-08-03

You seem to be referring to Microsoft Office Document Imaging, an OCR tool that was bundled with MS Office up to the 2007 edition but was removed in 2010 and has been absent since.

LibreOffice doesn't bundle any OCR. Personally, I would recommend a commercial solution from ABBYY: I have worked with it and was pleased with its fidelity. There are a number of free solutions as well.
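As one illustration of the free route (the answer above does not name a specific tool, so this choice is an assumption), the open-source Tesseract engine can turn a scanned page image into a searchable PDF with a command of the form `tesseract page.png page -l eng pdf`. The Python sketch below builds one such command per scanned page; the helper names and the `page_001.png` style file names are hypothetical, and the Tesseract binary and language data are assumed to be installed.

```python
import subprocess
from pathlib import Path


def tesseract_commands(images: list[Path], lang: str = "eng") -> list[list[str]]:
    """Build one tesseract command per page image, in file-name order."""
    commands = []
    for image in sorted(images):
        # Tesseract appends ".pdf" to the output base name itself.
        output_base = image.with_suffix("")
        commands.append(
            ["tesseract", str(image), str(output_base), "-l", lang, "pdf"]
        )
    return commands


def run_ocr(images: list[Path], lang: str = "eng") -> None:
    """Run OCR on every page; each page yields its own searchable PDF."""
    for command in tesseract_commands(images, lang):
        subprocess.run(command, check=True)


# Example call (hypothetical scan folder):
# run_ocr(list(Path("scans").glob("*.png")))
```

Each page ends up as a separate PDF with a text layer behind the image. Recognition quality depends heavily on scan resolution and page cleanliness, so the result still needs proofreading.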

