Editing/correcting OCR'd text in PDF with Draw

Phoggbank · September 1, 2022, 5:37pm

Using Draw, I am trying to correct errors in a PDF containing text created from an OCR process. I tried to follow the segment starting at 6:45 in:

The author shows how to create a second window containing the OCR text by selecting the original image and dragging it to the right. But when I do this, the second window, which should contain text, is completely blank. What am I doing wrong?

Grantler · September 1, 2022, 6:14pm

Make sure that the OCRed file contains both layers. A sample file is attached: a PNG image was OCRed with www.pdf24.org, and the resulting PDF file contains the text layer and the graphics layer. You can easily separate the graphics layer.

HaengenderEinzug_Fussnote.pdf (36.6 KB)

Original Writer file:
HaengenderEinzug_Fussnote.odt (25.5 KB)

ajlittoz · September 1, 2022, 6:09pm

IMHO, Kevin didn’t give full details about the original file. It isn’t a simple PDF. It contains an image (the top layer) with white background and some text boxes below (the bottom layer) created by the OCR application.

You can’t follow the same procedure unless your OCR file has the same structure, and this depends on the OCR software.

But, from experience with an OP on this site, PDF is not the best format for amending OCR files. You should save plain text (yes, you lose formatting information) and process it with a macro generator: you write correction rules as macros, then run the macro generator. However, not all macro generators are equally comfortable to use. The best choice is a 2-layer macro generator, i.e. one which operates not on characters but on “tokens” (meaningful groups of characters), because you’re handling natural language and natural language is based on words. Working on the intermediate “token” form makes it easy to ignore spaces and other “noise” sequences.
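
To illustrate the token idea, here is a minimal sketch in Python rather than in any particular macro generator. The rule table and the names `RULES`, `TOKEN_RE` and `correct` are made up for this example; real rules would come from the errors you actually see in your OCR output.

```python
import re

# Hypothetical correction rules: misread token -> intended token.
# Keys are lowercase; these entries are illustrative only.
RULES = {
    "tbe": "the",
    "aud": "and",
    "rnodern": "modern",
}

# Split text into word tokens and "noise" tokens (spaces, punctuation).
# Every character lands in exactly one token, so joining the tokens
# reproduces the input unchanged.
TOKEN_RE = re.compile(r"\w+|\W+")


def correct(text, rules=RULES):
    """Apply whole-token correction rules, leaving noise tokens untouched."""
    out = []
    for token in TOKEN_RE.findall(text):
        fixed = rules.get(token.lower())
        if fixed is None:
            out.append(token)                # no rule: keep token as-is
        elif token[0].isupper():
            out.append(fixed.capitalize())   # preserve a leading capital
        else:
            out.append(fixed)
    return "".join(out)


if __name__ == "__main__":
    print(correct("Tbe  rnodern world aud tbe old."))
```

Because the rules match whole tokens, a rule for "tbe" never touches those letters inside a longer word, and the original spacing and punctuation pass through untouched.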