OCR program created ODF file with un-editable boxes around every bit of text

I used an OCR program to convert a PDF file into an ODT file. The ODT file was opened up in LibreOffice Writer and it had a lot of mistakes that needed editing. However, when I try to select anything with the mouse, it immediately draws a box around the text, usually only one paragraph at a time, and then moves the box, including the inner text, instead of selecting it. The box has little squares around the corners that can be grabbed with the mouse and the box can be resized. If I right click inside the box, there is a context menu that will even let me rotate the box with the text in it. The context menu has several options in it, but “Get Rid Of Stupid Box” is not one of them.

So the page cannot be edited in this crippled format. I tried saving the file as a text file, but the resulting file was completely blank.

I also tried to use Select All to try to copy and re-paste it as unformatted text, but CTRL-A doesn’t work and EDIT->Select ALL is grayed and inactive.

If I double click text, a different box appears with no handles. When that happens, I can go to FORMAT->CLEAR DIRECT FORMATTING (otherwise it is gray and unselectable) and click that, but then nothing happens and the box is still around the text.

I’m going to retype all the text manually since the document is only 8 pages long, but I would like to know how to fix this bug for the next time this happens. Anybody else encounter this bug? How to get around it without retyping everything?

Look at the options for exporting the document from the OCR program. You want to find an option to export without layout or as text only.

When I OCR a document in my ancient OCR program, I export as Word but plain text (in fact as RTF). I then copy everything and paste it into Writer as unformatted text.

I then select all and apply Body Text format. After that, I go through and apply headings where needed.

With your document, you could export it to PDF. Open the PDF in Draw, select all the text, and click Shape - Consolidate Text. Copy the consolidated text and paste it into Writer as unformatted text. Repeat for each page, then apply styles to the text.

As LibreOffice has no OCR, I’d suggest looking at your (unknown/not named) OCR software first. From your description, it is quite likely a setting in your OCR software (and maybe even a bug). If Writer receives “letters in little boxes” it will display them…

The option is called “Delete”, but I guess that is not what you seek.

You didn’t ask for help. If you are interested, you may upload a shortened version of your .odt for inspection here.

Thanks, I use gImageReader, which is based on Tesseract. I used Export Text File and it did output the unformatted text as you described. The problem is that it does not retain the columns. It did recognize them, instead of scrambling them together, so it is possible to create new columns and copy the text in there. It is still a lot of manual intervention, but at least I’m not having to retype every single word.

I use gImageReader and Tesseract. Yes, Delete will erase the box and the text. If I click outside the box and hit Delete, it erases the whole page, text and all. So Delete is not very useful here.

I was unable to get a sample version of the file. It won’t let me select pages now, so I can’t delete them. I’m no longer sure how I deleted them before. The cursor keys do not move to a new page. If I scroll to the bottom with the mouse scroll bar, as soon as I click anything, it snaps back to the upper left of the top page.

I’m going to try to ocr just one page and see if that works. I will post that later. [EDIT-Done]

Yes, I would like help finding a way to remove these boxes, while still keeping the text and overall format.
sample.odt (12.0 KB)

From the text, it looks as if the original might have been a table, with action verbs in column 1 and descriptions in column 2. If that is the case, you might have to specify in the OCR program that some recognition areas are tables. I have not used gImageReader, so I am not sure what the user can do.

No that sample page is just a single column page with no tables. If you click on the sentences, the boxes will appear.

But yes, part of the blame goes to the OCR program for not generating a proper ODT file in the first place. Then again, LibreOffice shouldn’t allow whatever it did as a valid format. Tesseract is the most accurate one I’ve used, but not very customizable.

In gImageReader

  • you could strip line breaks to merge more text boxes. This is similar to my earlier suggestion of consolidate text but at the processing stage before export from gImageReader. You will still need to copy from text boxes but the fewer the better.
  • Or you could define manual recognition areas but it looks to be a bit annoying as you have to keep Ctrl pressed to add each area in the entire document. The idea is that you need only one recognition area per column. I see that you cannot define tables or image separately.
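If the text has already been exported as plain text, roughly the same line-break stripping can be done afterwards. The following is only a minimal sketch of that idea, not how gImageReader implements its option; the function name `strip_line_breaks` is made up for illustration, and it assumes paragraphs in the exported text are separated by blank lines.

```python
import re


def strip_line_breaks(text):
    """Join hard-wrapped lines into one line per paragraph.

    Assumption: paragraphs are separated by one or more blank lines.
    Hyphenated words split across lines are NOT rejoined here.
    """
    # Split the text into paragraphs at blank (or whitespace-only) lines.
    paragraphs = re.split(r"\n\s*\n", text.strip())
    merged = []
    for para in paragraphs:
        # Drop empty lines and surrounding whitespace, then join with spaces.
        lines = [line.strip() for line in para.splitlines() if line.strip()]
        merged.append(" ".join(lines))
    return "\n\n".join(merged)


if __name__ == "__main__":
    wrapped = "This is the first\nparagraph of text.\n\nThis is the\nsecond one.\n"
    print(strip_line_breaks(wrapped))
```

The result can then be pasted into Writer as unformatted text and styled there.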

Because what? One can create as complex / unusual documents as one wants. There is nothing technically wrong in such documents. They don’t match your expectations/needs - it’s a pity, but not a reason to consider that illegal.

Blame your OCR program.

The resulting “.odt” file contains absolutely no text (in the ODF definition of “text”). It is exclusively made of text boxes which are labelled rectangles, i.e. special graphical objects. They are external to text management and not handled by it. Consider them as images, though you can ultimately select the label and then paste it into the text flow.

Your document is kind-of a photo in “non-image” format. It is half-way between a scan and a true text document.

Follow @EarnestAl’s recipe with Draw to turn it into something manageable by Writer. Or retype everything if multi-column text is mangled.
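To see this structure for yourself, or to pull the labels out without clicking each box, a short script can read the `content.xml` inside the `.odt` (which is a zip archive) and collect the paragraphs that sit inside text boxes. This is a rough sketch under assumptions: it supposes the boxes are stored as standard ODF `draw:text-box` elements, which I have not verified against the sample file, and the function names are invented for illustration.

```python
import zipfile
import xml.etree.ElementTree as ET

# Standard ODF namespace URIs for drawing and text elements.
DRAW_NS = "urn:oasis:names:tc:opendocument:xmlns:drawing:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def text_box_paragraphs(content_xml):
    """Return the text of every paragraph found inside a draw:text-box.

    Accepts the content.xml data as str or bytes. Paragraphs come back in
    document order, which is not necessarily the reading order on the page.
    Special elements such as text:s (repeated spaces) or text:tab are ignored.
    """
    root = ET.fromstring(content_xml)
    paragraphs = []
    for box in root.iter("{%s}text-box" % DRAW_NS):
        for para in box.iter("{%s}p" % TEXT_NS):
            text = "".join(para.itertext()).strip()
            if text:
                paragraphs.append(text)
    return paragraphs


def odt_text_boxes(path):
    """Open an .odt file and extract the text-box paragraphs from it."""
    with zipfile.ZipFile(path) as odt:
        return text_box_paragraphs(odt.read("content.xml"))


if __name__ == "__main__":
    # "sample.odt" is the attachment name used earlier in this thread.
    for line in odt_text_boxes("sample.odt"):
        print(line)
```

If the script prints all of the document's text, that confirms the body text flow is empty and everything lives in drawing objects. The output still needs to be pasted into Writer and styled by hand.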

Yes, I can do that. The OCR program may be having difficulty in understanding that not all scans are perfect. Some are skewed by the way they were fed into the scanner. I suspect that these text boxes were its way to recreate a skewed document rather than just a plain document.

The OCR actually did very well with columns, but failed with tables in other scans.

Still, while it is clearly the OCR’s fault for generating a bad ODT file, LibreOffice should have a reasonable way to un-text-box stuff that doesn’t require a lot of manual editing. At least some way to merge text boxes. I have read that this kind of thing can also happen when saving web pages in ODT format.

But for now, it looks like the only solution is to save the OCR text in plain text mode. I tried saving in PDF and then copy->paste without formatting, but it’s not much faster. I didn’t have any problems with re-columnizing the text with LibreOffice. That went very smoothly, but re-tableizing text from another document proved to be a fail due to the way the OCR saved the text. Could be done, but would take forever.

Unfortunately, text boxes, more generally drawing objects, are provided in Writer to escape from the “text flow paradigm”. I.e. this is a feature to add totally unrelated stuff to a logically organised flow. Writer is built around this notion of text flowing along a unidirectional thread.

As already suggested, the best tool to try and get something relevant out of a PDF is Draw, where you can manipulate the text boxes. Not straightforward, though.

The manual for gImageReader clearly states that it saves the layout in hOCR format, that is, the text boxes you see. Best to export without layout and to use Writer to do the layout.

It seems there is also a way to get the “plain text”, if I read the following resource correctly - but it may require some dreaded RTFM.

https://publish.illinois.edu/commonsknowledge/2020/11/05/free-open-source-optical-character-recognition-with-gimagereader/

But you are welcome to create an extension to “clean any messy file” according to n4mwd-rules.

The blame lies squarely with your OCR software, which appears to be incapable of handling columnar formatted text.

As you are using Linux, I’m not surprised. FOSS GUI OCR software offerings on Linux often lack the useful functions that you find in proprietary OCR software for other OSes, a situation I have moaned about on several occasions elsewhere. I mostly use ABBYY FineReader on Mac for work, which produces accurate OCR output and also a fairly decent ODF export, but when I’m on my Linux workstation, the whole thing quickly becomes an exercise in tediousness and mediocrity. I’ve yet to find a Linux OCR program that is as easy to use and produces decently formatted export documents (whether Word, Excel, or ODF). LibreOffice can only deal with the export document provided by the OCR software. If that export is all text boxes rather than the original format, that is what you have to work with.

Yes, that is its private pseudo-proprietary format. You have to use EXPORT to get anything out of it. It will export plain text (the only thing that works properly, but lacks formatting), PDF, and ODT.

The bottom line is that the only practical output is plain text, but then I have to add the proper formatting back in.

If it is any consolation, IrisReader also creates those text boxes if you choose anything except without layout.

On the other hand, it works well for text under pdf.

If you use styles logically, then reformatting the text is not a big job. I see that gImageReader doesn’t do tables, so you will need time to rebuild tables.

pseudo-proprietary??

Obviously there are some users who want hOCR…
From the link I cited above:

You can export this to a text file or copy and paste it into another program. This may be useful in some cases, but if you want to export a searchable PDF, you will need to use hOCR, PDF mode

Yes, as in it’s open source, but not really used by anything other than OCR programs. If you go into gImageReader and click Save, you get an hOCR file. So it is possible for a 3rd party with more energy than me to write a program that will interpret the hOCR file and output a properly formatted ODT file. Right now the generated ODT files are unacceptable. In addition to text boxes, they contain multiple font sizes, font styles, and character and line spacings where no such things existed in the original. So yes, plain text is the only acceptable output until such things are fixed…
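For anyone with that energy, a first step could look like the sketch below. hOCR is HTML in which recognized lines and words are marked with `class` attributes such as `ocr_line` and `ocrx_word`, so Python’s standard `html.parser` is enough to recover the text line by line. This is an illustration only: the names `HocrLines` and `hocr_to_text` are made up, it assumes words are wrapped in `<span class='ocrx_word'>` inside `<span class='ocr_line'>` as in typical Tesseract output, and it ignores the bounding boxes that a real ODT converter would need for columns and tables.

```python
from html.parser import HTMLParser


class HocrLines(HTMLParser):
    """Collect recognized words from hOCR markup, grouped by line."""

    def __init__(self):
        super().__init__()
        self.lines = []    # one list of words per ocr_line element
        self._word = None  # text chunks of the word being read, if any

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "ocr_line" in classes:
            self.lines.append([])
        elif "ocrx_word" in classes:
            self._word = []

    def handle_data(self, data):
        # Only keep text that sits inside a word element.
        if self._word is not None:
            self._word.append(data)

    def handle_endtag(self, tag):
        # The first </span> seen while reading a word closes that word.
        if tag == "span" and self._word is not None:
            word = "".join(self._word).strip()
            if word:
                if not self.lines:
                    self.lines.append([])
                self.lines[-1].append(word)
            self._word = None


def hocr_to_text(hocr):
    """Return the hOCR text with one output line per recognized line."""
    parser = HocrLines()
    parser.feed(hocr)
    parser.close()
    return "\n".join(" ".join(words) for words in parser.lines if words)


if __name__ == "__main__":
    # "scan.html" is a placeholder for a file saved from the OCR program.
    with open("scan.html", encoding="utf-8") as handle:
        print(hocr_to_text(handle.read()))
```

Turning this into properly formatted ODT would additionally mean reading the `bbox` coordinates in each element’s `title` attribute to detect columns and tables, which is the hard part.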