Hello everyone, I realize this question might be a bit out of the ordinary; I’m mostly looking for guidance on what is possible.
I’ve been developing an OCR tool as a side project and, as part of it, a table detection model. To train that model, I need an acceptable set of training data. My process right now is to use LibreOffice to convert documents to images, then manually inspect the images and annotate the table bounding boxes. It has worked OK so far, but it is, of course, very time-consuming, so I have been looking at ways to automate it. Since my documents are all in ODT or Word format, and since I’m already using LibreOffice, I was wondering whether LibreOffice could export the page images together with a file listing the different elements (characters, tables, shapes, …) and their positions on each page.
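For context, here is a minimal sketch of what such a conversion step can look like. It assumes the conversion goes through PDF first (as far as I can tell, converting a Writer document straight to PNG only gives the first page) and that the PDF is rasterized by a separate tool afterwards. The function names and the `soffice` executable name are just placeholders for illustration.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc: Path, out_dir: Path, soffice: str = "soffice") -> list[str]:
    """Build the headless LibreOffice command that converts `doc` to PDF.

    `soffice` is assumed to be on PATH; pass a full path otherwise.
    """
    return [
        soffice,
        "--headless",            # run without a GUI
        "--convert-to", "pdf",   # target format
        "--outdir", str(out_dir),
        str(doc),
    ]


def convert_to_pdf(doc: Path, out_dir: Path) -> Path:
    """Run the conversion and return the expected path of the PDF."""
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(build_convert_command(doc, out_dir), check=True)
    # LibreOffice keeps the input file name and swaps the extension.
    return out_dir / (doc.stem + ".pdf")
```

Calling `convert_to_pdf(Path("report.odt"), Path("out"))` should leave `out/report.pdf` behind, which is then rasterized page by page and annotated by hand.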
I know that LibreOffice doesn’t do this by default (unless there is a file format that I missed when I looked at the available export options), but given that LibreOffice is rendering the document, I could, conceptually, piggyback on the rendering code to export that information. However, I’ve been browsing the code and I’m not sure where to start. I also know that LibreOffice has extensions.
So, if I wanted to create a new exporter that writes out the image representation of each page plus a file with the positions of its elements, would it be better to write an extension or to fork the entire project? In either case, where should I start looking?
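To make the target concrete, the per-page file I have in mind could look roughly like the sketch below. Everything in it is hypothetical: the `Element` class, the field names, and the assumption that the exporter reports positions in points (1/72 inch) from the top-left corner of the page. LibreOffice may well use different units internally, in which case only the scale factor would change.

```python
import json
from dataclasses import dataclass

POINTS_PER_INCH = 72.0


@dataclass
class Element:
    """One page element as the exporter might report it (hypothetical)."""
    kind: str      # e.g. "table", "character", "shape"
    x: float       # left edge, in points from the page's left edge
    y: float       # top edge, in points from the page's top edge
    width: float   # in points
    height: float  # in points


def to_pixels(el: Element, dpi: int) -> dict:
    """Convert an element's box to pixel corners [x0, y0, x1, y1] at `dpi`."""
    scale = dpi / POINTS_PER_INCH
    return {
        "kind": el.kind,
        "bbox": [
            round(el.x * scale),
            round(el.y * scale),
            round((el.x + el.width) * scale),
            round((el.y + el.height) * scale),
        ],
    }


def page_annotation(image_name: str, elements: list[Element], dpi: int) -> str:
    """Serialize one page's elements as JSON, to sit next to the page image."""
    page = {
        "image": image_name,
        "dpi": dpi,
        "elements": [to_pixels(el, dpi) for el in elements],
    }
    return json.dumps(page, indent=2)
```

For example, `page_annotation("page-001.png", [Element("table", 72, 144, 360, 72)], dpi=144)` would describe a single table whose pixel box is `[144, 288, 864, 432]`. A file like this per page is all the training pipeline would need.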