How can I extract metadata, styles and absolute positions of form elements using pyUNO?

benzea · August 8, 2014, 3:50pm

Hi,

I am developing the project SDAPS (http://sdaps.org). Right now the project reads in both the ODT file and PDF file, to extract metadata (from style names) and positions of text fields (by parsing the PDF file). I would like to extract this information directly from LibreOffice using python3-uno.
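For readers unfamiliar with the ODT side of this, the style names can be read without LibreOffice at all, because an ODT file is a ZIP archive with a content.xml inside. The following is only a minimal sketch of that idea, not SDAPS's actual code: the function name paragraph_styles and the style names in the usage note are made up for illustration, and it assumes the metadata lives on paragraph-level styles.

```python
import zipfile
import xml.etree.ElementTree as ET

# ODF namespaces used in content.xml
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"


def paragraph_styles(odt_path):
    """Return (style_name, text) pairs for every paragraph and heading.

    Hypothetical helper for illustration. Automatic styles such as "P1"
    are mapped back to their parent style where one is declared.
    """
    with zipfile.ZipFile(odt_path) as odt:
        root = ET.fromstring(odt.read("content.xml"))

    # Automatic styles in content.xml point to the user-visible style
    # through style:parent-style-name.
    parents = {
        s.get(f"{{{STYLE_NS}}}name"): s.get(f"{{{STYLE_NS}}}parent-style-name")
        for s in root.iter(f"{{{STYLE_NS}}}style")
    }

    pairs = []
    for elem in root.iter():
        if elem.tag in (f"{{{TEXT_NS}}}p", f"{{{TEXT_NS}}}h"):
            style = elem.get(f"{{{TEXT_NS}}}style-name")
            style = parents.get(style) or style
            pairs.append((style, "".join(elem.itertext())))
    return pairs
```

Calling paragraph_styles("questionnaire.odt") would then give a list such as [("Question", "Age?"), ...], where both the file name and the style names are placeholders. Note that this yields no position information at all, which is why the PDF is parsed as well.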

Ideally, one would develop a new way to describe these forms (it would be neat if it also was a PDF form). However, there are some constraints:

Check boxes and radio buttons should have a specific border width (exactly 1pt)
The absolute position needs to be exported (one way might be to use the XML layout export information)
The text belonging to a checkbox needs to be exported
Multi-column layouts and right-to-left languages should work
More metadata should be included (e.g. assigning variables/values to checkboxes for choice/range questions)
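A note on the first constraint, in case the width is ever checked through UNO: border line widths there are given in 1/100 mm as integers, so "exactly 1pt" (about 35.28 in those units) cannot be compared for strict equality. Here is a small sketch of a tolerant check; the function names and the tolerance of one unit are my own assumptions, not anything SDAPS prescribes.

```python
# 1 pt = 1/72 inch and 1 inch = 25.4 mm, so 1 pt = 2540/72 hundredths of a mm.
HUNDREDTHS_MM_PER_PT = 2540 / 72


def pt_to_hundredths_mm(points):
    """Convert typographic points to 1/100 mm."""
    return points * HUNDREDTHS_MM_PER_PT


def is_border_width(width_100mm, points=1.0, tolerance=1):
    """True if a width in 1/100 mm matches `points` within `tolerance` units.

    The tolerance (an assumed value) absorbs the rounding that happens when
    a point size is stored as an integer number of 1/100 mm.
    """
    return abs(width_100mm - pt_to_hundredths_mm(points)) <= tolerance
```

For instance, is_border_width(35) accepts a border stored as 35/100 mm as a 1pt border, whereas is_border_width(26) (roughly 0.75pt) is rejected.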

In general, it would seem sane to use normal form elements. However, it does not seem possible to specify exactly how they should be drawn. Right now we are using empty text fields with a known size and border.

An example (document) can be found at:

I’ll be happy to answer more questions about this. I would love to hear ideas about how this could be done nicely. An important aspect is of course to make it simple to use; the style-based system right now is rather hard to use, and even worse to debug.

bencomp · August 9, 2014, 2:47pm

Interesting problems. Could you add a little information from the project website, so that users don’t need to visit the project page (which may not be around forever)? Also, on a more project-related note: although LibreOffice is pretty versatile and can create working PDF forms, I would have chosen different tools. XSL-FO comes to mind: define a form in XML with questions (variables) and possible answers, use Apache FOP to convert it to a layout and PDF, and go from there.

benzea · August 9, 2014, 10:42pm

I can add some general information later on.

The whole point is to make the layout creation simple. Obviously one can create forms like this in other ways; that is pretty straightforward, but not something for your normal user (see the LaTeX support).

The thing is, a normal user expects a WYSIWYG editor; and doing some magic with LibreOffice seems like a relatively sane way to get a good WYSIWYG editor that works in the way that many users expect. The user experience has to be the central point.

oweng · August 9, 2014, 1:49pm

This is not an answer, and I doubt you will get much in the way of an answer unless the question is made more specific (currently it is quite general). I imagine it is possible to use Python to access the Universal Network Objects (UNO) component model and step through the list of used styles (in a similar manner to this example in Basic from the Apache OO wiki). It is less clear to me how the positional information of such objects might be accessed, especially as the concept of a “page” is something handled by the rendering / layout engine of Writer.
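To make the style part concrete, here is a rough, untested-against-your-document sketch in Python rather than Basic. It assumes LibreOffice was started with soffice --accept="socket,host=localhost,port=2002;urp;" and that python3-uno is installed; the port number and the helper names connect and used_styles are my own choices. It only lists styles and does nothing about positions.

```python
def connect(port=2002):
    """Return the Desktop of a LibreOffice instance listening on `port`.

    Assumes the office was started with a matching --accept option.
    """
    import uno  # shipped with python3-uno; imported lazily on purpose

    local_ctx = uno.getComponentContext()
    resolver = local_ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.bridge.UnoUrlResolver", local_ctx)
    ctx = resolver.resolve(
        f"uno:socket,host=localhost,port={port};urp;StarOffice.ComponentContext")
    return ctx.ServiceManager.createInstanceWithContext(
        "com.sun.star.frame.Desktop", ctx)


def used_styles(doc, family="ParagraphStyles"):
    """Return the names of the styles in `family` that the document uses."""
    styles = doc.getStyleFamilies().getByName(family)
    return [name for name in styles.getElementNames()
            if styles.getByName(name).isInUse()]
```

With a Writer document open, something like used_styles(connect().getCurrentComponent()) should return the paragraph style names in use; passing family="FrameStyles" or "CharacterStyles" would query the other families.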

Your entire project seems geared towards positional information of form elements being fairly fixed (encoded) and readily extractable to determine meaning. This would seem a somewhat difficult approach to extracting information from forms, which is more typically based around element identifiers providing meaning and positional information being largely irrelevant. I would think that PDF is a much better format for storing / accessing positional information via an API than ODF.

benzea · August 9, 2014, 10:46pm

Yes, the absolute position is essential to do the later data processing. I have looked into accessing the information using UNO, but it appears that the API is not complete in this regard.

So, if one really wants to get this working, then one probably needs some hacks (e.g. use the XML export, and include metadata as hidden text; maybe even exporting twice). Though I wouldn’t mind that, as long as it is stable.

The current approach is to parse the PDF; however, this is far from perfect.

AlexKemp · March 1, 2016, 8:25pm