Converting docx in headless mode hangs

My application runs on an Ubuntu 16.04 web server where uploaded files are automatically converted to PDF with doc2pdf, which is part of unoconv, which in turn uses LibreOffice in headless mode. When I try to convert a corrupted DOCX document, the command hangs with 100% of the CPU utilized, and eventually I have to reboot to recover.

I want one of these things to happen:

  • The command exits with an error
  • I can set a timeout and, if it is reached, the command exits with an error
  • I can detect beforehand that the DOCX document is broken

Setting the --timeout option on doc2pdf doesn’t help.

When I try:

doc2pdf works.docx

I get this and then a prompt:

$ doc2pdf works.docx
W: Unknown node under /registry/extlang: deprecated
W: Unknown node under /registry/grandfathered: comments
W: Unknown node under /registry/grandfathered: comments
Fontconfig warning: ignoring UTF-8: not a valid region tag
$

When I try (-vvv means verbose debugging mode):

doc2pdf -vvv broken.docx

I get this and then it hangs forever:

$ doc2pdf -vvv broken.docx
Verbosity set to level 3
Using office base path: /usr/lib/libreoffice
Using office binary path: /usr/lib/libreoffice/program
DEBUG: Connection type: socket,host=127.0.0.1,port=2002,tcpNoDelay=1;urp;StarOffice.ComponentContext
DEBUG: Existing listener not found.
DEBUG: Launching our own listener using /usr/lib/libreoffice/program/soffice.bin.
LibreOffice listener successfully started. (pid=2940)
W: Unknown node under /registry/extlang: deprecated
W: Unknown node under /registry/grandfathered: comments
W: Unknown node under /registry/grandfathered: comments
Fontconfig warning: ignoring UTF-8: not a valid region tag
Input file: broken.docx

Pressing Ctrl+C exits with this error:

^Cunoconv: SystemError during import phase:
Couldn't instantiate python representation of structured UNO type com.sun.star.lang.DisposedException
Traceback (most recent call last):
  File "/usr/bin/doc2pdf", line 1278, in <module>
    die(exitcode)
  File "/usr/bin/doc2pdf", line 1131, in die
    if convertor.desktop.getCurrentFrame():
uno.DisposedException: Binary URP bridge already disposed

Converting a normal DOCX document with LibreOffice headless directly works fine:

libreoffice --headless --convert-to pdf works.docx

Converting the corrupted DOCX document with LibreOffice headless directly doesn’t work:

libreoffice --headless --convert-to pdf broken.docx

The output for both looks like this. The only difference is that the conversion of the corrupted DOCX document hangs after the last line:

javaldx: Could not find a Java Runtime Environment!
Warning: failed to read path from javaldx
W: Unknown node under /registry/extlang: deprecated
W: Unknown node under /registry/grandfathered: comments
W: Unknown node under /registry/grandfathered: comments
Fontconfig warning: ignoring UTF-8: not a valid region tag
convert /home/forge/broken.docx -> /home/forge/broken.pdf using filter : writer_pdf_Export

Just file a bug report with the sample document attached.
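
Until the bug is fixed, one possible stopgap is to bound the conversion from the outside with the coreutils timeout command, so that a hung conversion ends with an error exit status instead of spinning forever. This is only a sketch: run_with_timeout is a hypothetical helper name, and the 60-second limit and 5-second grace period are arbitrary assumptions that you would tune for your server. I have not tested it against your broken document.

```shell
#!/bin/sh
# Sketch: bound any command with coreutils `timeout`.
# run_with_timeout is a hypothetical helper, not part of unoconv or LibreOffice.
run_with_timeout() {
    secs=$1
    shift
    # Send SIGTERM after $secs seconds; if the command is still alive
    # 5 seconds later, follow up with SIGKILL.
    timeout -k 5 "$secs" "$@"
}

# Example call (the 60-second limit is an assumption):
#   run_with_timeout 60 libreoffice --headless --convert-to pdf broken.docx
#   status=$?
#   if [ "$status" -eq 124 ]; then
#       echo "conversion timed out" >&2
#   fi
```

timeout exits with status 124 when the limit is reached and the command dies from the first signal, and with 137 when it has to fall back to SIGKILL, so the calling code can treat any non-zero status as a failed conversion. If you keep using doc2pdf, check afterwards that the soffice.bin listener it started is really gone, since that is the process that was burning the CPU.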