I’ve been working with LibreOffice to convert some HTML documents to DOCX, and a curious problem cropped up. For some of the files, the resulting document contains what looks like the HTML renderer output as its text instead of the properly converted content. (I’m omitting that text for brevity.) Through some trial and error, I think I found the culprit. Consider the following:
<div> word——another word</div>
Converting that using:
loffice --headless --convert-to docx mincase.html
… produces the issue I’m talking about, at least in v5.0.0.5. Two em dashes in the same file, seemingly regardless of how they are separated, cause essentially raw HTML to be used as content instead of the rendered content itself.
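To make the minimal case reproducible byte for byte, here is a rough sketch of how the test file could be generated and checked before running the conversion command above. It assumes the file is saved as UTF-8, which is an assumption on my part and not something established above; the variable names `emdash` and `count` are just illustrative.

```shell
# Write the minimal case. \342\200\224 is the UTF-8 byte sequence
# for an em dash (U+2014); UTF-8 encoding is assumed here.
printf '<div> word\342\200\224\342\200\224another word</div>\n' > mincase.html

# Count the em dashes in the file. grep -o prints each match on its
# own line, so wc -l gives the number of occurrences.
emdash=$(printf '\342\200\224')
count=$(grep -o "$emdash" mincase.html | wc -l | tr -d ' ')
echo "em dashes in mincase.html: $count"
```

The same count could be run against the real documents to see whether the ones that fail to convert are the ones containing two or more em dashes.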
What is particularly perplexing is that opening the minimal case in LibreOffice directly produces a fully rendered, proper version of what it should look like:
word——another word
This leads me to my questions:
- Is this a legitimate bug?
- What does opening the file in LibreOffice do differently from converting it on the command line?
- Can I mimic that behavior from the command line?