Em dash conversion issues from HTML

I’ve been working with LibreOffice to convert some HTML documents to DOCX, and a curious problem cropped up. For some of the files, the conversion writes what looks like the HTML renderer's output into the document as text instead of actually finishing the conversion. (I’m omitting that text for brevity.) Through some trial and error, I think I found the culprit. Consider the following:

<div>&nbsp;word&mdash;&mdash;another word</div>

Converting that using:

loffice --headless --convert-to docx mincase.html

… produces the issue I’m talking about, at least in v5.0.0.5. Two em dashes in the same file, seemingly regardless of how they are separated, cause what is essentially raw HTML to end up as the document's content instead of the rendered text.

What is particularly perplexing is that opening the minimum case with LibreOffice directly produces a fully rendered, proper version of what the minimum case should look like:

word——another word

This leads me to my questions:

  1. Is this a legitimate bug?
  2. What does opening the file in LibreOffice do differently from converting it on the command line?
  3. Can I mimic that behavior from the command line?

Hi ross

Going by your report, yes, it’s a bug.

LO 5 is the LibreOffice Fresh release line and, as such, may be buggy. You should get used to reporting problems like this through Bugzilla (or the Bug Assistant).

One alternative is to use LibreOffice Still (more stable, less exciting).

With LO there is nothing to stop you from using the actual UTF-8 encoded Unicode character in the HTML, as long as the font in use contains the glyph(s).

  • ‒ (U+2012) FIGURE DASH
  • – (U+2013) EN DASH
  • — (U+2014) EM DASH (may be used in pairs to offset parenthetical text)
  • ― (U+2015) HORIZONTAL BAR (alias: quotation dash) (long dash introducing quoted text)
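
As a rough sketch of that suggestion, the dash entities could be swapped for the literal characters before the file is handed to the converter. The function names and file handling here are hypothetical, the source file is assumed to be UTF-8 and to declare that charset, and I have not checked whether this actually sidesteps the conversion problem in 5.0.0.5:

```python
import html
import re
from pathlib import Path

# Named and numeric references for U+2012..U+2015 only; every other
# entity (for example &nbsp;) is deliberately left untouched.
DASH_ENTITY = re.compile(r"&(?:mdash|ndash|horbar|#821[0-3]|#[xX]201[2-5]);")


def replace_dash_entities(markup: str) -> str:
    """Return markup with dash entities replaced by the literal characters."""
    return DASH_ENTITY.sub(lambda m: html.unescape(m.group(0)), markup)


def rewrite_file(src: str, dst: str) -> None:
    """Hypothetical helper: read src as UTF-8, write the rewritten copy to dst."""
    text = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(replace_dash_entities(text), encoding="utf-8")
```

The rewritten copy can then be passed to the same `--convert-to docx` command from the question.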
