emdash conversion issues from html [closed]

asked 2015-08-13 06:27:04 +0100

ross

updated 2020-08-26 13:05:09 +0100

Alex Kemp

I've been working with LibreOffice to convert some HTML documents to DOCX, and a curious problem cropped up. Some of the converted files contain what looks like the HTML renderer's output as text instead of a properly converted document. (I'm omitting that text for brevity.) Through some trial and error, I think I found the culprit. Consider the following:

<div>&nbsp;word&mdash;&mdash;another word</div>

Converting that using:

loffice --headless --convert-to docx mincase.html

... produces the issue I'm talking about, at least in v5.0.0.5. Two emdashes in the same file, seemingly regardless of how they are separated, cause essentially raw HTML to be used as content instead of the rendered content itself.
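Since several files are involved, the same command can be driven from a small script. This is only a minimal sketch: it reuses exactly the flags shown above, and the helper names `build_convert_command` and `convert_all` are made up for illustration. It assumes `loffice` is on the PATH.

```python
import subprocess
from pathlib import Path


def build_convert_command(html_path, binary="loffice"):
    """Return the argument list for the conversion command used above."""
    return [binary, "--headless", "--convert-to", "docx", str(html_path)]


def convert_all(folder, binary="loffice"):
    """Run the conversion once for every .html file in `folder`."""
    for html_path in sorted(Path(folder).glob("*.html")):
        # check=True raises CalledProcessError if the command exits non-zero.
        subprocess.run(build_convert_command(html_path, binary), check=True)
```

Calling `convert_all(".")` would convert every HTML file in the current folder. A zero exit status does not prove the DOCX is correct, so the output still needs checking for the raw-HTML symptom.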

What is particularly perplexing is that opening the minimum case with LibreOffice directly produces a fully rendered, proper version of what the minimum case should look like:

 word——another word

This leads me to my questions:

  1. Is this a legitimate bug?
  2. What does opening the file in LibreOffice do differently from converting it on the command line?
  3. Can I mimic that behavior from the command line?

Closed for the following reason: question is not relevant or outdated (closed by Alex Kemp)
close date 2020-08-26 13:04:07.870993

1 Answer

answered 2015-10-06 21:01:21 +0100

Alex Kemp

Hi ross

From your report, yes, it's a bug.

LO 5 is a LibreOffice Fresh release and as such may be buggy. You need to get used to reporting problems like this via Bugzilla (or the Bug Assistant).

One alternative is to use LibreOffice Still (more stable, less exciting).

Workaround:
With LO there is nothing to stop you from using the actual UTF-8 encoded Unicode character in place of the entity, as long as the font in use contains the character glyph(s).

  • ‒ (U+2012) FIGURE DASH
  • – (U+2013) EN DASH
  • — (U+2014) EM DASH (may be used in pairs to offset parenthetical text)
  • ― (U+2015) HORIZONTAL BAR (alias: quotation dash) (long dash introducing quoted text)
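If the HTML files are generated or numerous, the substitution can be scripted before conversion. The following is a minimal sketch, not a tested fix for the bug: the function name `replace_dash_entities` is hypothetical, and it only touches entities for the four dash characters listed above, leaving everything else (such as `&nbsp;` or `&amp;`) alone.

```python
import re

# Named entities for the dash characters listed above.
NAMED = {"mdash": "\u2014", "ndash": "\u2013", "horbar": "\u2015"}
# Code points that numeric references (&#8212; or &#x2014;) may point at.
DASH_CODEPOINTS = {0x2012, 0x2013, 0x2014, 0x2015}

_ENTITY = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|[A-Za-z]+);")


def replace_dash_entities(html_text):
    """Replace dash entities with literal characters; keep other entities."""
    def _sub(match):
        body = match.group(1)
        if body.startswith("#"):
            # Numeric reference: hexadecimal if it starts with #x, else decimal.
            codepoint = int(body[2:], 16) if body[1] in "xX" else int(body[1:])
            return chr(codepoint) if codepoint in DASH_CODEPOINTS else match.group(0)
        return NAMED.get(body, match.group(0))
    return _ENTITY.sub(_sub, html_text)
```

The result should be written back as UTF-8 (for example `Path(name).write_text(text, encoding="utf-8")`) before running the conversion command from the question. Whether LibreOffice then reads the file as UTF-8 may depend on the document declaring its charset, which is an assumption worth checking on a single file first.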

If this helps then please tick the answer (✔)
...and/or show you like it with an uptick
