Multiple bugs in Writer to HTML conversion

Multiple bugs were detected in an .odt to HTML conversion. The odt file is attached. I can not upload the x.html file.

  1. Roman Numeral Chapter number is unrecognized.

Instead of <h1>II thing</h1> the output is <h1>IVThing</h1>. Note that the chapter number “II” and the title are not separated by a space.

  2. Alphabetic header numbering is not recognized.

Instead of <h2> b thing </h2> I get <h2>bthing</h2>.

  3. Incorrect indentation of text following a header.

Instead of <p>&nbsp;&nbsp;&nbsp;&nbsp;see attachment</p> or <p style="margin-left: xxpx">see attachment</p> I get <p style="margin-top: 0.05in"> See Attachment</p>, and the spaces are ignored by browsers.

This is not an error, but why is the DOCTYPE HTML 4.0 Transitional instead of HTML 5.0?

Edited by ajlittoz for better formatting

thing.odt

Writer is a word processor, not an HTML editor. To use HTML5 you have to download the extension here: https://extensions.libreoffice.org/@@search?SearchableText=HTML5 or use a real HTML editor like the one at http://www.bluegriffon.org
By design, Writer cannot create HTML 5 documents. Until the developers update that option, you will need either the extension I linked to or a real HTML editor, not something that just exports to HTML.

Writer is a word processor with an option to convert and output an HTML file. Take a look at the File->Save As options. Since HTML files can also be opened, the word processor can be used as an HTML editor. I may be wrong, but it seems to me that if you can (1) output an HTML file, (2) input an HTML file, and (3) modify an HTML file, then Writer is an HTML editor.

No, Writer is not an HTML editor (in my definition of the term).

When you open a document, the HTML is interpreted and converted into an ODF representation. You then edit a “standard” text document with Writer commands and styles. When you save as HTML, the ODF is converted back into HTML, with the formatting split between HTML elements and CSS definitions (in my opinion, this clutters the HTML because there is no obvious correspondence between the formatting concepts of the two formats).

You have no direct access to the HTML tags (this is where I’d put “HTML editing”, not to the “high level” equivalent through styles).

And you must not forget that an HTML sequence of spaces is replaced by a single space by all browsers unless you enclose text in <pre> ... </pre>, which you cannot enter with Writer.

You might find this extension useful: Writer2xhtml » Extensions

  1. No space after chapter numbering

You must configure the properties of chapter numbering with Tools>Chapter Numbering. The settings you may have customised in Writer are not forwarded to the HTML editor. Treat the two as separate applications as far as settings are concerned.

  2. Alphabetic numbering

    Same as 1. above. Configure all your levels, notably the separator before and after parameters.

  3. Indentation

    If indent is made with NBSP, those NBSP are present in the resultant HTML and rendered as expected.

    However, if you use ordinary spaces for your indent, those spaces are also present in the HTML (Writer takes no initiative about that and you can’t blame it), but W3C says that any sequence of ordinary spaces is considered as a single space, and browsers render it accordingly (possibly stretching all spaces in a line to justify it edge to edge). This is intended behaviour in browsers.

    Anyway, in Writer or in browsers, it is always a bad idea to indent paragraphs with spaces. It is much better, safer and more reliable to use styles (or, as a fallback, the indent markers in the horizontal ruler). These translate to CSS and give highly predictable results.
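
To make the difference between ordinary spaces and NBSP concrete, here is a minimal Python sketch that approximates what a browser does to inline text under the default CSS `white-space: normal`. The function name `collapse_whitespace` is made up for illustration, and this is only an approximation: a real browser additionally drops a collapsible space at the very start of a line, so the ordinary-space indent disappears completely on screen.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Roughly mimic browser whitespace collapsing for inline text.

    Only ASCII space, tab, line feed, carriage return and form feed are
    collapsible. U+00A0 (NBSP) is deliberately not in the character
    class, which is why an NBSP indent survives.
    """
    return re.sub(r"[ \t\n\r\f]+", " ", text)


indented_with_spaces = "    See attachment"
indented_with_nbsp = "\u00a0\u00a0\u00a0\u00a0See attachment"

# The four ordinary spaces collapse to one; the four NBSP are untouched.
print(repr(collapse_whitespace(indented_with_spaces)))
print(repr(collapse_whitespace(indented_with_nbsp)))
```

Note that the explicit character class is used instead of `\s`, because `\s` in Python also matches NBSP and would hide exactly the difference being shown.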

Note that there are far fewer built-in styles in the Writer/HTML component than in Writer. But you can always define custom ones for your needs. They will translate to CSS to be rendered in a portable way by browsers.

To summarise: there is no bug in Writer/HTML, but rather a misconception of yours about the differences between HTML editing and document editing (and also a different base configuration in those components).

EDIT 1

CAUTION! Save as .odt or save as .html from an .odt is not the same as File>New>HTML Document then save!

In the first case, you edit a sophisticated formatted document which is converted into an HTML one with embedded CSS directives trying to mimic Writer styles. In the second case, you start directly into HTML formatting with simplified styles corresponding to W3C standards. This could explain the differences. I conducted my test in the second context.

However, in both cases, you don’t edit HTML markup directly. You still control Writer through its styles (with full or reduced capabilities), and these styles are translated, perhaps not as you would like, by an export filter (the translation may be approximate; remember: Traduttore, traditore).

EDIT 2020-03-25

I had a look at the attached sample file.

  • It is an .odt file

    This may seem an obvious and odd thing for me to point out. However, as I explained, File>New>Text Document and File>New>HTML Document are different and launch different LO components. By creating an .odt document, you format a common paginated to-be-printed document. With File>Save As in .html format, the document will be approximately translated from ODF concepts to HTML concepts.

    The first difference: from pages to continuous stream.

    A second difference, of utmost importance in your case: HTML has no notion of tab stops; consequently tabs are simply ignored (they could have been translated into a space but the output converter has not chosen this path).

  • If you had chosen from start File>New>HTML Document,

    all menus and formatting actions would have been downgraded to HTML capabilities. Have you noticed, the first time you saved as .html, the dialog warned you about the possible loss of formatting effects?

  • How to deal with the issue?

    Ideally, you should restart from scratch. But, considering you already spent a lot of time on the real document, there are several workarounds.

    • Chapter Numbering:

      The fix is a bit “twisted”. Since tabs are ignored by the output filter, set Number followed by to Nothing in Tools>Chapter Numbering, Position tab. In the Numbering tab, for all used levels, set Separator After to a space.

    • Indents and vertical spacing

      This must be managed exclusively through paragraph styles. This means that your level-dependent left indent requires one style per level. You can’t use Text Body everywhere (e.g. your “See attachment” paragraphs at level two are aligned with the Heading 2 title but end up flush with the left margin in HTML because the tabs have been removed).

    • Headings

      I don’t fully understand what you’ve done to Heading 2. In principle, you don’t change indents in the paragraph style but in Tools>Chapter Numbering, Position tab; otherwise you may run into inconsistencies under quite frequent circumstances (an internal conflict with list numbering handling while the chapter number is formatted).

      Also, you seem to have activated bullet/numbered list on Heading 2 while it is managed by Tools>Chapter Numbering. This is another case for internal conflict.

    If you address these points, you should not be very far from your goal.

    Don’t hesitate to request further help (either by editing your question or by commenting under my answer).

In case you need clarification, edit your question (not an answer) or comment the relevant answer.

2020-09-19 edit only fixed a formatting error

Thanks for commenting. Let me try to address the issues as you have presented them.

  1. I have taken your advice and configured Chapter Numbering in Tools->Chapter Numbering so that there is a space between the ‘number’ and the title. This has had no effect on the output. The output shows “IIthing” instead of “II thing”.
  2. The issue is moot. The heading2 entry has been replaced with a list.
  3. &nbsp; was never used. The text shows that this was an example of output from the HTML generator that would accomplish the required spacing. My original test .odt file, which I believe I uploaded, shows that I did not insert any spaces. LibreOffice positioned the “See Attachment” line directly under the title in the “Heading 2” item. This placement was not captured by the HTML generator, which instead output a series of spaces preceding the text. As you have rightfully pointed out, consecutive spaces are truncated to a single space. The issue remains as to generator behavior.

The issue is that there are bugs in the HTML generator. The bugs include a failure to recognize .odt spacing in heading titles, a failure to recognize the automatic indentation that .odt files apply to the paragraph following a heading, and a failure to substitute &nbsp; for each space in a run of multiple spaces. Not shown in this example is the failure to properly indent lists within chapters.
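
As an illustration of the substitution I have in mind, here is a minimal Python sketch. It is a post-processing idea only, not how the LibreOffice export filter works; the function name `preserve_space_runs` is made up, and the sketch assumes it is handed plain text content rather than markup (running it over tags or attribute values would corrupt them).

```python
import re


def preserve_space_runs(text: str) -> str:
    """Replace every space in a run of two or more spaces with &nbsp;.

    Single spaces are left alone so that normal line wrapping between
    words still works. Assumption: `text` is text content only, with
    no HTML tags in it.
    """
    return re.sub(r" {2,}", lambda run: "&nbsp;" * len(run.group()), text)


print(preserve_space_runs("Re:   thing"))
```

A run of three spaces becomes three &nbsp; entities, which a browser will not collapse.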

Sorry about misspellings in the preceding comment (and other errata).

Please attach your document to your question for further analysis. For that, use the edit link under your question, then the paper clip tool (position the cursor at some convenient location, e.g. a paragraph by itself separated from the rest by at least a blank line).

Sigh. I hate being wrong. The included file is an example of just about everything. I was wrong in stating the header 2 “See Attachment” line was placed w/o spaces. It was placed with tabs. I don’t know what issues the API has but I do think it is possible to do better. The issues faced are close to those faced by compiler writers, and the solutions are very similar. What seems missing is (1) local optimizations, (2) global grouping, (3) use of odt parameters for things, such as paragraphs, headers, lists etc. The expectation of a user is that HTML output will look close to or identical to the odt file. It does not, and it is wordy. The “not look like” makes the end result marginally useful.

I don’t know the implementation language used, and I am unknowledgeable about libreoffice, but if I can help in any way just ask.

I just uploaded two files, example.odt and example.html.doc (which should be example.html). Notice that when you Save As an HTML file in LibreOffice, the space between “Re:” or “Subject:” and the following character is ignored, except for the inclusion of a <tab> character. In example.html I have corrected this by putting a <span></span> into the HTML document. My point is that this is something the HTML converter can do.
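
Here is a rough Python sketch of that hand correction, applied to a text fragment from the exported file. The inline style on the span and the 2em default width are my own assumptions for illustration (the span in example.html may be styled differently), and `tabs_to_spans` is a made-up name. It blindly replaces every tab, so it should only be run on text where the tab stands for horizontal spacing.

```python
def tabs_to_spans(fragment: str, width: str = "2em") -> str:
    """Replace each tab character with an empty fixed-width inline span.

    Assumption: an inline-block span of the given CSS width is an
    acceptable stand-in for the tab stop that HTML cannot express.
    """
    spacer = f'<span style="display: inline-block; width: {width}"></span>'
    return fragment.replace("\t", spacer)


print(tabs_to_spans("Re:\tthing"))
```

The width would have to be tuned per document, since a span has no knowledge of the tab stop positions defined in the .odt file.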

Much the same issues were presented in HTML Code Generator works poorly. The generator ignores the spacing in the .odt document, in the sense that the meaning of a ‘space’ is not carried into the HTML. This leads to an HTML representation that differs from the .odt document. The only answer that I received was that this is not a bug. I tried to say that what is needed is an HTML page which conforms as closely as possible to the look-and-feel of the original document, with no success. What I ended up doing was to hand-modify the output by putting in things like <span></span>, and never to use the HTML converter again. Instead, I do it all by hand. As to the question below about end-around coding and an HTML editor: same statement (by me), same response from the developers. Good luck.