
Multiple bugs in Writer to HTML conversion

asked 2020-03-23 20:53:07 +0200

ostbits

updated 2020-03-25 16:33:18 +0200

ajlittoz

Multiple bugs were detected in an .odt to HTML conversion. The .odt file is attached. I cannot upload the x.html file.

  1. Roman Numeral Chapter number is unrecognized.

    Instead of <h1>II thing</h1> the output is <h1>IVThing</h1>. Note that the chapter number and the title are not separated by a space.

  2. Alphabetic header numbering is not recognized.

    Instead of <h2> b thing </h2> I get <h2>bthing</h2>.

  3. Incorrect indentation of text following a header.

    Instead of <p>&nbsp;&nbsp;&nbsp;&nbsp;see attachment</p> or <p style="margin-left: xxpx">see attachment</p>, I get <p style="margin-top: 0.05in"> See Attachment</p>, and the spaces are ignored by browsers.

This is not an error, but why is the DOCTYPE HTML 4.0 Transitional instead of HTML 5?

Edited by ajlittoz for better formatting

C:\fakepath\thing.odt


2 Answers


answered 2020-03-24 12:20:10 +0200

ajlittoz

updated 2020-03-25 17:04:33 +0200

  1. No space after chapter numbering

    You must configure the properties of chapter numbering with Tools>Chapter Numbering. The settings you may have customised in Writer are not forwarded to the HTML editor. Treat them as separate applications as far as settings are concerned.

  2. Alphabetic numbering

    Same as 1. above. Configure all your levels, notably the "separator before" and "separator after" parameters.

  3. Indentation

    If indent is made with NBSP, those NBSP are present in the resultant HTML and rendered as expected.

    However, if you use ordinary spaces for your indent, those spaces are also present in the HTML (Writer takes no initiative about that and you can't blame it), but the W3C specifications say that any sequence of ordinary spaces is treated as a single space, and browsers render it accordingly (possibly stretching all the spaces in a line to justify it edge to edge). This is intended behaviour in browsers.

    Anyway, in Writer or in browsers, it is always a bad idea to indent a paragraph with spaces. It is much better, safer and more reliable to use styles (or, as a fallback, the cursors in the horizontal ruler). This will translate to CSS and give highly predictable results.
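
The whitespace rule described in point 3 can be illustrated with a small Python sketch. It is only a rough emulation of the default collapsing behaviour, not browser code, and the function name `collapse_whitespace` is made up for this example.

```python
import re

NBSP = "\u00a0"  # the character that the &nbsp; entity decodes to


def collapse_whitespace(text: str) -> str:
    """Roughly emulate default HTML text rendering.

    Runs of ordinary spaces, tabs and line breaks become one space,
    and spaces at the start or end of the line are dropped.
    NBSP is deliberately left out of the character class.
    """
    collapsed = re.sub(r"[ \t\r\n\f]+", " ", text)
    return collapsed.strip(" ")


# Indent made with ordinary spaces: the indent is lost.
print(repr(collapse_whitespace("    See Attachment")))

# Indent made with NBSP: the indent survives.
print(repr(collapse_whitespace(NBSP * 4 + "See Attachment")))
```

A style-based indent avoids the problem entirely, because it is exported as a CSS property (a margin or text indent) rather than as characters that the browser may collapse.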

Note that there are far fewer built-in styles in the Writer/HTML component than in Writer. But you can always define custom ones for your needs. They will translate to CSS to be rendered in a portable way by browsers.

To summarise: there is no bug in Writer/HTML, but rather a misconception of yours about the differences between HTML editing and document editing (and also a different base configuration in those components).

EDIT 1

CAUTION! Saving as .odt, or saving as .html from an .odt, is not the same as File>New>HTML Document followed by a save!

In the first case, you edit a sophisticated formatted document which is converted into an HTML one with embedded CSS directives trying to mimic Writer styles. In the second case, you start directly in HTML formatting with simplified styles corresponding to W3C standards. This could explain the differences. I conducted my test in the second context.

However, in both cases, you don't edit HTML markup directly. You still control Writer through its styles (full or reduced capabilities) and these styles are translated, perhaps not as you would like, by an export filter (which may be approximate; remember: Traduttore, traditore).

EDIT 2020-03-25

I had a look at the attached sample file.

  • It is an .odt file

    This may seem an obvious and odd remark from me. However, as I explained, File>New>Text Document and File>New>HTML Document are different and launch different LO components. By creating an .odt document, you format an ordinary paginated, to-be-printed document. With File>Save As in .html format, the document will be approximately translated from ODF concepts to HTML concepts.

    The first difference: from pages to continuous stream.

    A second difference, of utmost importance in your case: HTML has no notion of tab stops; consequently tabs are simply ignored (they could have been translated into a space but the output converter ...


Comments

Thanks for commenting. Let me try to address the issues as you have presented them.

  1. I have taken your advice and configured Chapter Numbering in Tools->Chapter Numbering so that there is a space between the 'number' and the title. This has had no effect on the output. The output shows "IIthing" instead of "II thing".
  2. The issue is moot. The heading2 entry has been replaced with a list.
  3. &nbsp; was never used. The text shows that this was an example output of the HTML generator that would accomplish the required spacing. My original test .odt file, which I believe I uploaded, shows that I did not insert any spaces. LibreOffice positioned the "See Attachment" line directly under the title in the "Heading 2" item. This placement was not captured by the HTML generator, which instead output a series of spaces preceding the text. As you have rightfully pointed out, consecutive ...
ostbits (2020-03-24 19:38:46 +0200)

The issue is that there are bugs in the HTML generator. The bugs include a failure to recognize .odt spacing in heading titles, a failure to recognize the automatic indentation that .odt files apply to the paragraph following a heading, and a failure to substitute &nbsp; for each space in a run of multiple spaces. Not shown in this example is the failure to properly indent lists within chapters.

Sorry about misspellings in the preceding comment (and other errata).

ostbits (2020-03-24 19:44:49 +0200)
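
For illustration, the substitution asked for in the comment above could look like the following Python sketch. It is a hypothetical post-processing step, not something the LibreOffice export filter is known to do, and `preserve_space_runs` is an invented name.

```python
import re


def preserve_space_runs(text: str) -> str:
    """Replace every space in a run of two or more spaces with &nbsp;.

    Single spaces are left alone so that lines can still wrap normally.
    This is an illustrative helper, not part of any LibreOffice API.
    """
    return re.sub(r" {2,}", lambda m: "&nbsp;" * len(m.group(0)), text)


print(preserve_space_runs("Re:   Subject"))
print(preserve_space_runs("one space only"))
```

Note that the sketch works on plain text only; applied blindly to a full HTML file it would also touch runs of spaces inside tags and attributes.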

Please attach your document to your question for further analysis. To do that, use the edit link under your question, then the paper clip tool (position the cursor at some convenient location, e.g. a paragraph by itself separated from the rest by at least one blank line).

ajlittoz (2020-03-24 20:00:04 +0200)

Sigh. I hate being wrong. The included file is an example of just about everything. I was wrong in stating that the Heading 2 "See Attachment" line was placed without spaces. It was placed with tabs. I don't know what issues the API has, but I do think it is possible to do better. The issues faced are close to those faced by compiler writers, and the solutions are very similar. What seems to be missing is (1) local optimizations, (2) global grouping, and (3) use of .odt parameters for things such as paragraphs, headers, lists, etc. A user expects the HTML output to look close or identical to the .odt file. It does not, and it is wordy. The difference in appearance makes the end result marginally useful.

I don't know the implementation language used, and I am not knowledgeable about LibreOffice, but if I can help ...

ostbits (2020-03-25 16:14:02 +0200)

I just uploaded two files, example.odt and example.html.doc (which should be example.html). Notice that when you Save As an HTML file in LibreOffice, the space between "Re:" and "Subject:" and the following character is ignored, except for the inclusion of a <tab> character. In example.html I have corrected this by putting a <span></span> into the HTML document. My point is that this is something the HTML converter can do.

ostbits (2020-03-26 23:52:03 +0200)
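
A minimal Python sketch of the idea in the comment above: replace each tab character with an empty inline-block <span> that reserves some horizontal space. The contents of the span in the uploaded example.html are not shown in this thread, so the style, the 2em default width and the name `tabs_to_spans` are all assumptions made for illustration.

```python
def tabs_to_spans(html_text: str, width_em: float = 2.0) -> str:
    """Replace each tab with a fixed-width, empty inline-block span.

    The width is an arbitrary assumption; real tab stops depend on the
    position in the line, which this simple replacement ignores.
    """
    spacer = f'<span style="display: inline-block; width: {width_em:g}em"></span>'
    return html_text.replace("\t", spacer)


print(tabs_to_spans("Re:\tSubject:"))
```

Because HTML has no notion of tab stops, a fixed-width spacer only approximates the Writer layout; columns will not line up the way they do with real tab stops.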

answered 2020-03-24 11:52:46 +0200

JEDIMASTER

updated 2020-03-24 20:17:49 +0200

Writer is a word processor, not an HTML editor. To use HTML5 you have to download the extension here: https://extensions.libreoffice.org/@@... or use a real HTML editor like the one at http://www.bluegriffon.org. By design, Writer cannot create HTML5 documents. Until you can get the developers to update that option, you will need either the extension I link to or a real HTML editor, not something that just exports to HTML. If this helps, then please tick the answer (✔) and/or show you like it with an uptick (∧).


Comments

Writer is a word processor with an option to convert and output an HTML file. Take a look at the File->Save As options. Since HTML files can be opened, the word processor can be used as an HTML editor. I may be wrong, but it seems to me that if you can (1) output an HTML file, (2) input an HTML file, and (3) modify an HTML file, then Writer is an HTML editor.

ostbits (2020-03-24 17:51:08 +0200)