HTML Code Generator works poorly

ostbits · March 18, 2020, 7:05pm

LibreOffice version: 6.3.5.2 (x64)
Win 7-64

The existing HTML converter is very bad. It looks like a macro-style generator that produces pretty awful HTML code. So I have two questions. The existing output is compliant with HTML 4.0; will this be changed to HTML5? And has the code generator been modified to produce better HTML?

When I open an exported HTML file in a browser, it looks OK, so I’m not arguing that the generated code renders badly. Rather, the quality of the code itself seems poor. Let me show you some examples:

<p class="western"><br/>

</p>

instead of <p></p> or <p><br/>

or

<a name="__RefHeading___Toc5237_2473493380"></a>Oct. 29, 2019 <font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt"><b>D</b>    </font></font><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt"><b>enial
 of Architecture Committee Documents</b></font></font><br/>

instead of 
<a name="__RefHeading___Toc5237_2473493380"></a>Oct. 29, 2019 <font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt"><b>D</b><b>enial of Architecture Committee Documents</b></font><br/>

or 
<p class="western">I hope you have a splendid day.</p>
<p class="western">All the best,</p>
<p class="western"><br/>

</p>

instead of
<p class="western">I hope you have a splendid day.</p>
<p>All the best,</p>
<p ><br/></p>

There are more examples, but these illustrate the idea. The output is much too wordy, and very difficult to deal with.

So, is any work being done on an HTML code generator replacement?
art

mikekaganski · March 18, 2020, 8:00pm

What you show does not demonstrate poor HTML, at least compared to what you proposed; in particular, in the first example of the last block you suggest invalid markup with non-matching closing tags. For example, you suggested dropping attributes like “class”, which might not be important to your use case but are important for round-tripping the text document’s information. LibreOffice is not an HTML editor; Writer can export to and import from HTML, and it tries to keep as much information as possible in that foreign format, so that the user can continue working with the text document after reopening it.

However, patches are welcome - if you have some ideas, file bug reports and come with fixes.

ostbits · March 18, 2020, 9:54pm

Hi Mike. Thanks for the comments, but I think you misunderstood the examples (please correct me if I’m wrong). The first example was the output of a vacuous  . Am I mistaken in assuming that there is no information to be gained by including the class attribute? I assumed that a blank line was a blank line. Thank you for correcting me and for pointing out that the definition of a blank line depends on the language; I did not know this. I did not say to abandon the ‘class’ attribute, but I wonder whether it could be used less often without losing any of its useful properties. And I do think that, where possible, surrounding a group of statements with a single attribute is better than repeating it on each one. My second and third examples show that. As for end-around programming, the suggestions add complexity to generation and abstraction, but do not inhibit it. You would have to recognize a ‘group’ where now you don’t. Sorry about .

ostbits · March 18, 2020, 9:55pm

example should be:  

mikekaganski · March 19, 2020, 5:59am

@ostbits:

Case   → 

The <p> tag represents the paragraph concept in Writer; the <br/> is for a hard line break (the thing you typically insert using Shift+Enter), which does not start a new paragraph. Only if there were no line break character in your “empty paragraph” should this be treated as a bug. But you didn’t provide a sample ODT with the source text that generates the output you consider bad, so I suspect it’s a misconception on your part.

The “western” class on an empty paragraph is a perfectly normal and useful thing. It is a property of any paragraph (including an empty one); in the case of an empty paragraph it defines, for example, that when the user later starts typing in that paragraph, the text will be LTR, not the other way round. Omitting it would lose this information on round-trip.

mikekaganski · March 19, 2020, 6:08am

Case ......

In this case, Writer follows its internal structure for direct formatting. You have several runs of text in your document, each with a different set of direct formatting. Writer handles that by defining a property set for each such run. For the markup in question, there’s a run with a set font and bold, and another run with a set font (that happens to coincide) and not bold. Writer outputs it that way, and it doesn’t do any complex processing to detect overlap of some attributes. Such processing would not settle every case anyway, e.g. when attributes overlap only partially, like here:

Text without effects bold text bold italic text italic text

Should this be split by overlapping bold or by overlapping italic?
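To make the ambiguity concrete, here is a minimal Python sketch. It assumes a hypothetical model in which each run is a (text, set of tags) pair; the names `RUNS`, `wrap` and `nest` are invented for illustration and are not part of LibreOffice. It builds the two equally valid nestings for the sample line above, one keeping bold as the outer element and one keeping italic.

```python
from itertools import groupby

# The four runs from the sample line: plain, bold, bold+italic, italic.
RUNS = [
    ("Text without effects ", frozenset()),
    ("bold text ", frozenset({"b"})),
    ("bold italic text ", frozenset({"b", "i"})),
    ("italic text", frozenset({"i"})),
]

def wrap(tag, content):
    """Surround content with an opening and closing tag."""
    return "<%s>%s</%s>" % (tag, content, tag)

def nest(runs, outer, inner):
    """Emit `outer` once per group of consecutive runs that carry it;
    emit `inner` separately for every run that carries it.
    Text is not escaped here because the sample has no special characters."""
    parts = []
    for has_outer, group in groupby(runs, key=lambda run: outer in run[1]):
        body = "".join(
            wrap(inner, text) if inner in tags else text
            for text, tags in group
        )
        parts.append(wrap(outer, body) if has_outer else body)
    return "".join(parts)

by_bold = nest(RUNS, "b", "i")    # bold spans two runs, italic is split
by_italic = nest(RUNS, "i", "b")  # italic spans two runs, bold is split
```

Both results are well-formed and render the same, but one of the two attributes always has to be split, so there is no single "minimal" answer for an exporter to pick.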

mikekaganski · March 19, 2020, 6:12am

Writer uses a simple and scalable approach of handling each such run separately (“go to next run with different format; output all formatting of this run; output the run’s contents; proceed to next run; rinse and repeat”). This keeps all the information necessary for round-trip - and again: LO is not an HTML editor that pursues some “minimal” or “cleanest” HTML, only correct HTML. There’s a concept of character styles, which in theory allows nesting. That would ideally produce something like what you need. Unfortunately, the UI for generation of nested character style runs is missing (tdf#115311). But that’s orthogonal to the HTML export.
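The run-by-run loop described above can be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the run model (a list of (text, properties) pairs), the property keys and the function name `export_runs` are all hypothetical, and this is not Writer's actual export code.

```python
from html import escape

def export_runs(runs):
    """Export each run on its own: open every tag the run needs,
    write the text, then close the tags in reverse order.
    No attempt is made to share tags between neighbouring runs."""
    out = []
    for text, props in runs:
        open_tags, close_tags = [], []
        if "face" in props:  # hypothetical property key
            open_tags.append('<font face="%s">' % escape(props["face"]))
            close_tags.append("</font>")
        if props.get("italic"):
            open_tags.append("<i>")
            close_tags.append("</i>")
        if props.get("bold"):
            open_tags.append("<b>")
            close_tags.append("</b>")
        out.append("".join(open_tags) + escape(text) + "".join(reversed(close_tags)))
    return "".join(out)

# Two neighbouring runs with identical formatting still get separate tags.
sample = export_runs([
    ("D", {"face": "Liberation Serif, serif", "bold": True}),
    ("enial", {"face": "Liberation Serif, serif", "bold": True}),
])
```

Each run is self-contained, so the output is always well-formed and the loop needs no look-ahead; the cost is the repeated markup seen in the examples earlier in the thread.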

ostbits · March 19, 2020, 6:05pm

Hi Mike; I don’t want a war. I think you are the developer, and should rightly be proud of your work, and I am the consumer, with a different perspective. The generated HTML is too verbose. I think that we should chat in private, and not be limited to 1000 characters. In order to do an end-around you need a one-to-one and onto mapping (a bijection) between the exported code and what is imported back. But this can be done in a number of ways. It looks like you did a single left-to-right scan of the input and generated an output stream based on what was just ‘read’. If you investigate the generated output (rather than dealing with LibreOffice UI issues) you can optimize it. Several ‘easy’ optimizations are possible. Determine which constructs are used most, and either annotate them or emit HTML code that can be reused. And optimize the output, as in the example. And you can now tell me that that is “awful”. It’s a win-win.
art

ostbits · March 19, 2020, 7:23pm

Here is one example of the types of local optimizations which are possible:

before

<h1 class="western"><a name="__RefHeading___Toc5247_2473493380"></a>Nov.
5, 2019 <font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt"><i><b>C</b></i></font></font><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt"><i><b>ivil
Code</b></i></font></font><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt"><b>
&sect;5200 is </b></font></font><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt"><b>not
</b></font></font><font face="Liberation Serif, serif"><font size="3" style="font-size: 12pt"><b>Prohibitory</b></font></font><br/>
<br/>
<br/>

</h1>

after

<h1 class="western"><a name="__RefHeading___Toc5247_2473493380"></a>Nov. 5, 2019 <font face="Liberation Serif, serif" size="3" style="font-size: 12pt"><b><i>Civil Code</i> &sect;5200 is not Prohibitory</b></font><br/></h1>
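One way to picture this kind of local optimization is a pre-pass that joins neighbouring runs whose formatting is identical before any markup is written. The Python sketch below assumes a hypothetical run model of (text, properties) pairs; `merge_adjacent_runs` is an invented name, not an existing LibreOffice function.

```python
def merge_adjacent_runs(runs):
    """Join consecutive runs whose property sets compare equal.
    Runs with different formatting are left untouched, so no
    formatting information is lost by the merge."""
    merged = []
    for text, props in runs:
        if merged and merged[-1][1] == props:
            # Same formatting as the previous run: extend its text.
            merged[-1] = (merged[-1][0] + text, props)
        else:
            merged.append((text, props))
    return merged

# The five runs from the "before" markup, with font properties left out for brevity.
before = [
    ("C", {"bold": True, "italic": True}),
    ("ivil Code", {"bold": True, "italic": True}),
    (" &sect;5200 is ", {"bold": True}),
    ("not ", {"bold": True}),
    ("Prohibitory", {"bold": True}),
]
after = merge_adjacent_runs(before)
```

On this input the five runs collapse to two, one bold italic and one bold. The pass only handles exact matches; hoisting a shared attribute such as the font over runs that differ in other attributes would need the more complex overlap analysis discussed earlier in the thread.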