Import HTML preserving style (or style name)

As explained in Preserving classes as styles when opening HTML documents.

I need to preserve the style name when I import an HTML file, for example:

<p><span class="direct">bla bla bla</span></p>

I really need to obtain an ODT document with a character style named “direct”, but I don’t care about the actual style effect.

Is there a way to do that?

I don’t think this makes sense. ODF and HTML don’t address the same domains. IMHO, HTML has a much wider scope because it is not centered exclusively on text rendering.


Also, class="…" doesn’t designate a “style”. It can be used for many other purposes. Think of all the various frameworks available today. They heavily use <div class=some_id> to designate groups of locations. This is useful for positioning blocks around the screen.


The CSS structure is not the same as ODF. In HTML, you can write <span class="style1 style2"> to request application of two CSS decorations (after resolving possible conflicts through CSS selectors), whereas in ODF you can have a single character style only.


I understand your purpose of preserving the semantic markup present in the HTML file. Perhaps the solution would be a transformation of your HTML into ODF XML. This may be feasible with rewriting tools such as XSLT. Be prepared to add the necessary ODF preamble to obtain a usable ODF file. You can afterwards create the style definitions manually in Writer.
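To illustrate the kind of rewriting involved, here is a minimal Python sketch (Python is only my choice for illustration, nothing in this thread uses it). It maps the example paragraph to ODF text elements, assuming the input is well-formed XHTML limited to <p> and <span> and that keeping only the first class name is acceptable. The function name html_to_odf is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace of the ODF "text" vocabulary (text:p, text:span, text:style-name).
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
ET.register_namespace("text", TEXT_NS)


def html_to_odf(element):
    """Recursively map <p> and <span class="..."> to text:p and text:span."""
    if element.tag not in ("p", "span"):
        raise ValueError(f"unsupported element: {element.tag}")
    out = ET.Element(f"{{{TEXT_NS}}}{element.tag}")
    classes = element.get("class", "").split()
    if classes:
        # ODF takes a single style per element, so only the first class is kept.
        out.set(f"{{{TEXT_NS}}}style-name", classes[0])
    out.text = element.text
    for child in element:
        converted = html_to_odf(child)
        converted.tail = child.tail  # keep the text that follows the child
        out.append(converted)
    return out


source = ET.fromstring('<p><span class="direct">bla bla bla</span></p>')
result = ET.tostring(html_to_odf(source), encoding="unicode")
print(result)
```

The output is only a fragment. It still has to be placed inside a complete ODF document that defines a style named direct, otherwise the style name refers to nothing.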

Hello Ajlittoz,
well, I understood that styles were LibreOffice’s way of semantically defining the text, but OK, maybe my understanding is wrong.

Anyway, your suggestion seems to be a good idea, but I think I need some help to do that.
I can easily create an odt+xml file with pandoc, but when I open the document I can’t find any style.

I can try with the EregReplace tool (the text is quite simple), but what should I convert it into? I mean, what tag should I use so that it is interpreted as a style?

P.S.
It would be a great feature if we could set the style to use/preserve during the import of an HTML or XML file.

Styles are indeed, when “correctly” used, a semantic markup of your document.


I have not yet attempted to create an ODF Writer document directly. From what I see when I save as .fodt, the XML contains references to styles. But it is not that simple. All style definitions are embedded inside the XML in a style dictionary, even built-in styles. This makes the file self-supporting and immune to changes between releases (i.e. developers may change style definitions without affecting existing documents). But there are subtleties. You can rename styles. To avoid parsing the XML to change style names, all styles receive an internal alias name and this alias name is used throughout the XML. Thus, you can change the style name in the dictionary, the alias doesn’t change and the document contents need not be rewritten. Cute!
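In ODF the two names are carried by the style:name (internal alias) and style:display-name attributes of a style:style element. Here is a rough Python sketch of one dictionary entry. The helper names are hypothetical, and the encoding of the internal name is a simplification (every character that is not a letter or digit becomes _hex_), so it may not match Writer’s own rules in every case.

```python
import xml.etree.ElementTree as ET

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
ET.register_namespace("style", STYLE_NS)


def internal_name(display_name):
    """Simplified alias: keep letters and digits, hex-encode everything else."""
    return "".join(
        ch if ch.isalnum() else f"_{ord(ch):x}_" for ch in display_name
    )


def character_style(display_name):
    """Build one style dictionary entry for a character style."""
    style = ET.Element(f"{{{STYLE_NS}}}style")
    style.set(f"{{{STYLE_NS}}}name", internal_name(display_name))
    style.set(f"{{{STYLE_NS}}}display-name", display_name)
    style.set(f"{{{STYLE_NS}}}family", "text")  # family "text" = character style
    return style


entry = character_style("Source Text")
print(ET.tostring(entry, encoding="unicode"))
```

The document contents would refer to Source_20_Text, while the style list in Writer shows the display name, which is why renaming touches only the dictionary.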


Adjusting HTML to make it look like ODF file contents is the “easy” part: ODF is a public standard and its OASIS specification is available on the internet. The difficult part is to generate the ODF/OASIS skeleton (preamble + style dictionary) to put the flesh (document contents) on. To succeed here, you need to study the ODF standard very carefully and thoroughly. And it’s likely you won’t succeed on the first attempt.
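As a starting point only, here is a Python sketch of what a very small flat ODF (.fodt) skeleton might look like: a root element with the text document MIME type, a style dictionary, and a body. It assumes the style names are already valid XML names and that an empty character style definition is enough for the name to be registered. I have not checked it against every LibreOffice release, and build_fodt is a hypothetical helper.

```python
from xml.sax.saxutils import escape, quoteattr

# Preamble, style dictionary and body of a flat ODF text document.
TEMPLATE = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<office:document'
    ' xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
    ' xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"'
    ' xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0"'
    ' office:version="1.2"'
    ' office:mimetype="application/vnd.oasis.opendocument.text">\n'
    ' <office:styles>\n{styles} </office:styles>\n'
    ' <office:body>\n  <office:text>\n{body}  </office:text>\n </office:body>\n'
    '</office:document>\n'
)


def build_fodt(paragraphs):
    """paragraphs: list of (text, character_style_name) pairs."""
    names = sorted({name for _, name in paragraphs})
    # One empty character style per name, so every reference is defined.
    styles = "".join(
        f'  <style:style style:name={quoteattr(n)} style:family="text"/>\n'
        for n in names
    )
    body = "".join(
        f'   <text:p><text:span text:style-name={quoteattr(n)}>'
        f'{escape(t)}</text:span></text:p>\n'
        for t, n in paragraphs
    )
    return TEMPLATE.format(styles=styles, body=body)


document = build_fodt([("bla bla bla", "direct")])
print(document)
```

Saving the printed text with a .fodt extension gives a file to experiment with. The formatting of each style can then be defined manually in Writer.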

Styles are LibreOffice’s way of semantically defining text, but the styles CSS/HTML use and the styles LibreOffice uses are different things entirely. Writer, the word processor part of LibreOffice, uses XML-based Paragraph, Page, Frame and Character styles.

Well, I’m not sure about the conversion, so I attach a simple example.
Only the main title is recognized; h1 is understood in the Navigator panel but it isn’t linked to the standard heading style.
tryConvertionHtml2ODT.zip.odt (2.2 KB)
The file is a zip (I had to add the .odt extension for upload).

I tried renaming to tryConvertionHtml2ODT.odt and got a message saying it was a corrupt file.

You should remove the .odt extension, as the OP said it is a .zip file. You then have access to two files.

Please describe what should be checked. I understand that tryconvertIntoODF.xhtml is the original file to convert. But I can’t manage to load output.xml into Writer and have it recognised as a text document. All I get is a display of the XML.

What I see is all <span> elements have disappeared.

Also, I don’t understand the First paragraph styling, as I don’t see any reference to it in the original file. Have you forced it manually? Have you configured the conversion process to insert <head><title> as the very first paragraph styled Title? No clue or help can be given if we have no idea of what you’ve already done and how.

One CSS feature which can’t be transposed into ODF is ::before and ::after. Most CSS “styles” can probably be translated into ODF styles, but this requires deep analysis and careful programming.

Ehm… The file is a zip, you should remove .odt not .zip

Well, I open the output XML by selecting “non compressed XML document”.

I know ::before and ::after can’t be rendered in LibreOffice (it would be a nice improvement), but I obtained output.xml using pandoc.

What I’m trying to understand is whether the styles defined in output.xml are well-formed or not, and if not, what’s missing.

The main problem with styles is that they are used to tag the paragraphs but they are not described in the style dictionary. Therefore Writer reverts to Default Paragraph Style when it loads the document.
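A quick way to spot this kind of gap is to compare the style names referenced in the body with those defined in the dictionary. The Python sketch below does that on a made-up sample (it is not the OP’s output.xml). It is deliberately simplified: it ignores style families and only looks at text:style-name references, and undefined_styles is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

STYLE_NS = "urn:oasis:names:tc:opendocument:xmlns:style:1.0"
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"

# Hypothetical sample: "Title" is defined, "First_20_paragraph" is not.
SAMPLE = """<office:document
  xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
  xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0"
  xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <office:styles>
    <style:style style:name="Title" style:family="paragraph"/>
  </office:styles>
  <office:body><office:text>
    <text:p text:style-name="Title">Some title</text:p>
    <text:p text:style-name="First_20_paragraph">Some text</text:p>
  </office:text></office:body>
</office:document>"""


def undefined_styles(xml_text):
    """Return style names that are referenced but never defined."""
    root = ET.fromstring(xml_text)
    defined = {
        el.get(f"{{{STYLE_NS}}}name") for el in root.iter(f"{{{STYLE_NS}}}style")
    }
    key = f"{{{TEXT_NS}}}style-name"
    used = {el.get(key) for el in root.iter() if el.get(key)}
    return sorted(used - defined)


missing = undefined_styles(SAMPLE)
print(missing)  # → ['First_20_paragraph']
```

Any name in the resulting list needs a matching style:style entry before the style can survive loading.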

I don’t understand then why Title is “correctly” (as far as I can tell) interpreted, without added direct formatting.

“Try convertion …” is attached to outline level 1 through direct formatting. “Title 1” paragraph has also received an outline level 1 direct formatting which explains why these paragraphs are listed in the navigator.

I think the problem lies in pandoc. From Wikipedia:

Maintaining the look and feel of the document is not a priority.

Pandoc apparently converts only the (semantically significant?) markup, leaving aside all the backstore information needed to make a true ODF document. I am surprised that, with a claim to keep markup, all <span> elements, which are an important part of semantic markup leading to character styles, are simply ignored.