Converting to epub text with greek letters / symbols

cmp · November 16, 2023, 11:18am

Hi all,
I am trying to convert to epub an Italian doc (but it could be English as well) that contains Greek letters (math symbols) in the text, as well as formulas created with LO Math and embedded in Writer. When I use the built-in “export to epub…” feature, the resulting text shows the Greek letters as small empty squares. I can get around this problem for formulas by pasting an image of the formula into the text, but this would be time-consuming for all the Greek letters in the text. I have searched around with no success. Looking for suggestions…
Using LO 7.5.8.2 on Win 11
Thanks!
Cesare.

ajlittoz · November 16, 2023, 11:39am

Please edit your question to mention OS name, LO version and save format. Don’t use a comment; modify your question instead, as this is more contributor-friendly.

Usually the presence of empty rectangles signals an issue with the font, i.e. missing glyphs in the selected font. Does conversion to epub imply that a single font face must be used throughout the document?

Can you attach a single-page sample containing your Greek letters with other text for further analysis?

cmp · November 16, 2023, 12:55pm

I can only attach one file per post. Attaching a single page from ODT; I can also share a screen capture of the same file open in Calibre Viewer where the problem shows up.
Thanks.

EX2-ODT.odt (52.9 KB)

cmp · November 16, 2023, 4:18pm

Made some progress thanks to the hint about fonts. The characters that are not correctly displayed are in font “Segoe UI Symbol”. If I re-type them with “Times New Roman” using the Windows Greek keyboard they show up correctly.
I will do some more experiments, even though I think this is the solution, and then I will look at the formulas. As mentioned, the formulas created using the built-in formula feature don’t show up in the epub format.
Thanks!

ajlittoz · November 16, 2023, 4:46pm

Formulas are kind of images. I am not familiar with ePub format. So check if images can be exported to ePub. I know some formats only accept text.

EarnestAl · November 16, 2023, 6:55pm

This family of fonts has a restrictive licence, which might affect embedding.

cmp · November 16, 2023, 9:27pm

After further investigation, it seems it was more complex than just the font, and it was peculiar to my case.

The doc was imported from a PDF into docx with Word, and then opened in Writer and saved as ODT. The characters that could not be handled in epub were probably special in some way, from some strange codepage. For instance, take a “lambda” that appeared as an empty box in epub: once I deleted it and re-typed it using the Windows Greek keyboard, it looked a bit different graphically (so it must have been a different character), but then it showed up correctly in epub. The strange thing is that the font was the same (Times New Roman); I just deleted the “lambda” inherited from the PDF and typed a “fresh one” from the keyboard within the same font.
So, on this point I believe now I know exactly the conditions for this to happen and how to fix it, even though I don’t have the knowledge to understand the root cause.

Regarding formulas, I am using a workaround I already used in the past: make an image of the formula and paste it alongside the regular formula, so I can still change it if needed. Only the image will show up in the epub.

I am happy with the solution found. I just hope it can be useful to others, as it was a pretty special case.

Thanks for your help.
Cesare.

ajlittoz · November 17, 2023, 8:06am

If you suspect a “bad” character encoding, you can check for it. Many third-party legacy fonts, when “converted” to Unicode, store their vendor-specific glyphs in the Private Use Area (PUA) which, as its name suggests, is not standardised; it kind of works as long as you use the original font and breaks with now-standard fonts. Check as follows:

put the cursor immediately to the right of the offending glyph (empty rectangle)
press Alt+X
the glyph is converted to “U+abcd”, where “abcd” is the hexadecimal code point

If you read something between “U+E000” and “U+F8FF”, text was originally typeset with one of these converted legacy Windows fonts.
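The same range check can be done outside Writer. Below is a minimal Python sketch, not anything LibreOffice provides: the function names are hypothetical, and the sample string is made up to contain the U+F04C character reported in this thread next to a standard Greek lambda (U+03BB).

```python
# Private Use Area bounds in the Basic Multilingual Plane.
PUA_START, PUA_END = 0xE000, 0xF8FF


def code_point_label(ch: str) -> str:
    """Format a character the way Alt+X shows it, e.g. 'U+F04C'."""
    return f"U+{ord(ch):04X}"


def is_private_use(ch: str) -> bool:
    """True if the character lies between U+E000 and U+F8FF."""
    return PUA_START <= ord(ch) <= PUA_END


def find_private_use(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for every PUA character in text."""
    return [(i, code_point_label(ch))
            for i, ch in enumerate(text) if is_private_use(ch)]


# Made-up sample: one PUA character, one standard Greek lambda.
sample = "wavelength \uf04c = 500 nm, \u03bb = 500 nm"
print(find_private_use(sample))  # [(11, 'U+F04C')]
```

Only the first symbol is flagged; the standard lambda at U+03BB is outside the range and passes.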

cmp · November 17, 2023, 8:33am

You are right. In the case of lambda, the encoding was U+f04c. LO can handle them, but the epub format doesn’t support them.

The doc was written 10 or 15 years ago (probably in Word on a Win PC) and made into PDF; now I am creating an epub from the PDF. I did not want to convert straight from PDF using Calibre as the resulting file would be hard to read on the small screens of e-readers. So I went through the PDF->DOCX->ODT->EPUB route using LO.

Now it’s all clear, and the solution is just to re-type those characters.
Thanks!
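To find every character that needs re-typing, something like the following might help. It is only a sketch under a few assumptions: an ODT file is a ZIP archive whose body text lives in `content.xml`, the function name `pua_in_odt` is made up, and text outside `content.xml` (headers and footers in `styles.xml`, for instance) is not examined.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter


def pua_in_odt(path: str) -> dict[str, int]:
    """Count Private Use Area characters in the body text of an ODT file.

    Assumption: the body text is stored in content.xml inside the ZIP.
    """
    with zipfile.ZipFile(path) as odt:
        root = ET.fromstring(odt.read("content.xml"))
    # itertext() yields the text of every element, with character
    # references such as &#xF04C; already decoded.
    text = "".join(root.itertext())
    counts = Counter(
        f"U+{ord(ch):04X}" for ch in text if 0xE000 <= ord(ch) <= 0xF8FF
    )
    return dict(counts)
```

Calling `pua_in_odt("EX2-ODT.odt")` would return a dictionary mapping each offending code point to the number of times it occurs; an empty dictionary means nothing in the body text falls in the PUA.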

ajlittoz · November 17, 2023, 9:23am

The full explanation is then as follows:

you created the document under an old Windows version, at a time when character sets were limited to 256 characters, the higher 128 (more exactly 96) of them being vendor-defined
the font was converted as I described when Windows went to Unicode, with the vendor glyphs being sent to the PUA
some time later, the font was replaced with a more modern format (e.g. going from bitmap instances to single scalable vector shapes) containing only the “standardised” characters, i.e. discarding the PUA which by definition is not part of any standard

When you reopen your document nowadays, the PUA glyphs are missing and are replaced with a special glyph to warn you about the problem.
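The conversion described above can be illustrated with a small Python sketch. Everything in it is an assumption rather than a description of what Windows or LibreOffice actually does: it supposes the legacy font followed the common Symbol-font layout (where the byte for “l” is lambda and the byte for “L” is capital Lambda) and that each 8-bit code was moved to 0xF000 plus the byte value. The table is deliberately partial and the names are hypothetical; a font with a different layout would need a different table.

```python
PUA_BASE = 0xF000  # assumed offset applied to the legacy 8-bit codes

# Partial table, assuming a Symbol-style layout: legacy byte -> Unicode.
SYMBOL_TO_UNICODE = {
    0x61: "\u03b1",  # 'a' -> Greek small alpha
    0x62: "\u03b2",  # 'b' -> Greek small beta
    0x6C: "\u03bb",  # 'l' -> Greek small lambda
    0x6D: "\u03bc",  # 'm' -> Greek small mu
    0x70: "\u03c0",  # 'p' -> Greek small pi
    0x4C: "\u039b",  # 'L' -> Greek capital Lambda
}


def remap_symbol_pua(text: str) -> str:
    """Replace PUA characters found in the table; leave the rest as-is."""
    out = []
    for ch in text:
        legacy_byte = ord(ch) - PUA_BASE
        if 0x20 <= legacy_byte <= 0xFF:
            # Unknown bytes fall back to the original character.
            out.append(SYMBOL_TO_UNICODE.get(legacy_byte, ch))
        else:
            out.append(ch)
    return "".join(out)


print(remap_symbol_pua("\uf06c = 500 nm"))
```

Under these assumptions, U+F04C decomposes into 0xF000 + 0x4C, i.e. the byte that an ordinary font would show as “L”. Characters missing from the table are passed through unchanged, so they can still be found and re-typed by hand.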

I don’t think “ePub doesn’t support them”. I rather think that the export filter, seeing that the glyph is missing, just ignores the character. A developer could tell for sure.

cmp · November 17, 2023, 10:05am

Perfect.
Just one final detail to clarify: these “offending characters” do show up correctly in LibreOffice (lambda looks like lambda). Apparently they are not handled (i.e. discarded and replaced by the empty square) by the epub converter. That’s why I only found out when I created the epub.

ajlittoz · November 17, 2023, 10:23am

Handling in Writer is quite different from what the export filter does. When a glyph is missing from a font, the font renderer will try to substitute another one. It looks for a font in the same “family” that offers the glyph (don’t ask me how membership in a family is implemented; I don’t know it beyond the fact that there are tables in the font file for that) before reverting to more general fonts, ending with a default system font.

This procedure is not followed by the export filter because it does not go through the font renderer.