Pyuno convert docx to pdf

Problem: Unexpected character in pdf export from docx

Is this specific to pyuno?

This is the guide - How to use the Ask site? - #3 by Hrbrgr

The PDF is of course generated with pyuno, but some strange characters (three wavy lines) appear in it. How can I avoid this?

I tried opening in Word the sample DOCX file you provided, to see if the strange characters are there, but Word failed to open it, because it couldn’t find the sample file.


This is my file. It looks fine as DOCX, but strange in the PDF.
12.docx (11.7 KB)

still …

The file has a PUA (Private Use Area) character, U+E008, at the end of the “bacon” line. This must mean that it depends on some specific font on the creator’s system, which is needed on others’ systems to display it as intended. Neither your system nor mine has that font.

On my system, this is how MS Word shows it:

Note the whitespace before the pilcrow in the discussed line.

Now this is what I see in LibreOffice 24.8.2:

A black circle is shown there. It likely happens to be present in some installed font at that position, and is used as the substitute. On your system, obviously, the substitute character happens to be that strange wave-like symbol.

This is the result of the best-effort attempt that LibreOffice does, to show you the character which is used in the document, but is absent in the font applied to the text.

Ask the author of the document to mark the special characters they insert with the proper font, and provide that font along with the document. Or better don’t use the PUA characters at all.
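
To find out which characters in a line fall into a Private Use Area, a small check on the extracted text may help. This is only a sketch: `find_private_use` is a hypothetical helper name, and the sample string is an assumption built from the line discussed here, with U+E008 written as an escape so that it is visible. It relies on Python's standard `unicodedata` module, which reports the general category `Co` for private-use code points.

```python
import unicodedata


def find_private_use(text):
    """Return (index, 'U+XXXX') for every private-use character in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"  # "Co" = Other, private use
    ]


# Assumed sample: the "bacon" line with the PUA character appended explicitly.
line = "bacon/ˈbeɪkən/n.咸猪肉;熏猪肉\ue008"
print(find_private_use(line))
```

An empty list means the text contains no private-use characters, so any odd glyph in the PDF would have a different cause.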


I see. So the solution is to replace the PUA (Private Use Area) character with an empty string.

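import re
# The line from the document; the trailing PUA character U+E008 is written as an escape so it is visible.
wrong_text = "bacon/ˈbeɪkən/n.咸猪肉;熏猪肉\ue008"
# Drop the final character if it is not a CJK ideograph (U+4E00 to U+9FFF).
expect_text = re.sub('[^\u4e00-\u9fff]$', '', wrong_text)
# 'bacon/ˈbeɪkən/n.咸猪肉;熏猪肉'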

Thanks for your detailed answer. I wasn’t sure what the unexpected character was, or how the system renders and represents it. Your reply made it clear that I should delete the last character of a line when it isn’t a Chinese character.
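
One caveat with matching "last character is not Chinese": it would also drop a legitimate trailing Latin letter or punctuation mark. A narrower variant, sketched here as an assumption rather than a tested fix for this document, removes only code points in the three Unicode Private Use Areas (U+E000 to U+F8FF, plus the two supplementary planes). `PUA_RE` and `strip_private_use` are hypothetical names, and the sample string again has U+E008 written as an explicit escape.

```python
import re

# The BMP Private Use Area plus Supplementary Private Use Areas A and B.
PUA_RE = re.compile("[\ue000-\uf8ff\U000f0000-\U000ffffd\U00100000-\U0010fffd]")


def strip_private_use(text):
    """Remove every private-use character, wherever it occurs in text."""
    return PUA_RE.sub("", text)


# Assumed sample: the "bacon" line with the PUA character appended explicitly.
wrong_text = "bacon/ˈbeɪkən/n.咸猪肉;熏猪肉\ue008"
clean_text = strip_private_use(wrong_text)
print(clean_text)
```

Lines that end in ordinary punctuation or Latin text pass through unchanged, because only private-use code points are matched.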