Problem: Unexpected character in PDF export from DOCX
The export is generated by pyuno, but some strange characters (three wavy lines) appear in the PDF. How can I avoid this?
I tried opening in Word the sample DOCX file you provided, to see if the strange characters are there, but Word failed to open it, because it couldn't find the sample file.
still …
The file has a PUA (Private Use Area) character, U+E008, at the end of the "bacon" line. This must mean that it depends on some specific font on the creator's system, which others' systems also need in order to display it as intended. Neither your system nor mine has that font.
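A quick way to confirm which characters in a string are private-use is to check their Unicode general category, which is "Co" for PUA code points. This is only a minimal sketch; the helper name `find_private_use` is made up for illustration.

```python
import unicodedata


def find_private_use(text):
    """Return (index, 'U+XXXX') for every private-use code point in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"  # "Co" = Other, private use
    ]


# The "bacon" line ends with U+E008, written here as an escape.
print(find_private_use("bacon\ue008"))  # [(5, 'U+E008')]
```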
On my system, this is how MS Word shows it:
Note the whitespace before the pilcrow in the discussed line.
Now this is what I see in LibreOffice 24.8.2:
A black circle is shown there. It likely happens to be present in some installed font at that position, and is used as the substitute. On your system, obviously, the substitute character happens to be that strange wave-like symbol.
This is the result of LibreOffice's best-effort attempt to show you a character that is used in the document but is absent from the font applied to the text.
Ask the author of the document to mark the special characters they insert with the proper font, and to provide that font along with the document. Or, better, don't use PUA characters at all.
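To find where such characters sit in a document before exporting it, one option is to read the DOCX directly: it is a ZIP archive whose main body is stored in word/document.xml. The sketch below uses only the standard library and assumes the characters are in ordinary `w:t` text runs. The function name `private_use_runs` is hypothetical. Headers, footers and symbols inserted through other elements are not covered.

```python
import unicodedata
import xml.etree.ElementTree as ET
import zipfile

# WordprocessingML namespace used by elements in word/document.xml.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def private_use_runs(docx):
    """Yield the text of each <w:t> run that contains a private-use character.

    `docx` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(docx) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))
    for node in root.iter(W_NS + "t"):
        text = node.text or ""
        if any(unicodedata.category(ch) == "Co" for ch in text):
            yield text
```

Calling `list(private_use_runs("sample.docx"))` (the file name is a placeholder) would list the affected runs, which can then be reported to the document's author.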
I see. Now the solution is to replace the PUA (Private Use Area) character with an empty string.
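import re
# The line ends with the PUA character U+E008, written here as an escape.
wrong_text = "bacon/ˈbeɪkən/n.咸猪肉；熏猪肉\ue008"
# Drop one trailing character if it is not a CJK ideograph.
expect_text = re.sub('[^\u4e00-\u9fff]$', '', wrong_text)
# 'bacon/ˈbeɪkən/n.咸猪肉；熏猪肉'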
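The pattern above removes any single trailing character that is not a Chinese ideograph, so it would also strip a legitimate trailing letter or punctuation mark. A narrower alternative, sketched here as an assumption rather than a tested fix for the original document, is to delete only code points in the three Unicode private-use ranges, wherever they occur. The name `strip_private_use` is made up for illustration.

```python
import re

# The three private-use ranges: the BMP block plus planes 15 and 16.
PUA_RE = re.compile(
    "[\ue000-\uf8ff\U000f0000-\U000ffffd\U00100000-\U0010fffd]"
)


def strip_private_use(text):
    """Remove every private-use code point, leaving all other text intact."""
    return PUA_RE.sub("", text)


print(strip_private_use("bacon\ue008"))  # bacon
```

Unlike the end-anchored pattern, this leaves trailing punctuation such as a final period untouched.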
Thanks for your detailed answer. I wasn't sure what the unexpected character was, or how the system renders and represents it. Your reply made it clear that I should delete any trailing character that isn't a Chinese character.