Problem with showing/converting PT chars in LibreOffice

Hi,

We have a problem converting a docx file with soffice, and also with displaying the original docx file.

$cmd = '"C:\Program Files\LibreOffice\program\soffice.com" "-env:UserInstallation=file:///' . str_replace("\\", "/", $tempLibreOfficeProfile) . '" --headless --convert-to pdf --outdir "' . str_replace("/", "\\", $FILESERVER) . "temp" . '" "' . str_replace("/", "\\", $TmpFile) . '"';

In any file I open with LibreOffice on my Windows machine, I get emoji instead of PT chars. The same happens when converting files, and I haven’t found a solution.

Can you please help fix the encoding in LibreOffice, or change it?

I have uploaded a small screenshot of what LibreOffice Writer shows on screen:

Screenshot 2022-01-18 at 13.58.48

What are “PT chars”? Googling doesn’t give any reasonable answer (is that about Portuguese?). Attaching a sample source file would likely help much more than a screenshot.

Here is the document:

I am referring to the Portuguese characters, which are shown incorrectly in the document below under LibreOffice and are also converted incorrectly when using the soffice.com executable. This is a sample document, as the final document has customer data.

_tmp.docx (25.9 KB)

I opened the source document in LO 7.2.5.2, and it showed exactly the same problem. So, it is about your file.

pt_sample

Further, here is what MS Word 2016 shows for that file:

It is obvious that the file itself is corrupt. And indeed, the word/document.xml contains non-UTF-8-encoded octets:

(note that the XML declaration of that file claims it’s UTF-8 <?xml version="1.0" encoding="UTF-8" standalone="yes"?>).
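One way to test a claim like this without relying on what an editor displays is to unpack the .docx (it is a ZIP package) and try a strict UTF-8 decode of word/document.xml. The following is a minimal Python sketch; the function names and the in-memory stand-in package are illustrative assumptions, not part of any LibreOffice or Word tooling.

```python
import io
import zipfile


def document_xml_is_utf8(docx_bytes: bytes, member: str = "word/document.xml") -> bool:
    """Return True if the given part of a .docx package decodes as strict UTF-8."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        raw = package.read(member)
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


def make_package(xml_payload: bytes) -> bytes:
    """Build a tiny in-memory stand-in for a .docx (hypothetical test data only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("word/document.xml", xml_payload)
    return buffer.getvalue()
```

For a real file, one would pass the bytes read from disk, e.g. `document_xml_is_utf8(open("_tmp.docx", "rb").read())`. Passing this check only shows that the bytes are well-formed UTF-8, not that they encode the characters the author intended.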

The document’s docProps/app.xml claims that the generating application was MS Word, but I actually doubt that…

Your original document was damaged at some point during its existence. Taking “CONSTRUÃ>‡Ã)ƒO” as an example and using Alt+X to reveal the encoding, I see that Ã>‡ is exactly 0xC3 0x87, which should display as Ç (U+00C7), and Ã)ƒ is 0xC3 0x83, which should give Ã (U+00C3).

It is likely that, at some stage, the UTF-8 encoding was mistaken for some ISO-8859-x encoding and badly transcoded. Does the document display correctly in Word?

I suppose you meant that Alt+X gave you U+00C3 U+0087, which is not the same as 0xC3 0x87: the latter looks like byte values, while the former are Unicode code points. The internal representation of the text (in the XML files) is four bytes long: C3 83 C2 87, i.e. the UTF-8 sequence for ‘LATIN CAPITAL LETTER A WITH TILDE’ followed by ‘END OF SELECTED AREA’. I was wrong in thinking that the XML is not correctly encoded; yet the second character looks wrong. Possibly there was a faulty Win12xx->UTF-8 conversion at some stage (using Win12xx byte values as Unicode code points?)…
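The byte arithmetic can be reproduced in a few lines of Python. This is only a sketch of the suspected mechanism, assuming each UTF-8 byte was taken as a code point of the same value (which is what Python's "latin-1" codec does) and the result was then encoded as UTF-8 again; the helper names are illustrative.

```python
def double_encode(text: str) -> bytes:
    """Simulate the suspected corruption of correctly typed text."""
    utf8_bytes = text.encode("utf-8")              # e.g. U+00C7 -> C3 87
    as_code_points = utf8_bytes.decode("latin-1")  # each byte becomes U+00xx
    return as_code_points.encode("utf-8")          # U+00C3 U+0087 -> C3 83 C2 87


def describe(text: str) -> list:
    """List each character as 'U+XXXX' for inspection."""
    return [f"U+{ord(ch):04X}" for ch in text]


if __name__ == "__main__":
    damaged = double_encode("\u00c7")              # start from a correct Ç
    print(damaged.hex(" ").upper())                # C3 83 C2 87
    print(describe(damaged.decode("utf-8")))       # ['U+00C3', 'U+0087']
```

The four bytes are valid UTF-8, which is why the XML parses, but they decode to two characters instead of the single Ç that was originally typed.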

Yes, I over-simplified, assuming the OP is not a computer techie. The byte value has been turned into a Unicode character, which was in turn encoded in UTF-8. In short, a real mess.
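If the damage really is a single round of "byte value turned into a Unicode character", it can in principle be reversed on the extracted text by mapping each character back to its byte and decoding those bytes as UTF-8. A minimal Python sketch under that assumption follows; `repair_double_encoding` is a hypothetical helper name, and if the text had instead passed through a Windows-125x code page, a different codec than "latin-1" would be needed.

```python
def repair_double_encoding(text: str) -> str:
    """Undo one round of 'UTF-8 bytes read as Latin-1'.

    Returns the input unchanged when the text does not fit that pattern.
    """
    try:
        # Characters U+0000..U+00FF map back to single bytes of the same value.
        original_bytes = text.encode("latin-1")
        # Those bytes should then form valid UTF-8 if the assumption holds.
        return original_bytes.decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


if __name__ == "__main__":
    damaged = "CONSTRU\u00c3\u0087\u00c3\u0083O"
    print(repair_double_encoding(damaged))  # CONSTRUÇÃO
```

The fallback keeps already-correct text untouched in the common case, since correctly written accented text rarely forms valid UTF-8 when its characters are read back as bytes. It is still a heuristic, so the result should be reviewed before the document is regenerated.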