Lost PUA chars with xlsx/ods formats

Calc opening/saving as xlsx/ods format will lose PUA characters.

Unicode PUA zone has 3 area:

  • U+00E000-U+00F8FF Private Use Area
  • U+0F0000-U+0FFFFF Supplementary Private Use Area-A
  • U+100000-U+10FFFF Supplementary Private Use Area-B

but from original file saved as ODS, the PUA chars will be removed. (after opening file)

I lost a lot of data because of this issue

The same problem occurs when accessing XLSX files

can anyone tell me the solution?

evo: Windows10/LibreOffice 7.6.2.1 x64

test file: PUAtest.xls (137 KB)

File a bug report.

thaks. the issue are created.
https://bugs.documentfoundation.org/show_bug.cgi?id=157929

I hope have easy way to solve the issue.

Your “weird” characters represented as empty rectangles are not in the PUA. Opening your document in 7.5.7.1 under Fedora 38 (Linux) shows they are in fact surrogate pairs.

  • in A9 U+DDBF U+D858
  • in A10 U+ddf0 U+D868

The codepoint in A9 is U+261BF, in A10 U+2A1F0, both of which are not in any PUAs. You should check your encoding. Prefer an explicit direct UTF-8 instead of an indirect UTF-16 surrogate pair. UTF-16 has endianness issues when document is exported. Anyway, the surrogate pair is not recognised as such. If I switch the members, I get a better result consisting of a single X-crossed rectangle (meaning the pair has been correctly interpreted but glyph was missing).

4 Likes

I believe that checking encoding, and choosing the encoding in file formats like XLS(X) is not something a user can do. These formats have own ways of storing Unicode. E.g., OOXML uses ST_Xstring (escaped scting) of the form _xDDBF__xD858_ (for A9), representing each UTF-16 code unit as a _xHHHH_.

Curiously, Excel itself shows two squares for these cells, showing that it doesn’t consider this sequence as a valid surrogate pair. And you are completely correct, that this sequences are inverted - the high surrogate (going first) must be in range D800-DBFF, and the low surrogate is DC00 to DFFF. Endianness has no effect here, because UTF-16 code units order is fixed, and only bits inside the units themselves are affected by endianness.

1 Like

so… is not standard PUA chars.

I guess the original file was corrupted by a problematic exporter. (by source website)

Clac will automatically delete these incorrectly encoded characters when accessing Ods/Xlsx format.

but I want to convert to ODS format and keep all miss encoded PUA characters instead of removing them.

Is there any way to accomplish this in CALC ?

You filed a bug. It will be fixed, when (if) it will be fixed.

1 Like

I opened your file with Calc. Data is not deleted. Your cells show “blank” perhaps because the font used for display has no “missing char” glyph or this glyph is mapped to blank or void.

I saved the .xlsx file as .odt and your faulty characters are still there.

If you are certain that these faulty characters are supposed to be ones from PUAs, then something went wrong while creating the original file.

  • the decoded codepoint is not a PUA member, so find out which one should be used (give it as an hexadecimal codepoint)
  • the surrogate pair is given in the wrong order

Solve first these two issues. Until your file is not correct, nothing can be told about Calc behaviour.

Hmm. I can’t repro, both 7.0 and 7.6 drop them on save-ODF-and-reload.

Apologies.
Re-did the test. I probably made a confusion in the numerous open windows at this time. The surrogate pairs are effectively eliminated during save. If I switch the pair members, reload is OK.

Apparently, the culprit is the wrong order in the pairs. Since this is a faulty Unicode sequence, it is not kept while saving.

IMHO, the first step towards solution is to fix the wrong surrogate order in the original file or whatever process created it.

1 Like

I try “save as ods” step over 20 times, I’m pretty sure those incorrect characters will be removed by Calc. (at my PC)

after saving Ods format, these chars still there because the data still in memory.

when Calc be closed and reopen the Ods file, the data disappears.

the same goes for saving files in Xlsx/Csv format.

source website repair exporter is the finally solution.

I do need to keep these incorrect characters now, I just want to demo that even the data included errors.

This is not a solution to the initial question. Delete your non-answer.

In addition there is no description of your problem in your post. Either it is related to the present one and you need more detailed information, then comment the appropriate post; or it is a new question, then ask your own, eventually linking to this one.

In any case, mention your OS name, LO version and save format.