Chinese text file doesn't display correctly on opening in Writer

When opening a Chinese Traditional text file with Writer on LO 4.0.2.2 Linux, the characters don’t display correctly, even if I ‘select all’ and change the font to a Chinese font such as SimSun or SimHei.

The file opens correctly in GEdit and Chinese text displays, using the system default font ‘Monospace’, which is DejaVu Sans Mono.

Initially I thought that this was because Chinese fonts were not correctly installed. But the following happens too:

  • If I copy and paste all the text from Gedit into a Writer document, Chinese characters do display correctly, apparently in ‘Times New Roman’.

  • If the text file is imported into a Calc spreadsheet, the ‘encoding’ is recognised as ‘Unicode’, and the characters do display correctly.

Asian language options are selected in Tools, Options, Language Settings. However, this doesn’t seem to matter: on LO 4.0.2 with Windows, the file can be opened and displayed with Writer even though these are not selected! So this seems to be related to LO configuration on Linux.

A sample file is attached. It is a .zip file (to preserve the original file), but I had to give it a fake .jpg extension name, as .zip is not an allowed extension type!

New sample (original unedited file as received) is attached:
C:\fakepath\fullsamplefile.zip.jpg

$ file Full\ sample\ Chinese.txt
Full sample Chinese.txt: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators

@emulti – Upvote = 10 “karma points” to upload a sample file …

@emulti this is possibly an encoding issue similar to that which we are attempting to sort out in this thread. As @manj_k indicates, a sample file would be a great help.

See also → Bug 63673 - FILEOPEN: Unicode text encodings not auto-recognised in LO4.0.2 Linux (OK in Windows version?)

Thanks for the example file. Your problem is one of character encoding. Here is a quick test:

$ file -bi Chinese\ sample.txt 
text/plain; charset=iso-8859-1

ISO-8859-1 is an 8-bit encoding that covers a selection of Western European languages; it cannot represent Chinese characters. I recommend you use UTF-8 in your files to obtain Chinese support. Here is an example file (ODT), encoded in UTF-8, taken from the Chinese version of the linked Wikipedia page.
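To illustrate the point about encodings, here is a minimal Python sketch (not anything LO itself runs). The two-character sample string is an assumption chosen for the demo; it is not taken from the attached file.

```python
# Two Chinese characters (U+4E2D, U+6587) used purely as sample data.
text = "\u4e2d\u6587"

# UTF-8 and UTF-16 can both represent the text.
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# ISO-8859-1 only covers code points U+0000 to U+00FF, so encoding fails.
try:
    text.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(len(utf8_bytes), len(utf16_bytes), latin1_ok)
```

So a file that really contains Chinese text cannot be ISO-8859-1; a tool reporting that charset is most likely guessing from bytes it could not classify.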

To be clear, the text does not display Chinese characters here under gedit.

EDIT: That second example is now showing up as UTF-8 and displaying the Chinese characters as expected. I think this confirms it was a character encoding issue.

2nd EDIT: The third example (as indicated in the comments below) is UTF-16LE encoded. Writer displays it correctly if you pick the file type explicitly when opening the text file, using either the “Select which types of files are shown” pull-down list or the File Type selection, and then choose the appropriate filter options.
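If you would rather follow the UTF-8 recommendation than pick the filter by hand each time, re-encoding the file is one option. Below is a rough Python sketch, assuming the source really is UTF-16 with a byte order mark. The function name and the file names are hypothetical, and I have not verified how every LO version handles the result.

```python
import tempfile
from pathlib import Path


def reencode_to_utf8(src: Path, dst: Path, src_encoding: str = "utf-16") -> None:
    """Re-encode a text file to UTF-8 (hypothetical helper for this thread).

    Working on raw bytes avoids newline translation, so CRLF / CR line
    terminators are preserved exactly as they were.
    """
    text = src.read_bytes().decode(src_encoding)  # "utf-16" honours and strips the BOM
    dst.write_bytes(text.encode("utf-8"))


# Demo on a throwaway file; the names are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "sample-utf16.txt"
    dst = Path(tmp) / "sample-utf8.txt"
    # BOM (FF FE) followed by little-endian UTF-16 data, like the attached file.
    src.write_bytes(b"\xff\xfe" + "\u4e2d\u6587\r\n".encode("utf-16-le"))
    reencode_to_utf8(src, dst)
    converted = dst.read_bytes()

print(converted)
```

The output file has no BOM; whether that matters depends on the program reading it.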

I think that maybe Gedit (which I used to extract the sample) is saving with that encoding. I have added the original file as received into the zip file (named as .jpg) instead.

$ file Full\ sample\ Chinese.txt
Full sample Chinese.txt: Little-endian UTF-16 Unicode text, with CRLF, CR line terminators

Possibly the issue is one of character encoding, but in that case, why does the file open correctly in LO on Windows, but not in LO on Linux? According to the ‘file’ command, it is “Little-endian UTF-16 Unicode text, with CRLF, CR line terminators”.

Some progress: double-clicking the file opens it in Writer as ‘text’, and the Chinese characters are not shown correctly. However, if File > Open is used and the ‘file type’ filter is set manually to ‘Text Encoded’, it is possible to select ‘Unicode’ as the Character Encoding (and also whether the Line Separator includes a Line Feed, as in Windows text files). In this case, the file is displayed correctly. So the ‘Byte Order Mark’ (the first 2 bytes of the file) is not automatically recognised on LO Linux.
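For reference, checking for a Byte Order Mark by hand is straightforward. This is only a sketch in Python of what BOM-based detection can look like, not a description of how LO does (or should do) it. The function name `sniff_bom` and the sample bytes are assumptions for the demo.

```python
import codecs

# Order matters: the UTF-32 LE BOM (FF FE 00 00) begins with the same two
# bytes as the UTF-16 LE BOM (FF FE), so the longer marks are tested first.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def sniff_bom(data: bytes):
    """Return a Python codec name if data starts with a known BOM, else None."""
    for bom, codec in _BOMS:
        if data.startswith(bom):
            return codec
    return None


# Sample bytes built for the demo: FF FE, then little-endian UTF-16 text.
sample = codecs.BOM_UTF16_LE + "\u4e2d\u6587\r\n".encode("utf-16-le")
codec = sniff_bom(sample)
decoded = sample.decode(codec)  # the "utf-16" codec consumes the BOM itself
print(codec, repr(decoded))
```

On the real file, passing the first four bytes (for example `open(path, "rb").read(4)`) is enough for the check.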

Confirmed. Sorry, my initial reply / test was erroneously done using an old version of LO. GEdit opens the file fine via double-click; however, LO v4.0.2.2 relies on the filter facility to determine the encoding. The default behaviour (apparently now) is to ignore the encoding information, or at least to rely on it being set manually via the File Display / File Type pull-down list. I’ve updated the answer above to reflect these findings. I used File Display rather than File Type, but the result is the same.