How can I search and replace Chinese characters in a multilingual document?

russelld · December 31, 2019, 12:59am

Hi Libreofficers!

I have a spreadsheet with cells that have both English alphabet and Chinese characters in the individual cell contents.

How can I remove the Chinese characters without removing the English alpha characters?
Also, how can the same be done to remove the English alpha characters while leaving the Chinese characters?

image of spreadsheet with highlighted cells that need this to occur:

this is with LibreOffice Version: 6.2.8.

thanks in advance for your help!

gabix · December 31, 2019, 6:20am

You have to learn about regular expressions (and this how-to may be useful, too).

In particular, to remove any Chinese hanzi (or Japanese kanji) character, enable regular expressions and use the following find string:

[一-龥]

and leave the replace field empty.
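
For anyone who wants to check the range outside LibreOffice, here is a minimal Python sketch of the same idea. It assumes only that the endpoints of the range are U+4E00 (一) and U+9FA5 (龥); the helper name strip_hanzi and the sample string are made up for illustration.

```python
import re

# [一-龥] written with explicit code points: U+4E00 (一) to U+9FA5 (龥).
HANZI = re.compile("[\u4e00-\u9fa5]")

def strip_hanzi(text: str) -> str:
    """Remove every character in U+4E00..U+9FA5 (hypothetical helper)."""
    return HANZI.sub("", text)

sample = "Apple 苹果 (red)"
print(repr(strip_hanzi(sample)))
```

Note that the space that separated the two words stays behind, so the result can contain doubled spaces.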

russelld · December 31, 2019, 10:00am

Dear gabix,
Thank you for the elegant solution to the removal of Chinese characters!
This worked perfectly on the sample cells and the links to regular expressions are very helpful.
thanks again

r

mikekaganski · December 31, 2019, 7:04am

Using the regular expression syntax supported by LibreOffice, you can define sets of matching characters (using square brackets []). In the set, you may put character ranges and character properties. You can then search the Unicode Character Database for the properties you need, and also examine the list of character properties.

I’m not an expert here, but a brief look at these gives this first proposal: search for Ideographic characters, and for the block of Halfwidth And Fullwidth Forms (which includes punctuation used in CJK, like ？ and ，). The former is \p{Ideographic}; the latter is either \p{Block=Halfwidth_And_Fullwidth_Forms} (\p{Block=Half_And_Full_Forms}), or [\uFF00-\uFFEF]. So, the shortest combined regex for these would be

[\p{Ideographic}\uFF00-\uFFEF]

Searching for it requires that [x] Regular expression be checked in the Find & Replace dialog.
Note that halfwidth/fullwidth punctuation would not be found unless [x] Match character width is checked in the dialog, too. With both checked, put the regular expression above into the Find: box, leave the Replace: box empty, and click the Replace All button to remove all these characters from the document.
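
The same combined class can be tried in Python as a rough sketch. Python’s built-in re module has no \p{Ideographic}, so the ranges below are an assumption standing in for that property, not an exact equivalent; the helper name strip_cjk is hypothetical. The Match character width option belongs to the LibreOffice dialog and has no counterpart here.

```python
import re

# Approximation of [\p{Ideographic}\uFF00-\uFFEF] with explicit ranges
# (an assumption; the real Ideographic property covers more than this):
#   U+3400..U+4DBF    CJK Unified Ideographs Extension A
#   U+4E00..U+9FFF    CJK Unified Ideographs
#   U+F900..U+FAFF    CJK Compatibility Ideographs
#   U+20000..U+2FA1F  supplementary-plane ideographs (rough bounds)
#   U+FF00..U+FFEF    Halfwidth and Fullwidth Forms block
CJK = re.compile(
    "[\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff"
    "\U00020000-\U0002fa1f\uff00-\uffef]"
)

def strip_cjk(text: str) -> str:
    """Remove ideographs and fullwidth/halfwidth forms (hypothetical helper)."""
    return CJK.sub("", text)

sample = "如何删除中文字符？Keep this, please."
print(repr(strip_cjk(sample)))
```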

But this has at least one problem: Chinese text might have non-Chinese parts. E.g., google-translating your question to experiment with:

How can I remove the Chinese characters without removing the English alpha characters? Also how the same be done to remove the English alpha characters and leaving the Chinese characters?

I got this:

如何在不删除英文字母字符的情况下删除中文字符？另外，如何删除英文字母字符并保留中文字符呢？

It contained Ideographic characters, full-width punctuation, but also one normal space character. Performing the replacement described above in this text would leave this space. That may not be a problem for spaces (one might then check for leading/trailing/repeated spaces); but there’s a possibility that characters from general-use blocks were also used for punctuation, or numbers. I cannot advise how to overcome this.
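
The follow-up check for leading/trailing/repeated spaces could look like this in Python. It is only a sketch: the helper name tidy_spaces is made up, and it handles plain U+0020 spaces only, not tabs or other whitespace.

```python
import re

def tidy_spaces(text: str) -> str:
    """Collapse runs of spaces to one and trim both ends (hypothetical helper)."""
    collapsed = re.sub(r" {2,}", " ", text)  # repeated spaces -> single space
    return collapsed.strip(" ")              # drop leading/trailing spaces

leftover = " Apple  (red) "
print(repr(tidy_spaces(leftover)))
```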

And taking into account the problem mentioned above, I don’t know how to perform the opposite task: remove everything non-Chinese. Simply negating the regular expression above to be

[^\p{Ideographic}\uFF00-\uFFEF]

would also find (and remove) all non-full/halfwidth punctuation, numbers, spaces, and who knows what from Chinese text.
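
A small Python sketch shows the effect. As an assumption, \p{Ideographic} is replaced by the single range U+4E00..U+9FFF, because Python’s built-in re module does not support that property; the helper name and sample text are made up.

```python
import re

# Negated class: everything that is NOT in the (approximated) CJK ranges.
NOT_CJK = re.compile("[^\u4e00-\u9fff\uff00-\uffef]")

def keep_cjk_only(text: str) -> str:
    """Delete every character outside the listed ranges (hypothetical helper)."""
    return NOT_CJK.sub("", text)

sample = "第3章 (2019): 你好, world!"
print(keep_cjk_only(sample))
```

The digit 3 inside 第3章 is removed together with the English text, which is exactly the kind of collateral damage described above.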

Another approach would be to search by text language, but in my testing I didn’t find a way to find Chinese text using this method, even in Writer (while the topic here is Calc).

gabix · December 31, 2019, 8:58am

remove everything non-Chinese

As I understand the task, it is not about everything non-Chinese, but about English alpha characters. Assuming that the asker means plain letters of the English alphabet, the find string is merely

[a-zA-Z]

Of course, one can use [:alpha:] etc.
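
As a quick illustration of the [a-zA-Z] find string, here is a Python sketch with a made-up sample cell and a hypothetical helper name:

```python
import re

ASCII_LETTERS = re.compile("[a-zA-Z]")

def strip_ascii_letters(text: str) -> str:
    """Remove the 52 plain English letters only (hypothetical helper)."""
    return ASCII_LETTERS.sub("", text)

sample = "Apple 苹果 (red/green)"
print(repr(strip_ascii_letters(sample)))
```

Only the letters go away; spaces and any other non-letter characters in the cell are left in place.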

If, however, you really need everything non-Chinese, something like:

[0-𒊇]

might be used.

mikekaganski · December 31, 2019, 9:06am

As I understand the task, it is not about everything non-Chinese, but about English alpha characters

… which would keep parentheses, dots, and slashes in the screenshotted part (e.g., D1; all cells in column E…)

[0-𒊇]

… which includes ~all common Chinese characters (𒊇 being U+12287, which is non-BMP): e.g., 不 U+4E0D.

[:alpha:]

… matches any alphabetic character, including Chinese
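
The last two points can be checked by comparing code points. The sketch below uses Python, whose str.isalpha() is only an analogy for [:alpha:] in LibreOffice (an assumption, since the two engines are not the same), but the code point arithmetic holds regardless of the tool.

```python
# Endpoints of the range [0-𒊇] and one common hanzi.
low = ord("0")                # U+0030
high = ord("\U00012287")      # U+12287, outside the BMP
bu = ord("不")                 # U+4E0D

# A character class range matches every code point between its endpoints.
in_range = low <= bu <= high
print(hex(low), hex(bu), hex(high), in_range)

# Python's Unicode-aware alphabetic test also accepts the hanzi.
print("不".isalpha())
```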

gabix · December 31, 2019, 9:17am

Of course. The question is: what does the asker mean?

By the way, I need to correct the find string for everything non-Chinese; it should be:

[0-➾]

I have tested it on a real text: it removed all Cyrillic/Latin letters and common European punctuation marks, leaving only tabs and CJK characters. Well, perhaps it may need some further adjustment. Regular expressions are a fun toy to experiment with.
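
For further adjustment, it may help to probe which characters the range covers. Here is a Python sketch, assuming only that the endpoints are U+0030 (0) and U+27BE (➾); the helper name and sample string are made up.

```python
import re

# [0-➾] written with explicit code points: U+0030 to U+27BE.
RANGE = re.compile("[\u0030-\u27be]")

def strip_range(text: str) -> str:
    """Remove every character in U+0030..U+27BE (hypothetical helper)."""
    return RANGE.sub("", text)

sample = "Привет, world (v2.0)!\t你好"
print(repr(strip_range(sample)))
```

In this sketch, characters below U+0030 (space, comma, period, parentheses, exclamation mark) fall outside the range and survive along with the tab and the CJK characters, so they would need a separate range if they should go, too.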

russelld · December 31, 2019, 10:03am

Mike Kaganski,
Thank you for your detailed explanation.
I can now see how to tackle the removal of non-Chinese characters using regular expressions.

r