How can I select all Japanese text at once in a document that contains Japanese and computer code?

Sc123 · April 19, 2021, 2:36pm

I have a Writer document that is ripped from a game file. It has character dialogue in the form of Japanese text, and also a lot of computer code. I am interested in selecting all the Japanese text at once and copying it to another document to translate it. Please let me know how/if this is possible.

anon73440385 · April 19, 2021, 2:52pm

If your paragraphs are defined using the proper language, use Edit -> Find and Replace, click the Format button, select the language you want to find in the Font tab, and click the Find All button. If your paragraphs are not properly styled, I can’t see a way (maybe the Attributes button in Edit -> Find and Replace could give a clue).

Sc123 · April 19, 2021, 3:12pm

I did as you said, and it says “search key not found”. Does that mean the paragraphs are not properly styled as you said?

mikekaganski · April 19, 2021, 3:46pm

question/223178

Sc123 · April 19, 2021, 3:57pm

@mike kaganski Well I’m trying to copy/cut and paste the Kanji, not delete them. Also I’m not sure if that accounts for Hiragana/Katakana? They are not found in China. Finally, it says “search key not found” when searching with the “[一-龥]” mentioned in that question.

mikekaganski · April 19, 2021, 4:17pm

I’m trying to copy/cut and paste the Kanji, not delete them

Something tells me that both to copy, and to delete, one needs to select first. With some imagination, one could find hints how to select things there in the referenced article.

Also I’m not sure if that accounts for Hiragana/Katakana? They are not found in China

Please take a look at my answer there, where I discuss the character classes. And there are references to Unicode database with discussion of character properties.

Finally, it says “search key not found” when searching with the “[一-龥]” mentioned in that question.

Have you used regular expressions, as mentioned there?
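As an illustration of the character-class idea discussed above, here is a minimal Python sketch that collects runs of kanji, hiragana and katakana. The ranges are an assumption based on the Unicode blocks Hiragana (U+3040 to U+309F), Katakana (U+30A0 to U+30FF), CJK Unified Ideographs (U+4E00 to U+9FFF) and CJK Symbols and Punctuation (U+3000 to U+303F); the function name `japanese_runs` is made up for this example, and the ranges may need widening for a particular script.

```python
import re

# Assumed ranges: CJK punctuation, hiragana, katakana, common kanji.
# Half-width katakana and rarer ideographs are NOT covered here.
JAPANESE_RUN = re.compile(
    r"[\u3000-\u303F\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]+"
)

def japanese_runs(text: str) -> list[str]:
    """Return every maximal run of Japanese characters found in text."""
    return JAPANESE_RUN.findall(text)

sample = 'say("こんにちは、世界。"); goto 12'
print(japanese_runs(sample))
```

The same bracket expression, with "Regular expressions" enabled, is the kind of pattern a Find All would need in order to select the text instead of deleting it.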

ajlittoz · April 19, 2021, 2:53pm

If your document is ripped from a game file, it is likely a .txt file without any embedded formatting directive. I hope it is UTF-8 encoded. It could use other encodings, such as those based on JIS X 0201 or JIS X 0208.

I assume it is UTF-8 but the basic principle can be adapted to other encodings.

The easiest tool for your job is a text editor (the kind of editor used for programming) with regular expression find & replace. Don’t use Writer, which is not versatile enough for that.

Computer code usually uses ASCII characters (U+0000 to U+007F). Your Japanese dialogues will normally not contain ASCII characters (they can of course, but you’ll have to check afterwards if your extraction seems to be lacking elements). Thus, you can use regexp to delete all ASCII characters from the text. What is left should be your non-Latin dialogues.
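The "delete all ASCII" idea can be sketched in Python as follows. This is only an illustration under the assumption that the text is already available as a decoded string; the helper names `strip_ascii` and `tidy` are invented for the example.

```python
import re

# Printable ASCII from EXCLAMATION MARK (U+0021) to TILDE (U+007E).
ASCII_PRINTABLE = re.compile(r"[!-~]+")

def strip_ascii(text: str) -> str:
    """Delete printable ASCII; spaces and line breaks are left in place."""
    return ASCII_PRINTABLE.sub("", text)

def tidy(text: str) -> str:
    """Second pass: trim leftover spaces and drop lines that became empty."""
    lines = (line.strip() for line in text.splitlines())
    return "\n".join(line for line in lines if line)

sample = 'msg "こんにちは" 0x1F\njmp 3\nmsg "さようなら"'
print(tidy(strip_ascii(sample)))
```

Any ASCII digits or Latin letters that legitimately occur inside a dialogue line are lost with this approach, so the result should be checked against the original.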

To show the community your question has been answered, click the ✓ next to the correct answer, and “upvote” by clicking on the ^ arrow of any helpful answers. These are the mechanisms for communicating the quality of the Q&A on this site. Thanks!

In case you need clarification, edit your question (not an answer which is reserved for solutions) or comment the relevant answer.

Sc123 · April 19, 2021, 3:01pm

Let me clarify: The file was originally in “.mes” format, not “.txt”. The encoding is Shift-JIS. I selected that encoding in Notepad++, and copy/pasted all the text to Writer. This translation will be text only (i.e. a script to be posted online at a later date), as the game uses many proprietary formats and I have not been able to successfully repack the data, so I do not need the code intact. Since Notepad++ was the editor I used, and considering my clarifications, should I ask their help section?

ajlittoz · April 19, 2021, 3:37pm

I don’t use Window$, but I think Notepad++ has regexp capability. If you can feed it something equivalent to [!-~], i.e. “EXCLAMATION MARK to TILDE”, and replace with “nothing”, that should do the trick. Everything outside that range is kept, including spaces and all controls like linefeed and return.

If your dialogues do not contain spaces (I have no idea how kanji, hiragana or katakana are formatted), you can use [ -~] to delete spaces as well. Similarly, if you want to keep some punctuation, adapt the content of the set.

Erratum: Shift JIS is an 8-bit encoding. This could make things easier. The regexp is [!-\[\]-}] to keep the yen sign at 0x5C and the overline at 0x7E. Again, this regexp keeps controls and spaces, which can be removed in a second step.
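For readers who would rather script this than use an editor, here is a hedged Python sketch of the erratum's idea: decode the bytes as Shift JIS, then delete printable ASCII except the code points at 0x5C and 0x7E. The function names are hypothetical, and the pattern is one possible way to write "everything from ! to } except 0x5C", not necessarily what Notepad++ expects.

```python
import re

# '!'..'[' (0x21-0x5B) and ']'..'}' (0x5D-0x7D): printable ASCII
# without 0x5C (yen sign in Shift JIS) and 0x7E (overline in Shift JIS).
ASCII_EXCEPT_5C_7E = re.compile(r"[!-\[\]-}]")

def decode_shift_jis(raw: bytes) -> str:
    """Decode Shift JIS bytes; undecodable bytes become U+FFFD."""
    return raw.decode("shift_jis", errors="replace")

def strip_keep_yen_overline(text: str) -> str:
    """Delete printable ASCII but keep 0x5C, 0x7E, spaces and controls."""
    return ASCII_EXCEPT_5C_7E.sub("", text)

raw = "こんにちは".encode("shift_jis")  # stand-in for bytes read from the file
print(strip_keep_yen_overline('show "' + decode_shift_jis(raw) + '" 100'))
```

Note that after decoding, the characters at 0x5C and 0x7E may appear as backslash and tilde rather than yen sign and overline, depending on the decoder.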

Lupp · April 19, 2021, 4:19pm

If the text is already loaded into a Writer document, the encoding is no longer a problem.
Assuming any sequence of characters with code points above U+00FF is “exotic” (Japanese in this case), but that such text may contain one or two consecutive ASCII characters (like ordinary punctuation), Writer could find it with F&R/RegEx ([^\u0000-\u00FF]|(?<=[^\u0000-\u00FF])..?(?=[^\u0000-\u00FF]))*. Longer sequences of ASCII characters (more than two) would be excluded.
Conversely, sequences of three or more characters in the U+0000 to U+00FF range (ASCII and Latin-1) would be found by ([\u0000-\u00FF]){3,}
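To try these two expressions outside Writer, here is a small Python sketch. It assumes Python's `re` module behaves like Writer's regex engine for these constructs, and it uses `+` instead of the trailing `*` so that empty matches are not reported; the variable names are invented for the example.

```python
import re

NON_LATIN1 = r"[^\u0000-\u00FF]"

# "Exotic" characters, allowing one or two other characters in between
# as long as they are surrounded by exotic characters on both sides.
EXOTIC_RUN = re.compile(
    rf"(?:{NON_LATIN1}|(?<={NON_LATIN1})..?(?={NON_LATIN1}))+"
)

# The converse: runs of three or more characters in U+0000..U+00FF.
LATIN1_RUN = re.compile(r"[\u0000-\u00FF]{3,}")

sample = "call(3) はい!?いいえ ret"
print(EXOTIC_RUN.findall(sample))
print(LATIN1_RUN.findall(sample))
```

In the sample, the two-character "!?" stays inside the Japanese run because it is enclosed by Japanese characters, while the longer code fragments on either side are reported by the second pattern.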