# Do regular expressions understand different languages in a document?

I have perhaps hundreds of English commas left over after bulk word replacements. Because I replaced English words (typically proper nouns) with Persian (CTL) words, the commas left between those words are English commas (,), but I need Persian commas (،) between Persian words. Simply replacing every English comma with a Persian one is not a good solution, because I have English footnotes whose English commas must be preserved. Can regular expressions replace only the English commas that sit between Persian (CTL) words, and leave the ones between English words alone? I hope the following pictures clarify what I mean.

Figure 1. English words before initial replacement

Figure 2. Commas left over after the English words were replaced with Persian words (note that the year numeral changed according to the LO context settings)

Figure 3. The highlighted comma is the desired type of comma (here added manually)


I don't think it is possible to search by language, but you can search by script using regular expressions. Something like this, which uses look-ahead and look-behind assertions, should work:

(?<=[\p{script=YOUR-SCRIPT}]), (?=[\p{script=YOUR-SCRIPT}])


where YOUR-SCRIPT is the script of the characters on either side of the comma you want to match (latin, cyrillic, arabic, etc.; Persian is written in the Arabic script, so use arabic here). Notice the space after the comma.
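The same look-behind/look-ahead idea can be tried outside LibreOffice. The following is only a rough sketch in Python: the standard `re` module has no `\p{script=...}` property, so it assumes the Arabic Unicode block (U+0600 to U+06FF) is a good enough stand-in for the Arabic script. The names `fix_commas`, `ARABIC` and `PERSIAN_COMMA` are made up for this illustration.

```python
import re

# Assumption: the Arabic Unicode block approximates ICU's \p{script=arabic}.
# It covers the common Persian letters and digits, but not every character
# that ICU would classify as Arabic script.
ARABIC = "\u0600-\u06FF"
PERSIAN_COMMA = "\u060C"

# An English comma plus a space, with an Arabic-block character directly
# before the comma and directly after the space.
BETWEEN_ARABIC = re.compile(rf"(?<=[{ARABIC}]), (?=[{ARABIC}])")


def fix_commas(text: str) -> str:
    """Swap English commas for Persian ones only between Arabic-script characters."""
    return BETWEEN_ARABIC.sub(PERSIAN_COMMA + " ", text)


if __name__ == "__main__":
    sample = "سعدی, حافظ / Smith, Jones"
    print(fix_commas(sample))
```

Commas that have a Latin character on either side are left untouched, which is what keeps the English footnotes intact.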

See the ICU Regular Expressions documentation for more details on these expressions.


When you have to do this kind of stuff, your first job is to make a copy of the original file, so that if something goes wrong, you still have the original. The next thing is to plan ahead. Do things one at a time and note any important effects that you might not have thought of. I've been there. :) Then you may find that it's better to do this job in a few steps. First find all occurrences of opening bracket, author name, comma, year. In regexp that should be this:

\(([A-Z][^,]*), ([0-9]{4})

If you want to include the closing bracket, you need to do more, of course. In regexp, brackets enclose blocks that you want to treat as a whole, so when you want to match a literal bracket, it has to be escaped with a \.

Now for the coding part: in the Replace box enter (|||| represents your code, a combination of letters that does not occur anywhere in the text)

($1||||$2

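This will leave the name and year alone but replace the comma and the space after it with the code. Now you can (batch) replace the author names and years, and then replace the special codes with Persian commas. Because the search pattern also consumes the space after the comma, replace each code with a Persian comma followed by a space.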
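To see the placeholder steps end to end, here is a small Python sketch of the same idea. It is not what LibreOffice runs internally, and the helper names (`mark_commas`, `finish`) and the sample name mapping are invented for the example. It assumes `||||` does not occur anywhere in the text.

```python
import re

PLACEHOLDER = "||||"  # assumed not to occur anywhere in the document
PERSIAN_COMMA = "\u060C"

# Literal opening bracket, author name (group 1), comma + space, year (group 2).
CITATION = re.compile(r"\(([A-Z][^,]*), ([0-9]{4})")


def mark_commas(text: str) -> str:
    """Step 1: replace the citation comma (and its space) with the placeholder."""
    return CITATION.sub(
        lambda m: f"({m.group(1)}{PLACEHOLDER}{m.group(2)}", text
    )


def finish(text: str, names: dict[str, str]) -> str:
    """Steps 2 and 3: swap the author names, then turn placeholders into Persian commas."""
    for english, persian in names.items():
        text = text.replace(english, persian)
    return text.replace(PLACEHOLDER, PERSIAN_COMMA + " ")


if __name__ == "__main__":
    marked = mark_commas("(Smith, 2019) and a note, here")
    print(marked)
    print(finish(marked, {"Smith": "اسمیت"}))
```

Commas outside the bracket-name-year pattern never receive the placeholder, so they stay English.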
