How can I replace non-breaking hyphens?

dragan.marinovic.9 · July 2, 2014, 3:29pm

This is the link:
http://www.authorama.com/we-the-media-3.html

These are the paragraphs:

Journalists can use SMS in any number of ways; again, this is much more common outside the U.S. The first inkling among journalists of China’s SARS epidemic came in an SMS from sources inside the medical profession there. Was this significantly different than simple phone calls in its fundamental nature? Not really. But in a place where being overheard can lead to big trouble, it’s much safer—as long as one’s messages aren’t being intercepted—to simply send a quick SMS.

Over time, perhaps the most important value of SMS will be of the kind described by Howard Rheingold in his prescient book Smart Mobs:52 a self-organizing information system in which individuals and small groups tell each other important news. Rheingold relates, among other examples, how citizens in the Philippines used SMS to organize and overthrow a corrupt government.53 On a more prosaic level, young people in countries with advanced wireless communications have used SMS for social organization. We’re just at the beginning of this tech-nology’s development. As networks and handsets improve, SMS will give way to video messaging, with yet to be understood implications.

Question

I’ve managed to replace Non-breaking space with:
[:space:]
[:space:]+
and
Hyphen / Non-breaking Hyphen with:
[:hyphen:]
[:hyphen:]+

BUT, if you paste those two paragraphs into Writer, you’ll find two words, “significantly” and “countries”, containing a non-breaking hyphen that is “a little more indented” than the “normal” non-breaking hyphen. I’ve tried pasting as unformatted text, but I get the “a little more indented” non-breaking hyphen again.

I haven’t managed to replace it, and I have searched all over the web.

Is there any solution for replacing it, or for not getting it at all?

Regina · July 2, 2014, 7:53pm

It is not a “non-breaking” hyphen but an “optional” hyphen. It is also called “SOFT HYPHEN” or “SHY”. It has code point U+00AD.
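
If the text can be cleaned before it reaches Writer, the character can also be identified and removed outside the Find & Replace dialog. The following is only a sketch in Python, not anything Writer does itself; the helper names and the sample string (including where the soft hyphens sit in each word) are made up for illustration.

```python
import unicodedata

SOFT_HYPHEN = "\u00ad"  # U+00AD, the code point named above


def describe_invisibles(text):
    """List (index, code point, Unicode name) for every format character (category Cf)."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Cf"
    ]


def strip_soft_hyphens(text):
    """Return the text with every soft hyphen removed."""
    return text.replace(SOFT_HYPHEN, "")


# Hypothetical sample: the soft hyphen positions are assumed, not taken from the page.
sample = "signifi\u00adcantly different; coun\u00adtries"
print(describe_invisibles(sample))
print(strip_soft_hyphens(sample))
```

The first function is handy for answering “which character is this?” when something pasted looks odd; the second simply drops the soft hyphens.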

oweng · July 3, 2014, 4:05am

@Regina, how are these characters found? Using \xNNNN for the code-point has never worked as far as I can tell. Seems to be a broken aspect of the ICU regex engine.

Regina · July 3, 2014, 7:42am

Do not use regular expressions, but use the character literally. Right click the input field to get its context menu, click on item “Special character”. Select the character from the table. It is between ¬ and ®.

Regina · July 3, 2014, 5:16pm

Regex itself works with Unicode code points, but not for SOFT HYPHEN. For example, write ABCDEFGHIJKLM. Then you can find the character J with any of these syntaxes: \u004a, \x{004a}, \x4a, or \0112.
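
For comparison only, here is the same idea in Python's `re` module, which is a different engine from the one Writer uses, so it says nothing about which forms Writer accepts. It only illustrates searching by code point; the string "tech\u00adnology" is an assumed sample.

```python
import re

text = "ABCDEFGHIJKLM"

# Two hexadecimal escape forms that Python's re module accepts for U+004A ("J").
patterns = [r"\x4a", r"\u004a"]
positions = [re.search(p, text).start() for p in patterns]
print(positions)  # both patterns find "J" at the same index

# The same approach applied to SOFT HYPHEN (U+00AD) in an assumed sample word.
shy_match = re.search(r"\xad", "tech\u00adnology")
print(shy_match.start())
```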

oweng · July 4, 2014, 1:03pm

@Regina, I will provide a separate answer for regex forms that work here as it takes more space than a comment will allow.

oweng · July 4, 2014, 1:08pm

Under GNU/Linux x86_64 running v4.1.6.2, v4.2.5.2, and v4.3.0.2 (current regular expression engine) these forms work in finding “J”: \x4A, \x{004A}, and \0112. These forms do not: \x004A and \112.

These forms work for soft hyphen (U+00AD / 0173): \xAD and \x{00AD}. These forms do not: \x00AD, \0173, and \173.

Under GNU/Linux x86_64 running v3.5.7.2 (old regular expression engine) these forms work in finding “J”: \x4A and \x004A. These forms do not: \x{004A}, \0112, and \112.

This form works for soft hyphen (U+00AD / 0173): \x00AD. These forms do not: \xAD, \x{00AD}, \0173, and \173.
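
One thing that may be worth checking in the octal tests above: 0112 is the octal value of “J”, but 173 is the decimal value of U+00AD, not its octal value. A quick sketch in Python to compare the notations (this only does the arithmetic; whether an octal form such as \0255 is accepted by either Writer engine has not been tested here):

```python
def notations(ch):
    """Return the hex, decimal, and octal spellings of a character's code point."""
    cp = ord(ch)
    return {"hex": f"{cp:X}", "decimal": str(cp), "octal": f"{cp:o}"}


j_forms = notations("J")
shy_forms = notations("\u00ad")
print("J:", j_forms)
print("SOFT HYPHEN:", shy_forms)

# Read as octal, 173 is a different character altogether.
octal_173 = chr(0o173)
print(repr(octal_173))
```

So a pattern built from 173 as an octal escape would be looking for “{” rather than the soft hyphen.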

As is often the case, there appears to be little that is regular about the accepted expression form. This is also in conflict with what is stated on the List of regular expressions wiki page. Other operating systems may differ again. IMO all the indicated forms should work (although the curly braces are superfluous), regardless of the character in question.

AlexKemp · February 27, 2016, 11:05am