Writer does not differentiate U+22EF and U+2026

I have been asked to post-process a lot of .doc files that contain both U+22EF Midline Horizontal Ellipsis and U+2026 Horizontal Ellipsis. Somehow they are used as markup, and I am supposed to replace them with different things. I tried search and replace, but Writer treats them as the same character. I tried exporting the files to .txt in the hope that I could use sed, but in the resulting .txt file every U+22EF had already been replaced with U+2026.

Is there a setting I am not aware of that makes Writer differentiate them?
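For reference, the replacement step itself is simple once a text export preserves both code points. The following is a minimal Python sketch of what the sed pass would do; the function name and the `<MID>` and `<LOW>` replacement strings are hypothetical placeholders, not anything from the original files.

```python
# Hypothetical replacement markers; substitute whatever the markup should become.
MID_REPLACEMENT = "<MID>"   # stands in for U+22EF MIDLINE HORIZONTAL ELLIPSIS
LOW_REPLACEMENT = "<LOW>"   # stands in for U+2026 HORIZONTAL ELLIPSIS


def replace_ellipses(text: str) -> str:
    """Replace the two ellipsis characters with different strings in one pass."""
    table = str.maketrans({
        "\u22ef": MID_REPLACEMENT,
        "\u2026": LOW_REPLACEMENT,
    })
    return text.translate(table)
```

This only helps if the exported text still contains U+22EF; it does nothing about the conversion problem described above.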

There is no reason for Writer to change characters in a document. If in doubt, check Tools>AutoCorrect>AutoCorrect Options, Options tab, and disable any “suspect” transformation. Since you’re converting .doc files, the conversion process may also alter text. Go to Tools>Options, LibreOffice Writer>Compatibility to check the rules.

Also, are you 200% sure that U+22EF is present in the original document? Put the cursor immediately to the right of a supposed midline ellipsis and press Alt+X to see its code point. Do you read U+22EF? If not, either it is not used in the original file or the DOC input filter mapped it to U+2026. How old are your .doc files? Were they encoded in Unicode or did they use a legacy 256-character encoding? In the latter case, there was only one ellipsis in the repertoire.

Lastly, please mention your OS name and LO version.
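To check the original file independently of the DOC input filter, one rough option is to scan its raw bytes. This is only a heuristic sketch: it assumes the text is stored as UTF-16LE (which .doc files use for Unicode text), it can produce false positives because the same byte pairs may occur in non-text data, and it cannot detect an ellipsis stored in a legacy 8-bit encoding. The function name is hypothetical.

```python
# Characters to look for, keyed by a readable label.
ELLIPSES = {
    "U+22EF": "\u22ef",  # MIDLINE HORIZONTAL ELLIPSIS
    "U+2026": "\u2026",  # HORIZONTAL ELLIPSIS
}


def count_ellipses_utf16le(raw: bytes) -> dict:
    """Count occurrences of each ellipsis' UTF-16LE byte pair in raw bytes."""
    return {
        label: raw.count(ch.encode("utf-16-le"))
        for label, ch in ELLIPSES.items()
    }


# Usage (the path is a placeholder):
# with open("sample.doc", "rb") as f:
#     print(count_ellipses_utf16le(f.read()))
```

A count of zero for U+22EF would suggest the character is not in the original file in that form; a non-zero count is only a hint and should be confirmed with Alt+X.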


Maybe the midline horizontal ellipsis was faked by making a normal ellipsis superscript? If so, run Find and Replace first with the search restricted to the superscript format (automatic superscript), then run Find and Replace again for the remaining ones.


ask113925.odt (10.1 KB)