Using Writer to find or see non-printable characters

doktoroblivion1 · January 12, 2022, 4:20pm

I am a long-time user of LibreOffice and am currently on 7.2 (stable). I have a document that would lend itself well to a find-and-replace task. The problem is that I am not sure which character I want to replace, since I cannot see its actual hex or octal representation.

In the screenshot above, my LO session is on the right, with the text in question marked. The character displayed as a backward P is the one in question. I was under the impression it was a LineFeed character (x0A), so I tried searching for that, but the search only found what looks like a CarriageReturn character, the funny blue character displayed as an arrow pointing down and left.
If I copy the entire text from this table cell and paste it into VIM, I see what is shown on the left of the screenshot above: those characters are LineFeed characters (x0a).
My question is, how do I search for this?
Using a regular expression for x0A does not seem to work: it skips these characters and finds the CRs, but not the real LFs. There may be a further wrinkle in the fact that the heading format before the backward P after the marked section changes to a new format as well. I was able to create a search for that and find it, but find/replace will only let you find and replace a format with another format; it will not let you delete the character (I suspect the odd character is some strange format character of unknown length that I cannot see).
I searched the documentation for help but it is rather vague and not really that helpful in this area. I also searched it to see if there was a hex mode capability or view source mode in LO, but couldn’t find that either, just to make sure what kind of character I was actually looking at, since the backward P is not really helpful at all (would be nice if I could see its hex value if I moused over it, just saying).

Version: 7.2.4.1 (x64) / LibreOffice Community
Build ID: 27d75539669ac387bb498e35313b970b7fe9c4f9
CPU threads: 8; OS: Windows 10.0 Build 19043; UI render: Skia/Raster; VCL: win
Locale: en-US (en_US); UI: de-DE
Calc: threaded

mikekaganski · January 12, 2022, 4:42pm

The pilcrow symbols are not real text characters. In Writer, the text is internally stored as an array of separate Paragraph objects, not as a single flow of text. Each paragraph has its text, but there’s no characters separating or ending paragraphs.

This is, by the way, the reason why our built-in Find & Replace is limited to searching inside a single paragraph, and also why paragraphs are treated specially, in a non-standard way, in our regular expressions.

doktoroblivion1 · January 12, 2022, 6:07pm

Yeah, I think I came to the conclusion there is not much hope for this to work, given that there is so much function embedded in those pilcrow symbols. Thanks!

ajlittoz · January 12, 2022, 4:59pm

What is displayed as pilcrow (the mirrored P) is not a real character in the text flow but a visual clue about an “object” in the document internal representation.

When you copy a part of the text and paste it into a text editor like vim, a conversion occurs because vim does not understand the complex formatting representation in Writer. Text is pasted as “unformatted text”, i.e. a sequence of Unicode characters with all “embellishments” stripped, like the bold markup.

The pilcrow stands for a paragraph mark (and other data). When it is pasted into vim, it is “over-simplified” to a single Unix line end LF = U+000a = LINE FEED. You can’t rely on what vim shows to apply Find&Replace in Writer.
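To make the difference concrete, here is a minimal Python sketch of the idea. It is only a toy model, not Writer's actual internals: the paragraph list, the sample names, and the helpers `find_in_paragraphs` and `paste_as_plain_text` are all made up for illustration. The model assumes that a `\n` inside a paragraph string stands for a manual line break (Shift+Enter), while a paragraph boundary is simply the gap between two list items.

```python
import re

# Hypothetical cell content: each list item is one paragraph.
# A "\n" inside an item models a manual line break (Shift+Enter).
paragraphs = [
    "Father: John Doe",
    "born 1900\nin Springfield",
    "",
    "Mother: Jane Roe",
]


def find_in_paragraphs(pattern, paragraphs):
    """Search each paragraph on its own, never across paragraph boundaries."""
    hits = []
    for index, text in enumerate(paragraphs):
        for match in re.finditer(pattern, text):
            hits.append((index, match.start()))
    return hits


def paste_as_plain_text(paragraphs):
    """Model of pasting into a plain-text editor: each boundary becomes one LF."""
    return "\n".join(paragraphs)


line_break_hits = find_in_paragraphs(r"\n", paragraphs)
plain = paste_as_plain_text(paragraphs)

print(line_break_hits)    # only the manual line break is found
print(plain.count("\n"))  # the plain-text copy contains more LFs than that
```

In this model the per-paragraph search finds a single hit (the manual line break), while the plain-text copy contains four LF characters, because the three paragraph boundaries were flattened to LF on export. That matches the observation that vim shows LFs that a search in Writer does not find.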

The paragraph mark has no escape encoding in Find&Replace because regular expressions are limited in scope to a paragraph’s contents. The start and end (=paragraph mark) locations are boundaries of the scope and cannot be replaced.

There is however an exception: if your Find: criterion is a single $, it will hit on the paragraph mark and you can replace it with anything. In all other cases, $ acts as the end location of the paragraph, the boundary, not the paragraph mark itself which is outside the hit; therefore, it can’t be replaced.

Perhaps if you explain what you want to achieve, we could suggest a solution.

doktoroblivion1 · January 12, 2022, 6:04pm

What I would like to do is delete the area already marked in the above screenshot. This can be done manually by placing the cursor before the pilcrow at the end of the marked area and pressing backspace. I had started doing this but found it very tedious. My hope was that I could automate this with a find/replace that looks for what is marked and replaces it with NOTHING, hence deleting all occurrences of this unneeded space at the end of these cells, in effect cleaning them up and shortening the space in these cells.

Oh BTW, the find criterion of $ does find this area, but others as well so I could not use it to single out the area I am interested in for the change.

mikekaganski · January 12, 2022, 6:07pm

^$ Finds an empty paragraph.

… and replace with nothing.

ajlittoz · January 12, 2022, 6:14pm

@mikekaganski was faster than me to give the answer.

Your screenshot suggests you are working on some genealogical data. Have you considered using a dedicated application which would offer more comfort in managing such data? I recommend GRAMPS, which is a free and open app ported to many platforms.

doktoroblivion1 · January 12, 2022, 6:16pm

@ajlittoz Yeah, but this is just a temporary file that was generated by a website, so I prefer to leave it as-is and just clean it up for easier reading. As for the rest of my serious work, that is all done using LO Global Documents, etc… Thanks for suggesting though.

doktoroblivion1 · January 12, 2022, 6:24pm

@mikekaganski I tried that and it did not work. If I place my cursor before the area of interest with that find criterion and press Enter, it finds the first empty paragraph outside the table. If I repeat from the start of the document, I get the same result.

ajlittoz · January 12, 2022, 6:32pm

Works for me: F&R processing enters table and does its job.

CAVEAT: a table cell creates a subdocument and the same rules apply as in the main document. In a document, you can never delete the last paragraph mark.
Therefore, if the last paragraph in a cell is empty, it can’t be deleted with F&R. And this seems to be the situation you’re facing.
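As a rough illustration of this caveat, the following Python sketch models a cell as a list of paragraphs and mimics replacing `^$` with nothing. It is a simplified model under the assumption stated above (the last paragraph of a container is never removed), not a description of how Writer is implemented; the function name `delete_empty_paragraphs` and the sample name are hypothetical.

```python
def delete_empty_paragraphs(container):
    """Drop empty paragraphs, but never the final paragraph of the container.

    `container` models a document body or a single table cell as a list of
    paragraph strings. This is a toy model of the rule described in the post.
    """
    last = len(container) - 1
    return [
        text
        for index, text in enumerate(container)
        if text != "" or index == last
    ]


# Hypothetical cell whose content ends with two empty paragraphs.
cell = ["Mother: Jane Mary Roe", "", ""]
print(delete_empty_paragraphs(cell))

# An empty paragraph in the middle of a cell can be removed.
print(delete_empty_paragraphs(["A", "", "B"]))
```

In this model the first cell keeps one trailing empty paragraph, which is the leftover space that F&R cannot clean up, while the empty paragraph between `A` and `B` disappears.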

In general, pasting something from the web always ends up with formatting issues because of the differences in founding principles between HTML and ODF.

doktoroblivion1 · January 12, 2022, 9:57pm

@ajlittoz @mikekaganski One last question to both of you. Is it possible to create a regular expression in FIND to find the first non-printable after Mother: and any name containing blank characters in it? Looking at the original screenshot, if I can FIND the pilcrow right after any Mother, that would suffice!

ajlittoz · January 13, 2022, 8:41am

AFAIK no.
You can create a regexp like Mother:.*$ which will highlight from “Mother:” up to but not including the paragraph mark. As already mentioned, as soon as you match on some paragraph contents, you are strictly inside the paragraph and can’t have any action on the mark.
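A small Python sketch of what such a match covers is shown here. It uses Python's `re` module, whose syntax differs in places from the regular expressions in LibreOffice, so treat it as an analogy only; the helper `matched_span` and the sample name are invented for the example.

```python
import re


def matched_span(paragraph):
    """Return the (start, end) span of Mother:.*$ in one paragraph, or None."""
    match = re.search(r"Mother:.*$", paragraph)
    return match.span() if match else None


paragraph = "Mother: Jane Mary Roe"  # hypothetical paragraph text
span = matched_span(paragraph)

print(span)
# The match ends exactly where the paragraph text ends, so there is
# nothing after it inside the string that a replacement could touch.
print(span[1] == len(paragraph))
```

The match runs to the end of the paragraph text and stops there. Whatever separates this paragraph from the next one is not part of the string being searched, which is the point made above about the mark being out of reach.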
You may try a 3-stage strategy (I haven’t checked it) where:

1. You change all para marks $ to some unique pattern like ===.
   All marks except the last ones (document and cells) are replaced.
2. You replace empty “paragraphs” (====== for intermediate ones, ===$ for the last one) with nothing, effectively erasing them.
3. You restore marks, replacing === with \n.
   ~~This is where I have a doubt, not sure if \n will give a paragraph mark or a mere line break~~ (special thanks to @mikekaganski)
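The three stages can be tried out on a toy model in Python before touching a real document. This is a sketch under several assumptions, not a tested Writer recipe: a cell is modelled as a list of paragraph strings, `===` is assumed never to occur in the real text, and the function name and sample names are hypothetical. It also deviates from the literal wording in one place: a doubled `======` is collapsed to a single `===` rather than to nothing, because in this model deleting both would glue the neighbouring paragraphs together.

```python
import re

MARK = "==="  # placeholder pattern, assumed not to appear in the real text


def strip_empty_paragraphs(paragraphs):
    """Toy model of the 3-stage strategy on one container (document or cell)."""
    # Stage 1: every paragraph mark except the last one becomes the placeholder.
    flat = MARK.join(paragraphs)

    # Stage 2a: two or more placeholders in a row mean empty paragraphs in
    # between. Keep a single placeholder so the neighbours stay separate
    # (assumption: this differs from replacing ====== with nothing).
    flat = re.sub(r"(?:===){2,}", MARK, flat)

    # Stage 2b: a placeholder at the very end means the last paragraph is
    # empty. Remove it.
    flat = re.sub(r"(?:===)+$", "", flat)

    # Stage 3: turn the remaining placeholders back into paragraph boundaries.
    return flat.split(MARK)


# Hypothetical cell with one empty paragraph in the middle and one at the end.
cell = ["Father: John Doe", "", "Mother: Jane Mary Roe", ""]
print(strip_empty_paragraphs(cell))
```

For the sample cell, the model returns only the two non-empty paragraphs, including the case of the empty last paragraph that a plain `^$` replacement cannot remove.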

mikekaganski · January 13, 2022, 9:20am

As described in our regex documentation:

\n … A paragraph break that can be entered with the Enter or Return key when in the Replace text box in Writer.

So yes, it will break paragraphs as intended when used in replacement.

doktoroblivion1 · January 13, 2022, 2:07pm

@mikekaganski @ajlittoz Okay thanks…I will fool around with it a bit, but will probably end up doing it manually.