Replace blocks of text in librecalc using regexp

goldofthesun · March 3, 2019, 10:20pm

I’m new to regexp. I am working with a buggy program that exports text to spreadsheet form. The bug creates duplicate blocks of text in a cell; the duplicate follows a double paragraph space. An example is shown below. Can regexp and find/replace get rid of the duplicate text? Can anyone assist with the regexp needed?

eg. %this appears in a single cell

Paragraph 1

Paragraph 2

Paragraph 3

Paragraph 1 %these are all repeated after the 2 paragraph spaces

Paragraph 2

Paragraph 3

%I’d like the output to just show Para 1, Para 2 and Para 3 without the duplicate

Sample data file here

mikekaganski · March 4, 2019, 5:31am

Try this (edited):

^([^\00]+)\n+\1$

replaced with

$1

Since the null character \00 cannot occur in cells, and “paragraphs” are strings separated by line breaks \n, you can use the [^\00] pattern to match every possible character, including \n. The usual . cannot do that, because it doesn’t match \n.

This will also take into account that the second (duplicate) data block also ends with newline.
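
To see the difference between . and [^\00] outside Calc, here is a minimal Python sketch. Python's re module is not the regex engine Calc uses, so this only illustrates the idea: \A and \Z stand in for ^ and $, \x00 for \00, and the cell text is a made-up sample.

```python
import re

# Hypothetical cell content: three paragraphs, a blank line, then the same three again.
cell = (
    "Paragraph 1\nParagraph 2\nParagraph 3"
    "\n\n"
    "Paragraph 1\nParagraph 2\nParagraph 3"
)

# "." stops at line breaks, so "(.+)" can only capture text within one line.
dot_pattern = re.compile(r"\A(.+)\n+\1\Z")

# "[^\x00]" matches any character except NUL, so it also crosses "\n".
any_pattern = re.compile(r"\A([^\x00]+)\n+\1\Z")

dot_result = dot_pattern.sub(r"\1", cell)  # no match: the cell comes back unchanged
any_result = any_pattern.sub(r"\1", cell)  # match: only the first block is kept

print(repr(dot_result))
print(repr(any_result))
```

The replacement r"\1" plays the role of $1 in the Find & Replace dialog.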

goldofthesun · March 4, 2019, 9:24pm

Thanks for the reply Mike. Unfortunately it does not seem to work. Do you have any other suggestions?

mikekaganski · March 4, 2019, 9:27pm

Yes. My suggestion is “please provide a sample data file to see what is different from what was described in the question”.

goldofthesun · March 4, 2019, 10:00pm

Sure! I’ve attached it to the question. I should have noted, there was no error message, the search string just did not seem to be recognised.

mikekaganski · March 5, 2019, 4:39am

Thanks! Of course, there were details: the duplicated data contained a trailing newline; so strictly speaking, your cells contain two identical blocks separated by a single empty paragraph:

(para1<nl>para2<nl>...paraN<nl>)<nl>(para1<nl>para2<nl>...paraN<nl>).
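
As a rough check of that layout, the following Python sketch builds such a cell and shows what the capture group ends up holding. The paragraph names are placeholders, and Python's re is only a stand-in for the engine Calc uses.

```python
import re

# Each paragraph ends with a line break, including the last one in the block.
block = "para1\npara2\nparaN\n"

# Two identical blocks separated by a single empty paragraph: (block)<nl>(block)
cell = block + "\n" + block

# fullmatch requires the pattern to cover the whole cell, like ^...$ in the thread.
match = re.fullmatch(r"([^\x00]+)\n+\1", cell)
captured = match.group(1) if match else None

print(repr(captured))
```

Here the group captures the whole block including its trailing line break, and \n+ consumes only the single line break of the empty paragraph.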

goldofthesun · March 5, 2019, 11:23pm

Success! Much appreciated. Thanks for shining the light as I stumbled around in the dark = ) I’m a newbie on this forum so don’t have the points to upvote this very useful answer. One additional question, if it’s just one line break which separates duplicates, how does the search string change?

mikekaganski · March 6, 2019, 3:37am

It was \n\n which I used initially to separate parts; now I use \n+ to match one or more newlines between identical parts. Just in case. If it were strictly one, I’d use \n there.
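
The three separator choices can be compared side by side in a small Python sketch. The helper name dedupe and the sample text are made up for illustration, and Python's re is only an approximation of the engine Calc uses (\A and \Z stand in for ^ and $).

```python
import re


def dedupe(cell: str, separator: str) -> str:
    """Collapse 'block + separator + block' to 'block'; leave other cells alone."""
    pattern = r"\A([^\x00]+)" + separator + r"\1\Z"
    return re.sub(pattern, r"\1", cell)


block = "Paragraph 1\nParagraph 2"
one_break = block + "\n" + block      # duplicates separated by one line break
two_breaks = block + "\n\n" + block   # duplicates separated by two line breaks

# Strict separators only handle the exact number of line breaks they name.
strict_one = (dedupe(one_break, r"\n"), dedupe(two_breaks, r"\n"))
strict_two = (dedupe(one_break, r"\n\n"), dedupe(two_breaks, r"\n\n"))

# "\n+" accepts one or more line breaks, so it handles both cells.
flexible = (dedupe(one_break, r"\n+"), dedupe(two_breaks, r"\n+"))

print(strict_one, strict_two, flexible, sep="\n")
```

In this sketch the strict \n and \n\n variants each leave the other kind of cell untouched, while \n+ cleans up both.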