# Replace blocks of text in LibreOffice Calc using regexp

I'm new to regexp. I am working with a buggy program that exports text to spreadsheet form. The bug creates duplicate blocks of text in a cell; the duplicate follows a double paragraph break. An example is shown below. Can regexp and find/replace get rid of the duplicate text? Can anyone assist with the regexp needed?

eg. %this appears in a single cell

Paragraph 1

Paragraph 2

Paragraph 3

Paragraph 1 %these are all repeated after the 2 paragraph spaces

Paragraph 2

Paragraph 3

%I'd like the output to just show Para 1, Para 2 and Para 3 without the duplicate

Sample data file here


Try this (edited):

Search for ^([^\00]+)\n+\1$ and replace with $1


Since the null character \00 cannot occur in cells, and "paragraphs" are strings separated by line breaks \n, you can use the [^\00] pattern to match every possible character, line breaks included. The usual . cannot do this, because it doesn't match \n.

This also accounts for the second (duplicate) data block ending with a newline: the captured group includes that trailing newline, so \1 matches it as well.
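For anyone who wants to check the pattern outside Calc, here is a minimal sketch in Python. It is only an approximation: Calc's Find & Replace uses the ICU regex engine, while this uses Python's re module, where the replacement is written \1 instead of $1. The helper name remove_duplicate_block and the sample text are made up for illustration (two identical blocks, each ending in a newline, separated by one empty paragraph).

```python
import re

# Same pattern as in the answer: capture a block, skip one or more
# newlines, then require an identical copy of the block up to the end.
DUPLICATE_BLOCK = re.compile(r"^([^\00]+)\n+\1$")


def remove_duplicate_block(cell_text: str) -> str:
    """Return cell_text with the repeated second block removed (hypothetical helper)."""
    # Python writes the backreference as \1 in the replacement; Calc uses $1.
    return DUPLICATE_BLOCK.sub(r"\1", cell_text)


# Made-up sample: each block ends with a newline, and a single empty
# paragraph separates the original block from its duplicate.
block = "Paragraph 1\nParagraph 2\nParagraph 3\n"
cell = block + "\n" + block

print(repr(remove_duplicate_block(cell)))
```

A cell without a duplicated block does not match the pattern, so it should come back unchanged.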


Thanks for the reply Mike. Unfortunately it does not seem to work. Do you have any other suggestions?

( 2019-03-04 22:24:20 +0200 )

Yes. My suggestion is "please provide a sample data file to see what is different from what was described in the question" :-)

( 2019-03-04 22:27:54 +0200 )

Sure! I've attached it to the question. I should have noted, there was no error message, the search string just did not seem to be recognised.

( 2019-03-04 23:00:55 +0200 )

Thanks! Of course, there were details: the duplicated data contained a trailing newline; so strictly speaking, your cells contain two identical blocks separated by a single empty paragraph:

(para1<nl>para2<nl>...paraN<nl>)<nl>(para1<nl>para2<nl>...paraN<nl>).

( 2019-03-05 05:39:40 +0200 )

Success! Much appreciated. Thanks for shining the light as I stumbled around in the dark = ) I'm a newbie on this forum so don't have the points to upvote this very useful answer. One additional question, if it's just one line break which separates duplicates, how does the search string change?

( 2019-03-06 00:23:29 +0200 )

It was \n\n which I used initially to separate parts; now I use \n+ to match one or more newlines between identical parts. Just in case. If it were strictly one, I'd use \n there.

( 2019-03-06 04:37:29 +0200 )
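
To see how the three separators (\n, \n\n and \n+) behave, here is a rough sketch in Python. It is not Calc itself: Python's re module stands in for the ICU engine that Calc uses, fullmatch stands in for the ^ and $ anchors, and the helper duplicate_pattern and the sample strings are invented for the comparison.

```python
import re


def duplicate_pattern(separator: str) -> re.Pattern:
    """Build the search pattern with a given separator (hypothetical helper)."""
    # [^\00]+ matches any run of characters, newlines included.
    return re.compile(r"([^\00]+)" + separator + r"\1")


strict_one = duplicate_pattern(r"\n")    # exactly one newline between copies
strict_two = duplicate_pattern(r"\n\n")  # exactly two newlines between copies
flexible = duplicate_pattern(r"\n+")     # one or more newlines between copies

# Made-up cell contents, without trailing newlines to keep the comparison simple.
one_break = "A\nB\nA\nB"
two_breaks = "A\nB\n\nA\nB"

for name, pattern in [("\\n", strict_one), ("\\n\\n", strict_two), ("\\n+", flexible)]:
    # fullmatch plays the role of the ^ and $ anchors in the Calc search string.
    hits = [bool(pattern.fullmatch(text)) for text in (one_break, two_breaks)]
    print(name, hits)
```

Under these assumptions, the strict separators each accept only their own layout, while \n+ accepts both, which is why it is the safer choice when the number of line breaks is not known in advance.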