Replace blocks of text in LibreOffice Calc using regexp

asked 2019-03-03 23:20:28 +0200

goldofthesun

updated 2019-03-04 23:02:11 +0200

I'm new to regexp. I am working with a buggy program that exports text to spreadsheet form. The bug creates duplicate blocks of text in a cell; the duplicate follows a double paragraph break. An example is shown below. Can regexp and find/replace get rid of the duplicate text? Can anyone assist with the regexp needed?

eg. %this appears in a single cell

Paragraph 1

Paragraph 2

Paragraph 3

Paragraph 1 %these are all repeated after the 2 paragraph spaces

Paragraph 2

Paragraph 3

%I'd like the output to just show Para 1, Para 2 and Para 3 without the duplicate

Sample data file here


1 Answer

answered 2019-03-04 06:31:12 +0200

updated 2019-03-05 06:25:49 +0200

Try this (edited):

^([^\00]+)\n+\1$

replaced with

$1

Since the null character \00 cannot occur in cells, and "paragraphs" are strings separated by line breaks \n, you can use the [^\00] pattern to match every possible character, line breaks included. The usual . won't do here, because it doesn't match \n.

The pattern also takes into account that the second (duplicate) data block ends with a newline as well.
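
For anyone who wants to try the pattern outside Calc, here is a minimal Python sketch of the same idea. It is an illustration only: Python's `re` module is not the regex engine Calc uses, the function name `remove_duplicate_block` is made up for this example, NUL is spelled `\x00`, and the replacement is written `\1` because Python does not accept `$1`.

```python
import re

# Same idea as the Calc pattern: capture a block of any characters
# (newlines included), then one or more newlines, then the same block again.
# [^\x00] means "any character except NUL".
DUPLICATE_BLOCK = re.compile(r"^([^\x00]+)\n+\1$")


def remove_duplicate_block(cell_text: str) -> str:
    """Return cell_text with the repeated second block removed, if there is one."""
    # Hypothetical helper for this example; not part of Calc or any library.
    return DUPLICATE_BLOCK.sub(r"\1", cell_text)


# Cell content shaped like the sample: block, empty line, same block again.
cell = (
    "Paragraph 1\nParagraph 2\nParagraph 3\n"
    "\n"
    "Paragraph 1\nParagraph 2\nParagraph 3\n"
)
print(repr(remove_duplicate_block(cell)))
```

Text without a repeated block does not match the pattern, so it comes back unchanged.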


Comments

Thanks for the reply Mike. Unfortunately it does not seem to work. Do you have any other suggestions?

goldofthesun ( 2019-03-04 22:24:20 +0200 )

Yes. My suggestion is "please provide a sample data file to see what is different from what was described in the question" :-)

Mike Kaganski ( 2019-03-04 22:27:54 +0200 )

Sure! I've attached it to the question. I should have noted, there was no error message, the search string just did not seem to be recognised.

goldofthesun ( 2019-03-04 23:00:55 +0200 )

Thanks! Of course, there were details: the duplicated data contained a trailing newline; so strictly speaking, your cells contain two identical blocks separated by a single empty paragraph:

(para1<nl>para2<nl>...paraN<nl>)<nl>(para1<nl>para2<nl>...paraN<nl>).

Mike Kaganski ( 2019-03-05 05:39:40 +0200 )

Success! Much appreciated. Thanks for shining the light as I stumbled around in the dark = ) I'm a newbie on this forum so don't have the points to upvote this very useful answer. One additional question, if it's just one line break which separates duplicates, how does the search string change?

goldofthesun ( 2019-03-06 00:23:29 +0200 )

It was \n\n which I used initially to separate parts; now I use \n+ to match one or more newlines between identical parts. Just in case. If it were strictly one, I'd use \n there.

Mike Kaganski ( 2019-03-06 04:37:29 +0200 )
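
To see how the choice of separator (`\n`, `\n\n` or `\n+`) changes what gets matched, here is a rough Python sketch. It is not Calc itself: Python's `re` module stands in for Calc's regex engine, `build_pattern` is a hypothetical helper invented for this example, and NUL is written `\x00`.

```python
import re


def build_pattern(separator: str) -> re.Pattern:
    """Build the duplicate-block pattern with the given separator fragment.

    Hypothetical helper for illustration; `separator` is a regex fragment
    describing the newlines between the two identical blocks.
    """
    return re.compile(r"^([^\x00]+)" + separator + r"\1$")


strict_one = build_pattern(r"\n")     # exactly one newline between the parts
strict_two = build_pattern(r"\n\n")   # exactly two newlines between the parts
one_or_more = build_pattern(r"\n+")   # one or more newlines, "just in case"

single = "A\nB\nA\nB"    # duplicate separated by a single line break
double = "A\nB\n\nA\nB"  # duplicate separated by an empty line

for name, pattern in [("\\n", strict_one), ("\\n\\n", strict_two), ("\\n+", one_or_more)]:
    print(name, "single:", bool(pattern.match(single)), "double:", bool(pattern.match(double)))
```

In this sketch the strict variants each accept only their own spacing, while `\n+` accepts both inputs.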