Using regular expression to clean up texts?

I need something automated like:

From this (courtesy: Revolution OS):

0

00:00:04,180 --> 00:00:05,700

I was at Agenda 2000

1

00:00:06,300 --> 00:00:09,100

and uh, one of the people who was there

was Craig Mundie,

2

00:00:09,100 --> 00:00:12,180

who is some kind of

high mucky muck at Microsoft,

3

00:00:12,700 --> 00:00:16,180

I think uh, vice-president of consumer products

or something like that.

and so on…

To this:

I was at Agenda 2000, and uh, one of the people who was there was Craig Mundie, who is some kind of high mucky muck at Microsoft, I think uh, vice-president of consumer products or something like that.

and so on for hundreds of lines.

For this, I have to first remove the counts from 0,1,2, … , which I did using ^[0-9]*$

Then I want to remove the time periods, e.g., 00:00:12,700 --> 00:00:16,180, and so on.

I tried 0[0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9] --> 0[0-9]:[0-9][0-9]:[0-9][0-9],[0-9][0-9][0-9]$ and could of course remove them, but that is too specific and not at all general.

What would be the most general regular expression, with the fewest possible characters, to remove the time periods using Find & Replace?


            RESPONSE TO RPR's POST BELOW

Hello, rpr!

Is further brevity possible? Without counting the places?

I did consult the help page in question. It is not sequential.

Assume there are the following link patterns:

    " target="_blank">Argentina</a></
    
    " target="_blank">Austria</a></
    
    " target="_blank">Belgium</a></
    
    " target="_blank">Bolivia</a></
    
    ...
    
    " target="_blank">Costa Rica</a></
    
    " target="_blank">Czech Republic</a></
    
    ... 

and so on.

I can remove these by using:

" target="_blank">[A-Z .]*</a></ 

or

" target="_blank">[A-Z. ]*</a></

But why not " target="_blank">*</a></ simply ?

And how do I replace an empty paragraph mark, i.e., ¶?

In your case, why not ^[0-9:,*->]*[0-9:,*]*$ instead of ^[0-9:,]{12} --> [0-9:,]{12}$ ?

And thank you! I thought I wouldn’t receive any response this time.


To RPR: From: bkpsusmitaa At 17:34 IST (+5:30 Hrs. UST)

Thank you for your extensive and elaborate reply. I admire your understanding of the issue. So I have to judge, case by case, which regex will give the desired output.

And regarding replacing ^$ with an empty string: what is an empty string? I don’t know. Does it mean leaving the Replace field blank?

In the meanwhile, my regards for your support.

[Reply to RPR at 08:03 Hrs. IST +5:30 UST]

Yes, I now understand your statement “with empty string”. I got confused unduly :)

I write a file with garbage strings, just for testing.

dfssddf

otte



adiojrr

rtjewup

and save the file as test.odt. Then I run the ^$ Find-Replace with the replace field blank.

Thank you for your post. I really appreciate your kind disposition.

In the meantime, please find time to test the target="_blank">*</a></ issue by populating a blank ODT file with the strings given.

Reply to RPR at 07:33 IST on 10th August 2015

Yes, I understood. Thank you. Actually, coming from MS Office, which was the office software used by government offices, I had difficulty assimilating the differences between MSO and LibO. I am still learning. Old habits die hard :)

Thank you once more. Time to close this thread. I hope this thread will help people in time to come.

Editing responses into the question has made this very confusing to read. Questions should be questions; answers should be answers!

Sorry :( for the confusion. I have formatted the content to alleviate it. Hope that helped.
Thanks for pointing it out.
Regards

You can use the following shorter regular expression for finding time periods in that text:

^[0-9:,]{12} --> [0-9:,]{12}$
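
For anyone who wants to try this expression outside LibreOffice, here is a minimal sketch in Python. It assumes Python's re module, which is not the engine LibreOffice uses but treats these particular constructs the same way. The sample lines and the names TIMESTAMP, lines and kept are only for illustration.

```python
import re

# The regex from this answer: 12 characters drawn from digits, ':' and ','
# on each side of the ' --> ' separator, anchored to the whole line.
TIMESTAMP = re.compile(r"^[0-9:,]{12} --> [0-9:,]{12}$")

lines = [
    "1",
    "00:00:06,300 --> 00:00:09,100",
    "and uh, one of the people who was there",
]

# Keep every line that is not a time period.
kept = [line for line in lines if not TIMESTAMP.match(line)]
print(kept)
```

Only the time-period line is dropped; the counter line "1" still needs its own expression.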

The LibreOffice help gives a basic explanation of regular expressions.

It is important to understand the meaning of special characters in regular expressions. For example, * matches a string of zero or more of the preceding element (character or expression), which differs from the meaning of * in glob patterns (used for matching file names in a command shell, for example), where it matches any string of characters.

Also note that when * is inside [], then it does not have a special meaning. So,

[0-9:,*]

matches any of the following characters: 0 1 2 3 4 5 6 7 8 9 : , *

while

[0-9:,*]*

matches a string of zero or more of any of those characters.
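
The difference between the two forms can be checked with a small sketch. It again assumes Python's re module as a stand-in for the LibreOffice engine, and the names char_class and repeated are made up for the example.

```python
import re

# Inside [], the * is an ordinary character: the class matches exactly one
# character out of the digits, ':', ',' and '*'.
char_class = re.compile(r"[0-9:,*]")

# Outside [], the trailing * is a quantifier: zero or more characters
# from that class.
repeated = re.compile(r"[0-9:,*]*")

print(bool(char_class.fullmatch("*")))     # a literal asterisk is in the class
print(bool(char_class.fullmatch(";")))     # a semicolon is not
print(bool(repeated.fullmatch("12:*,")))   # any run of class characters
print(bool(repeated.fullmatch("")))        # including the empty string
```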

When you construct regular expressions for removing some strings from the text, you have to make sure that the regular expression will not also match parts of the text that you want to keep. For example, the following regex

^[0-9:, >-]*$

matches lines like

00:00:06,300 --> 00:00:09,100

but it also matches the following lines:

123
123-456
123 > -456

So, if you have such lines in your text and you want to keep them, then you must construct a more suitable regex. That’s why I suggested ^[0-9:,]{12} --> [0-9:,]{12}$
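
A quick way to see the difference is to run both expressions over the same sample lines. This is only a sketch using Python's re module (an assumption; LibreOffice uses its own engine), with BROAD, STRICT and samples as illustrative names.

```python
import re

# The short but overly general expression discussed above.
BROAD = re.compile(r"^[0-9:, >-]*$")
# The stricter expression suggested in this answer.
STRICT = re.compile(r"^[0-9:,]{12} --> [0-9:,]{12}$")

samples = [
    "00:00:06,300 --> 00:00:09,100",
    "123",
    "123-456",
    "123 > -456",
]

broad_hits = [s for s in samples if BROAD.match(s)]
strict_hits = [s for s in samples if STRICT.match(s)]

print(broad_hits)   # all four lines would be removed
print(strict_hits)  # only the time-period line would be removed
```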

Regarding removing paragraph marks from the text, in LibreOffice you can do the following:

  • Replace regex ^$ with empty string to remove empty paragraphs from the text. LibreOffice help explains that ^ matches the beginning of a paragraph, while $ matches the end of a paragraph. So, ^$ matches empty paragraphs (paragraphs without any characters in them).
  • Replace regex $ with empty string to remove paragraph marks, i.e. to join paragraphs into one paragraph.

All these regular expressions are used in LibreOffice in this way: you write (or paste) the regex in the Search For field, leave the Replace With field empty, and then click Replace All to remove (filter out) all parts of the text that match the regex. (You probably know that you can do find & replace on the whole text or on the current selection only.)
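
If the text is available as a plain file, the same sequence of steps (drop the counters, drop the time periods, drop the empty paragraphs, join the rest) can be sketched in Python. This is not what LibreOffice does internally; it is a rough equivalent under the assumption that each paragraph is one line. The function name clean is hypothetical, and the counter regex uses + instead of the * from the question so that it does not match empty lines.

```python
import re

raw = """0

00:00:04,180 --> 00:00:05,700

I was at Agenda 2000

1

00:00:06,300 --> 00:00:09,100

and uh, one of the people who was there

was Craig Mundie,
"""

COUNTER = re.compile(r"^[0-9]+$")  # lines holding only a subtitle number
TIMESTAMP = re.compile(r"^[0-9:,]{12} --> [0-9:,]{12}$")


def clean(text):
    """Drop counters, time periods and empty lines, then join the rest."""
    kept = []
    for line in text.splitlines():
        line = line.strip()
        if not line or COUNTER.match(line) or TIMESTAMP.match(line):
            continue
        kept.append(line)
    # Joining with a space plays the role of removing the paragraph marks.
    return " ".join(kept)


result = clean(raw)
print(result)
```

Note that a spoken line consisting only of digits would also be dropped by COUNTER, so the same caution about over-general expressions applies here.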

To remove patterns like the following

" target="_blank">United States</a></

I would use the following regex in the Search For field:

" target="_blank">[ A-Za-z]*</a></

or

" target="_blank">[ A-Za-z]+</a></

where

  • [ A-Za-z] matches a space character or any uppercase or lower case letters from the English alphabet
  • the * character matches a string of zero or more of the preceding element (here, the bracket expression)
  • the + character matches a string of one or more of the preceding element

I hope you now see why the following regex is not good for matching the above patterns:

" target="_blank">*</a></
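
To make the reason concrete: in that expression the * applies to the > character before it, so it means "zero or more > characters", not "any text". A small sketch with Python's re module (an assumption, used here only as a stand-in for the LibreOffice engine; STAR_ONLY and WITH_CLASS are illustrative names):

```python
import re

text = '" target="_blank">Argentina</a></'

# Here * quantifies the preceding '>' character.
STAR_ONLY = re.compile(r'" target="_blank">*</a></')
# Here * quantifies the bracket expression, so letters and spaces are allowed.
WITH_CLASS = re.compile(r'" target="_blank">[ A-Za-z]*</a></')

print(STAR_ONLY.search(text))         # None: "Argentina" is not a run of '>'
print(bool(WITH_CLASS.search(text)))  # True

# What the star-only version does match: repeated '>' and nothing else.
print(bool(STAR_ONLY.search('" target="_blank">>></a></')))
```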

If you have words that also contain non-English letters, you can match them with the following regex:

[:alpha:]+

while if you want to match words and spaces between them, you can use

( |[:alpha:])+

where

  • | matches either of the expressions that occur before and after it (you can read it as “or”: a|b means “match a or b”)
  • () the parentheses are used for grouping so that you can create a subexpression and apply a special character to that subexpression (in the above regex the special character + is applied to the subexpression)
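
Grouping and alternation can be tried out in Python as well, with one caveat: Python's re module does not support the [:alpha:] class, so this sketch swaps in [^\W\d_] as a rough stand-in for "any letter". That substitution is my assumption and not LibreOffice syntax; the name WORDS is illustrative.

```python
import re

# ( |letter)+ : one or more items, each being either a space or a letter.
# [^\W\d_] is used in place of [:alpha:], which Python's re does not support.
WORDS = re.compile(r"( |[^\W\d_])+")

print(bool(WORDS.fullmatch("Costa Rica")))    # English letters and a space
print(bool(WORDS.fullmatch("São Tomé")))      # non-English letters also match
print(bool(WORDS.fullmatch("Costa Rica 2")))  # a digit is neither alternative
```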

Again, I recommend carefully reading the LibreOffice help on regular expressions. The best way to learn regular expressions is to construct them yourself and use them for search & replace. Start with simple expressions to understand the various special characters, and then construct more and more complex expressions. On Linux/Unix systems, regular expressions are widely used wherever there is a need to find and replace text patterns.

Reply to rpr added to the post itself.