Writer: delete all paragraphs that do not contain a phrase

Hello all,

I have a long text with many paragraphs. Each paragraph is the path to a file or a directory. Some of the paths are to PDF files, like: Path\booktitle.pdf

I want to delete all paragraphs that do not contain the phrase .pdf so that I am left with a text containing just paths to books. Is there a clever way to do this in Writer? Or maybe Calc? There is a maximum of one .pdf per paragraph.

My version is LibreOffice 6.0.2.1

Many thanks,
Ruud

What operating system do you use? There could be some very simple solutions using plain text files, but the tools available depend on the operating system. Thanks. It would also help enormously if you could give some “sample paragraphs”: is it only the path statement? Does the .pdf element always come last? A couple of “real” samples would help a lot.

I use Windows 10.

I have 7 list.txt files, made for 7 disks with the dir command (on a Windows 7 system).

Each list.txt contains many thousands of paths to all kinds of files and directories.

My friend wants to reduce the files so that they contain only the paths to her PDFs. I now realize that the search and replace option in Writer can’t do that easily.

@HappyUser1 - You should be able to use a grep tool for Windows with your text files and it would be a doddle.

So many good answers! I will study the grep tool!

@HappyUser1 - my sense is still that grep is the “right tool” for this job. Something like $ grep pdf data.txt > out.txt would save all PDF “lines/paragraphs” to their own text file in one go. See SuperUser for help on using this method on Windows. Simples. :)
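One thing to watch with a bare pdf pattern: it matches those letters anywhere in the line, so a directory with pdf in its name would also get through, while an upper-case .PDF would not (unless the search is made case-insensitive). A minimal Python sketch of the difference, assuming one path per line; the sample paths and the helper name is_pdf_path are made up for illustration:

```python
def is_pdf_path(line: str) -> bool:
    """Return True if the line ends with a .pdf extension, ignoring case."""
    return line.strip().lower().endswith(".pdf")


# Hypothetical sample lines, not taken from the real list.txt files.
sample = [
    r"D:\Books\novel.pdf",
    r"D:\pdf-tools\readme.txt",
    r"D:\Scans\MANUAL.PDF",
    r"D:\Books",
]

# Loose test: the letters "pdf" anywhere in the line (case-sensitive).
loose = [s for s in sample if "pdf" in s]

# Stricter test: the line has to end with the .pdf extension.
strict = [s for s in sample if is_pdf_path(s)]

print(loose)
print(strict)
```

Whether the stricter test is worth it depends on how the directories on those disks are named.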


Since I didn’t find a way to do this efficiently with the built-in tools, I decided to write a rough piece of code for the purpose. (There was a second, similar request at the same time.)
The code below can be refined in many ways. First of all, it may be desirable to work on the current selection instead of the complete text.

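Sub removeNotHavingPars(Optional pHas As String)
    ' Works on Writer text documents only.
    If Not ThisComponent.SupportsService("com.sun.star.text.TextDocument") Then Exit Sub
    If IsMissing(pHas) Then
        pHas = ".pdf" ' May need refinement!
    End If
    theText   = ThisComponent.Text
    theCursor = theText.CreateTextCursorByRange(theText.Start)
    ' Walk through the text paragraph by paragraph until the cursor reaches the end.
    Do While theText.CompareRegionEnds(theText, theCursor) < 0
        ' Select the current paragraph and empty it if it lacks the phrase.
        theCursor.GotoEndOfParagraph(True)
        If InStr(theCursor.String, pHas) = 0 Then
            theCursor.String = ""
        End If
        If theCursor.String = "" Then
            ' Empty paragraph: also delete the paragraph break that follows it.
            theCursor.GoRight(1, True)
            theCursor.String = ""
        Else
            ' Kept paragraph: step over the break to the next paragraph.
            theCursor.GoRight(1, False)
        End If
    Loop
    ' If the last paragraph ended up empty, remove the break before it.
    theCursor.GotoStartOfParagraph(True)
    If theCursor.String = "" Then
        theCursor.GoLeft(1, True)
        theCursor.String = ""
    End If
End Sub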

You can find it demonstrated in the attached text document.

If your text has a pattern like this (please clarify!):

Path\booktitle.pdf

Path\booktitle

Path\booktitle.pdf

etc.

Then with a simple regular expression search you can select all your “non-PDF” lines with “Find all” and delete them, leaving you with only those lines with the path to the PDF. The regex is:

^(?=.*?Path)(.(?!pdf))*$

[N.b.: “Path” in that regex statement simply reflects the OP’s “dummy text” used in my test file. It will need to be adjusted for “real world” use. See the OP’s comment on this answer, below.]

And this is how it looks:

If your text is embedded in a longer paragraph, there is still likely to be a regular expression that will capture it, but we need more details before offering help with that. Please edit the question with more information.

@floris v had already suggested a solution based on RegEx. It seems it has been withdrawn in the meantime.
Anyway, the original questioner wanted to delete the paragraphs not containing the “phrase”. Using ‘Find All’ this way would require a subsequent copy/paste. This would, however, result in a single paragraph containing all the findings.
Can you also suggest a RegEx to ‘Find All’ the paragraphs not containing one of the critical strings, for deletion?

@Lupp - Hmmm … I see. One could break up the single paragraph containing all the findings easily enough, but perhaps there is a negative lookahead or something. I’ll do a little thinking…

(30 mins later) … And I think I cracked it! See what you think.

Oh, that is interesting! It’s just that the paths in the files I have aren’t all called Path. Actually, none of them is. They are all sorts of paths, like D:\TopDirName\SecondDirName\filename.pdf

^(?=.*?Path)(.(?!pdf))*$ selects nothing, but after a small change, removing Path,

^(?=.*?)(.(?!pdf))*$ selects precisely those lines that need to be deleted.

Awesome! Thanks!

Yes. I had also tried it with a negative lookahead, but obviously I missed a point. Your suggestion works fine (except for the constant used for the path). To allow for the “word” pdf occurring without the extension delimiter ., I would now suggest using ^(.(?!\.pdf))*$. (There is the tiny drawback of the empty paragraphs remaining.)
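For anyone who wants to try the lookahead outside Writer, here is a small Python sketch. It assumes one path per line and uses made-up sample paths. Python’s re module is a different engine from the ICU one that LibreOffice uses, but this particular pattern should behave the same way in both. The second step (also consuming the line break) is my own addition, as one way around the empty lines that are left behind:

```python
import re

# The pattern suggested above: a whole line in which no character is
# immediately followed by ".pdf".
NOT_PDF_LINE = re.compile(r"^(.(?!\.pdf))*$", re.MULTILINE)

# Hypothetical sample paths, one per line.
text = "\n".join([
    r"D:\TopDirName\SecondDirName\filename.pdf",
    r"D:\TopDirName\notes.txt",
    r"D:\TopDirName\SecondDirName",
    r"D:\TopDirName\other.pdf",
])

# What "Find All" would select: the lines without ".pdf".
selected = [m.group(0) for m in NOT_PDF_LINE.finditer(text)]

# Deleting only the match leaves empty lines behind. Consuming the
# trailing line break as well removes each unwanted line completely.
remaining = re.sub(r"^(.(?!\.pdf))*$\n?", "", text, flags=re.MULTILINE)

print(selected)
print(remaining)
```

Note that the pattern is case-sensitive here, so a file ending in .PDF would be treated as a non-PDF line unless re.IGNORECASE is added.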

Thank you Lupp! That was helpful.

Using your hint, I made a small Visual Basic program that can read a text file into a string array. Each element in the array that contains “.pdf” is then appended to a new, initially empty array.

The new array is saved to disk as a new text file.

As a result, the output file contains only strings with “.pdf” in them.

Job done! Thanks for pointing me in the right direction.
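The Visual Basic program itself was not posted, so the following is only a rough Python sketch of the same idea (read the lines, keep those containing “.pdf”, write them to a new file). The function name keep_pdf_lines, the file names, and the UTF-8 default are assumptions for illustration; files produced by the dir command may well use a different encoding, which can be passed in.

```python
from pathlib import Path


def keep_pdf_lines(src: Path, dst: Path, needle: str = ".pdf",
                   encoding: str = "utf-8") -> int:
    """Write the lines of src that contain needle to dst.

    The comparison ignores case. Returns the number of lines kept.
    """
    # Read the whole file into a list of lines (the "string array").
    lines = src.read_text(encoding=encoding, errors="replace").splitlines()

    # Append every line that contains the needle to a new list.
    wanted = needle.lower()
    kept = [line for line in lines if wanted in line.lower()]

    # Save the new list to disk as a new text file.
    dst.write_text("\n".join(kept) + ("\n" if kept else ""), encoding=encoding)
    return len(kept)
```

It could be called once per list file, for example keep_pdf_lines(Path("list.txt"), Path("pdf_only.txt")), where both file names are placeholders.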