Writer: delete all paragraphs that do not contain a phrase

asked 2018-04-10 01:58:58 +0100

Happy User

Hello all,

I have a long text with many paragraphs. Each paragraph is the path to a file or a directory. Some of the paths are to PDF files, like: Path\booktitle.pdf

I want to delete all paragraphs that do not contain the phrase .pdf so that I am left with a text containing just paths to books. Is there a clever way to do this in Writer? Or maybe Calc? There is a maximum of one .pdf per paragraph.

My version is LibreOffice 6.0.2.1

Many thanks, Ruud


Comments

What operating system do you use? There could be some very simple solutions using plain text files, but the tools available depend on the operating system. Thanks. It would also help enormously if you could give some "sample paragraphs". Is it only the path statement? Does the .pdf element always come last? A couple of "real" samples would help a lot.

David ( 2018-04-10 20:31:43 +0100 )

I use Windows 10.

I have 7 list.txt files made for 7 disks with the dir command (on a Windows 7 system).

Each list.txt contains many thousands of paths to all kinds of files and directories.

My friend wants to reduce the files so that they contain only the paths to her PDFs. I now realize that the search and replace option in Writer can't do that easily.

Happy User ( 2018-04-10 21:52:47 +0100 )

@Happy User - You should be able to use a grep tool for Windows with your text files and it would be a doddle.

David ( 2018-04-10 22:01:31 +0100 )

So many good answers! I will study the grep tool!

Happy User ( 2018-04-10 22:47:47 +0100 )

@Happy User - my sense is still that grep is the "right tool" for this job. Something like $ grep pdf data.txt > out.txt would save all PDF "lines/paragraphs" to their own textfile in one go. See SuperUser for help on using this method on Windows. Simples. :)

David ( 2018-04-12 12:24:51 +0100 )

3 Answers

answered 2018-04-10 21:07:52 +0100

David

updated 2018-04-11 17:33:20 +0100

IF your text has a pattern like this (please clarify!):

Path\booktitle.pdf

Path\booktitle

Path\booktitle.pdf

etc.

Then with a simple regular expression search you can select all your "non-PDF" lines with "Find all" and delete them, leaving you with only those lines with the path to the PDF. The regex is:

^(?=.*?Path)(.(?!pdf))*$

N.B.!!! "Path" in that regex is simply the "dummy text" from OP's example, which I used in my test file. It will need to be adjusted for "real world" use. See OP's comment to this answer, below.

And this is how it looks: [screenshot]

If your text is embedded in a longer paragraph, there is still likely to be a regular expression that will capture it, but we need more details before offering help with that. Please edit the question with more information.
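For anyone who wants to check the pattern outside Writer, here is a minimal sketch in Python. It assumes Python's re module treats this particular pattern the same way as the regex engine in Writer's Find & Replace, which I have not verified beyond simple cases like these. The sample lines are the three dummy lines from the answer; the variable names are only for illustration.

```python
import re

# Pattern from the answer. "Path" is the dummy text of the sample lines.
NON_PDF = re.compile(r"^(?=.*?Path)(.(?!pdf))*$")

sample_lines = [
    r"Path\booktitle.pdf",
    r"Path\booktitle",
    r"Path\booktitle.pdf",
]

# Lines the pattern matches are the ones "Find all" would select for deletion.
selected = [line for line in sample_lines if NON_PDF.match(line)]
# Everything else is what remains after deleting the selection.
kept = [line for line in sample_lines if not NON_PDF.match(line)]

print("selected for deletion:", selected)
print("kept:", kept)
```

The repeated group (.(?!pdf))* can only reach the end of the line if no character is directly followed by "pdf", so lines ending in .pdf are never selected.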


Comments

@Floris v had already suggested a solution based on RegEx. It seems to have been withdrawn in the meantime. Anyway, the original question was about deleting the paragraphs not containing the "phrase". Using 'Find All' this way would require a subsequent copy/paste. This would, however, result in a single paragraph containing all the findings.
Can you also suggest a RegEx to 'Find All' the paragraphs not containing one of the critical strings, so that they can be deleted?

Lupp ( 2018-04-10 21:16:36 +0100 )

@Lupp - Hmmm ... I see. One could break up the single paragraph containing all the findings easily enough, but perhaps there is a negative lookahead or something. I'll do a little thinking.....

(30 mins later) ... And I think I cracked it! See what you think.

David ( 2018-04-10 21:31:01 +0100 )

Oh that is interesting! It's just that the paths in the files I have aren't all called Path. Actually, none of them is. They are all sorts of paths, like D:\TopDirName\SecondDirName\filename.pdf

^(?=.*?Path)(.(?!pdf))*$ selects nothing, but after a small change, removing Path,

^(?=.*?)(.(?!pdf))*$ selects precisely those lines that need to be deleted

Awesome! Thanks!

Happy User ( 2018-04-10 22:45:31 +0100 )

Yes. I had also tried it with a negative lookahead, but obviously I missed a point. Your suggestion works fine (except for the constant used for the path). To allow for the "word" pdf where it does not occur together with the extension delimiter ".", I would now suggest using ^(.(?!\.pdf))*$. (There is the tiny drawback of the empty paragraphs remaining.)

Lupp ( 2018-04-10 23:29:55 +0100 )
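The difference between the variants in these comments can be seen in a small Python sketch. It assumes, again, that Python's re module behaves like Writer's engine for these patterns. Only the first test line comes from the thread. The other two are hypothetical edge cases, and the third pattern (lookahead placed before the dot) is a common alternative ordering that nobody in the thread proposed.

```python
import re

WORD_ONLY = re.compile(r"^(.(?!pdf))*$")          # lookahead for the bare word "pdf"
WITH_DOT = re.compile(r"^(.(?!\.pdf))*$")         # variant suggested in the comment above
LOOKAHEAD_FIRST = re.compile(r"^((?!\.pdf).)*$")  # alternative ordering (assumption)

lines = [
    r"D:\TopDirName\SecondDirName\filename.pdf",  # example path from the thread
    r"D:\pdfs\notes.txt",                         # hypothetical: "pdf" only in a directory name
    ".pdf",                                       # hypothetical: line starts with ".pdf"
]

def selected(pattern, candidates):
    """Return the lines the pattern matches, i.e. the ones that would be deleted."""
    return [line for line in candidates if pattern.match(line)]

for name, pattern in [
    ("WORD_ONLY", WORD_ONLY),
    ("WITH_DOT", WITH_DOT),
    ("LOOKAHEAD_FIRST", LOOKAHEAD_FIRST),
]:
    print(name, selected(pattern, lines))
```

With the bare word "pdf", a non-PDF file inside a directory such as pdfs is not selected and would therefore survive the deletion. Adding the dot fixes that. Because the lookahead sits after the dot, the very first character of a line is never checked, so a line that begins with ".pdf" would be selected by mistake. For paths that start with a drive letter this cannot happen, so the difference is mostly academic here.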

answered 2018-04-10 13:29:00 +0100

Lupp

updated 2018-04-10 21:08:59 +0100

(Edited)

Since I didn't find a way to do this efficiently with the built-in tools, I decided to write a rough piece of code for the purpose. (There was a second similar request at the same time.) The code below can be refined in many ways. First of all, it may be desirable to work on the current selection instead of the complete text.

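Sub removeNotHavingPars(Optional pHas As String)
    ' Only run on Writer text documents.
    If Not ThisComponent.SupportsService("com.sun.star.text.TextDocument") Then Exit Sub
    ' Default phrase to look for.
    If IsMissing(pHas) Then
        pHas = ".pdf" ' May need refinement!
    End If
    theText   = ThisComponent.Text
    theCursor = theText.CreateTextCursorByRange(theText.Start)
    ' Walk through the paragraphs until the cursor reaches the end of the text.
    Do While theText.CompareRegionEnds(theText, theCursor) < 0
        ' Select from the cursor to the end of the current paragraph.
        theCursor.GotoEndOfParagraph(True)
        ' Clear the paragraph if it does not contain the phrase.
        If InStr(theCursor.String, pHas) = 0 Then
            theCursor.String = ""
        End If
        If theCursor.String = "" Then
            ' Empty paragraph: also remove the paragraph break that follows it.
            theCursor.GoRight(1, True)
            theCursor.String = ""
        Else
            ' Paragraph kept: step over the break to the next paragraph.
            theCursor.GoRight(1, False)
        End If
    Loop
    ' If the last paragraph ended up empty, remove the break in front of it.
    theCursor.GotoStartOfParagraph(True)
    If theCursor.String = "" Then
        theCursor.GoLeft(1, True)
        theCursor.String = ""
    End If
End Sub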

You can find it demonstrated in the attached text document.


answered 2018-04-10 22:19:22 +0100

Happy User

updated 2018-04-10 23:01:20 +0100

Thank you Lupp! That was helpful.

Using your hint, I made a small Visual Basic program that can read a text file into a string array. Each element in the array that contains ".pdf" is then appended to a new, initially empty array.

The new array is saved to disk as a new text file.

As a result, the output file contains only strings with ".pdf" in them.

Job done! Thanks for pointing me in the right direction.
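The Visual Basic program itself was not posted. For readers who want to try the same idea, here is a rough sketch in Python of the steps described: read the lines, keep those that contain the phrase, write them to a new file. The function names, the case-insensitive comparison, and the default encoding are my assumptions, not part of the original program.

```python
from pathlib import Path

def keep_pdf_lines(lines, phrase=".pdf"):
    """Return only the lines that contain `phrase` (comparison ignores case: an assumption)."""
    needle = phrase.lower()
    return [line for line in lines if needle in line.lower()]

def filter_file(src, dst, phrase=".pdf", encoding="utf-8"):
    """Copy the lines of `src` that contain `phrase` to `dst`; return how many were kept.

    The encoding is an assumption. Lists written by the Windows dir command
    may use a different code page, so adjust it if names look garbled.
    """
    lines = Path(src).read_text(encoding=encoding).splitlines()
    kept = keep_pdf_lines(lines, phrase)
    # One path per line, with a final newline only if there is any output.
    Path(dst).write_text("\n".join(kept) + ("\n" if kept else ""), encoding=encoding)
    return len(kept)

# In-memory demonstration with made-up paths.
demo = [r"D:\Books\title.pdf", r"D:\Books", r"D:\Books\SCAN.PDF", r"D:\Books\notes.txt"]
print(keep_pdf_lines(demo))
```

Calling filter_file("list.txt", "pdf_only.txt") would then process one of the lists (both file names are placeholders). Like the original, this keeps any line that merely contains ".pdf", so a name such as report.pdf.bak would also be kept.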
