I am hoping that someone can help me with a way of semi-automating a task I run at least fifty times a day, usually many more than that.
I use LibreOffice Writer with the add-on altsearch.oxt, which is an excellent tool for searching and replacing within text.
I receive documents from many different sources and they are each formatted differently … and usually wrong for our purposes. Some are emailed and have a line length of 72 with a paragraph break at the end of each line. Some have indents at the beginnings of paragraphs: some indents will be tabs, some will be a number of space bar presses, some will be random and unequal space bar presses. Some have random double spaces throughout the document. Some use line breaks instead of paragraph breaks … they are all different and usually wrong for what we need, or even what we ask for.
We have asked for and insisted upon standard (for us) formatting, but it is not likely to happen. Realistically, I have to correct each submission.
So, in LibreOffice Writer, I save each submission as an ASCII text document. I then run a series of search-and-replace operations to clean each document so that it is in a standard form that we can work from.
As an example:
Step 1. I search for double paragraph breaks and replace them with a nonsense series of characters unlikely to be found in the document.
Replace “\p\p” with “XXXXXX”
Step 2. I then search for single paragraph breaks at the end of each line and replace them with a space.
Replace “\p” with " "
Step 3. I then search for any double spaces and replace them with a single space. I often have to run this search and replace several times in a row to remove all of the double spacing. Occasionally a submission will have 5 or 10 spaces marking the paragraph indent, and I continue to do this until the result is 0 replaced.
Step 4. I then Replace “XXXXXX " with “XXXXXX” to remove any leading spaces from a paragraph. Likewise, I Replace " XXXXXX” with “XXXXXX” to eliminate any trailing spaces at the end of a paragraph.
Step 5. I will then Replace “XXXXXXXXXXXX” with “XXXXXX” to eliminate any extraneous paragraph breaks. This, too, may need to be repeated several times until the result is 0 replaced.
Step 6. Finally, I will restore the document. Replace “XXXXXX” with “\p\p”
I now have a plain text document which is exactly formatted the way we need it to begin to process it. These steps each need to be performed in the same order and several of them (Step 3 and Step 5 in this example) may have to run several times.
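To make the "repeat until the result is 0 replaced" part of Steps 3 and 5 concrete, here is a minimal sketch in Python. It is only an illustration and assumes the text is handled outside Writer as an ordinary string; the function name `replace_until_stable` is made up for this example and is not part of AltSearch or LibreOffice.

```python
def replace_until_stable(text, old, new):
    """Repeat a plain search-and-replace until a pass finds nothing to replace.

    Returns the resulting text and the number of passes that replaced something.
    """
    if old in new:
        # Guard: a replacement that re-creates the search text would never finish.
        raise ValueError("replacement text must not contain the search text")
    passes = 0
    while old in text:
        text = text.replace(old, new)
        passes += 1
    return text, passes


# Step 3 on a paragraph indented with ten spaces.
cleaned, passes = replace_until_stable(" " * 10 + "Indented.", "  ", " ")
print(repr(cleaned), passes)
```

With ten leading spaces the loop needs four passes (10, then 5, 3, 2, 1 spaces), which matches the experience of pressing Replace All several times by hand.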
Is there a way I can run this as an automated process that repeats Steps 3 and 5 as often as needed?
Replace “\p\p” with “XXXXXX”
Replace “\p” with " "
Replace "  " with " " [may need to be repeated many times]
Replace “XXXXXX " with “XXXXXX”
Replace " XXXXXX” with “XXXXXX”
Replace “XXXXXXXXXXXX” with “XXXXXX” [may need to be repeated several times]
Replace “XXXXXX” with “\p\p”
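For illustration, the whole sequence above could be sketched as a small Python script. This is a rough sketch under stated assumptions, not an AltSearch batch: it assumes the submission has already been saved as plain text, that "\n" stands in for AltSearch's \p paragraph break, and that "XXXXXX" never occurs in the real text. The names `RULES` and `clean_submission` are hypothetical.

```python
PLACEHOLDER = "XXXXXX"  # the same nonsense marker used in the steps above

# (search, replace, repeat until nothing is left), in the order listed above.
# Assumption: "\n" plays the role of \p, the paragraph break.
RULES = [
    ("\n\n", PLACEHOLDER, False),             # protect real paragraph breaks
    ("\n", " ", False),                       # join hard-wrapped lines
    ("  ", " ", True),                        # collapse runs of spaces
    (PLACEHOLDER + " ", PLACEHOLDER, False),  # strip leading spaces
    (" " + PLACEHOLDER, PLACEHOLDER, False),  # strip trailing spaces
    (PLACEHOLDER * 2, PLACEHOLDER, True),     # drop extra paragraph breaks
    (PLACEHOLDER, "\n\n", False),             # restore paragraph breaks
]


def clean_submission(text):
    """Apply every rule in order, repeating the flagged ones until 0 replaced."""
    for old, new, repeat in RULES:
        text = text.replace(old, new)
        while repeat and old in text:
            text = text.replace(old, new)
    return text


raw = "First line of the\nparagraph,  wrapped.\n\n\n\n   Second  paragraph\nhere. \n\nThird."
print(clean_submission(raw))
```

Because the rules are just data, adding another condition (tabs used as indents, for example) would mean adding one more line to the list rather than another manual step. Note that, like the manual steps, this does not touch spaces at the very start or end of the document.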