Deleting text ranges

Dhaen · June 26, 2021, 3:22am

1) INPUT:

A list of URLs downloaded with “wget -O” (nearly 500 web pages from a single website) containing text information, which is converted to ODT.

2) OUTPUT:

I want a way to delete all %SITE_FOOTER% + %SITE_HEADER% text blocks (SITE_HEADER and SITE_FOOTER may each consist of several paragraphs).

3) ISSUE (COMMON CASE):

I have two recurring “strings” in my document (at this point they are LO Writer paragraphs, for example String1=="####" and String2=="@@@@"). I want to delete all “text ranges” that start with String1 and end with String2.

4) (?TEMPORARY?) SOLUTION:

(?s)(?<=####).*?(?=@@@@) is a Search//Replace regex that works in (for example) Sublime Text but does NOT work in LO Writer. BUT with this approach, the conversion to ODT happens AFTER the solution is applied.
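For reference, the pre-conversion step could be scripted instead of run by hand in an editor. This is only a minimal sketch: it assumes the quantifier in the pattern is the lazy `.*?`, that the downloaded pages are available as plain text, and that "####" and "@@@@" are the actual markers. The function name `strip_ranges` and the `keep_markers` switch are made up for illustration.

```python
import re


def strip_ranges(text, start="####", end="@@@@", keep_markers=True):
    """Delete everything between each start marker and the next end marker."""
    s, e = re.escape(start), re.escape(end)
    if keep_markers:
        # Lookarounds match only the text between the markers,
        # so the markers themselves stay in place.
        pattern = f"(?<={s}).*?(?={e})"
    else:
        # Consume the markers together with the text between them.
        pattern = f"{s}.*?{e}"
    # re.DOTALL is the flag form of (?s): "." also matches newlines,
    # so a range may span several lines.
    return re.sub(pattern, "", text, flags=re.DOTALL)


# Hypothetical sample standing in for one downloaded page.
sample = "intro\n####\nmenu item\nfooter line\n@@@@\nbody text\n"
print(strip_ranges(sample))
print(strip_ranges(sample, keep_markers=False))
```

Running this over each downloaded file before converting to ODT would reproduce the Sublime Text workflow; it does not change what Writer's own Find & Replace can do.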

---------------

5) OPTIONAL QUESTION:
Is it possible in LO Writer to search by regex for text fragments at the JUNCTION of two consecutive paragraphs? (For example: ^.*####$%CRLF%^@@@@.*$ , where %CRLF% is the paragraph break or delimiter.)

ajlittoz · June 26, 2021, 6:18am

If your goal is to remove header and footer from pages (and I assume you want to do that on every page), your idea of using Find & Replace is faulty. The flaw in your idea is to suppose that header and footer text is “standard text” interspersed inside the text of discourse flow.

Header and footer are composed only once and stored separately from main text flow. They are dynamically inserted into the output (screen or printer) when needed.

To remove them, you just need to dispose of the special storage. This storage is associated with the page style(s). You didn’t describe the conversion process, in particular if it provides a single page style across the converted document or one page style per resulting converted page.

To list the page styles, display the style side pane (F11 or Styles>Manage Styles if not yet visible). Click on the fourth icon from the left in the side pane toolbar to display the page styles. Select Applied Styles from the drop-down menu at the bottom. For every page style do:

right-click on style name and Modify
in Header tab, uncheck Header on
in Footer tab, uncheck Footer on
OK
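If there are many page styles, the same steps could be scripted as a Python macro. The following is a rough sketch, not a tested macro: the function names are invented for illustration, and using `isInUse()` as a stand-in for the Applied Styles filter is an assumption.

```python
def disable_headers_and_footers(doc, only_in_use=True):
    """Turn off header and footer on the page styles of a Writer document.

    `doc` is assumed to be a Writer document model, for example the object
    returned by XSCRIPTCONTEXT.getDocument() inside a LibreOffice macro.
    Returns the names of the page styles that were changed.
    """
    page_styles = doc.getStyleFamilies().getByName("PageStyles")
    changed = []
    for i in range(page_styles.getCount()):
        style = page_styles.getByIndex(i)
        if only_in_use and not style.isInUse():
            continue  # skip styles that no page currently uses
        # Same effect as unchecking "Header on" / "Footer on" in the dialog.
        style.setPropertyValue("HeaderIsOn", False)
        style.setPropertyValue("FooterIsOn", False)
        changed.append(style.getName())
    return changed


def remove_page_headers_footers(*args):
    """Macro entry point; XSCRIPTCONTEXT is supplied by LibreOffice at run time."""
    disable_headers_and_footers(XSCRIPTCONTEXT.getDocument())
```

Try it on a copy of the document first, since it changes every page style that is in use.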

Regarding your item 5), except for a special case, a Writer regexp has paragraph scope only. It is applied only to the contents between two paragraph marks, exclusive. The special case searches for an empty paragraph as ^$ and allows you to delete it. The AltSearch extension works differently and does not have this limitation.

Note also that a paragraph mark is not text. It is not encoded as CR + LF. I.e. you suppose that text is the primary data with embedded breaks, while the internal representation is reversed: a set of paragraph objects, each one containing text. This means you cannot search text the same way as in a text editor. Find or Find & Replace are implemented in such a way as to give the illusion that it works the same. This also explains why some possibilities of text editors can’t be offered in Writer.
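To illustrate the difference, here is a toy model only, with a plain Python list of strings standing in for the paragraph objects. It is not Writer's API, and the function name and markers are hypothetical. Once the document is seen as a sequence of paragraphs, deleting a range means dropping whole paragraphs between two marker paragraphs, rather than matching across a line-break character.

```python
def drop_marked_ranges(paragraphs, start="####", end="@@@@"):
    """Return the paragraphs with every start..end block removed.

    The marker paragraphs themselves are removed too. A block that is
    opened but never closed swallows everything up to the end of the list.
    """
    kept = []
    inside = False
    for text in paragraphs:
        if not inside and text.strip() == start:
            inside = True       # block opens: drop the start marker
        elif inside:
            if text.strip() == end:
                inside = False  # block closes: drop the end marker
        else:
            kept.append(text)   # ordinary paragraph outside any block
    return kept


# Hypothetical document: body text with one converted site menu block.
doc_paragraphs = ["Article title", "####", "Home | About", "Contact", "@@@@", "Body text"]
print(drop_marked_ranges(doc_paragraphs))
```

No "paragraph break" character appears anywhere in this model, which is why a pattern such as %CRLF% has nothing to match against.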


Dhaen · June 27, 2021, 10:14am

OK, yes, I understand.

01) My question was not about LO Writer document headers//footers (Russian interface caption: “верхние//нижние колонтитулы”), but about WEBSITE headers//footers, which the ODT conversion turns into repetitive text blocks (website menu elements and so on) within the main document body.

02) I know about the AltSearch extension, but my experience with it dates back 2-3 years. Does it work properly with LO 7.1.3.2? (Besides, I also keep a standalone copy of LO 5.3 specifically for it.)

03) Are there any alternatives to the AltSearch extension, something closer to bringing the functionality of bash text manipulation tools (sed//awk//…) into LO?

ajlittoz · June 27, 2021, 10:19am

I have no experience with WriterWeb because I prefer to generate my pages with dedicated applications.
AltSearch is still actively maintained, but I don’t use it.
Built-in Edit>Find & Replace has a basic regexp capability. See built-in help for details.