Find by single indent value

I have a 400-page document which was converted from a PDF file and is therefore completely messed up in all kinds of ways. There is some application of paragraph styles, but it’s random and erratic, and everything has the most ridiculous direct-formatting overrides. I’m trying to apply the base paragraph styles without having to go through 400 pages and apply them to each paragraph individually.

The most consistent feature of the messed-up formatting is the paragraph indenting (left indent and/or first-line indent, but not right indent, which is all over the shop and more or less varies for every paragraph in the document), so I’m trying to do some fairly simple finds to apply styles in bulk… but it doesn’t seem to be possible.

In the Find and Replace dialog, you can enter various indenting options in Format → Indents & Spacing, but the issue is that if I enter a value in “First line” and leave the others blank, the other two options (“Before text” and “After text”) are force-filled to a value of zero when I hit “OK”, which completely defeats the purpose when the whole point is that the other values vary. (Word does the same thing, which was the reason I tried in LO in the first place.)

I’ve installed the AltSearch extension and tried various permutations of the syntax shown in this answer, but like the asker of that question, I cannot get AltSearch to work properly. Searching for [:::ParaFirstLineIndent=0400::] finds all instances of the paragraph style I want to apply (which has a 0.4 cm first-line indent), but doesn’t find any of the hundreds of instances of text in other paragraph styles with a 0.4 cm first-line indent. I don’t know if AltSearch is getting auto-applied zero values for the other indent values as well.

How can I do this? Can I do this?
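
For what it's worth, a macro can test one indent property and simply never look at the others, which would sidestep the zero-filling in the dialog. Below is a minimal sketch in Python, assuming the body text is reachable through `doc.Text` and that the paragraph properties `ParaFirstLineIndent` and `ParaStyleName` can be read and written as attributes (values in 1/100 mm, so 0.4 cm is 400). The function names and the `tolerance` parameter are invented for illustration, and text inside tables and frames is not visited.

```python
# Sketch of a Writer macro helper. Indents are in 1/100 mm (0.4 cm == 400).
# Function names and the tolerance value are illustrative assumptions.

def paragraphs_with_first_line_indent(doc, target, tolerance=5):
    """Return body paragraphs whose first-line indent is within `tolerance`
    of `target`. Left and right indents are never inspected."""
    matches = []
    enum = doc.Text.createEnumeration()
    while enum.hasMoreElements():
        element = enum.nextElement()
        # Tables are enumerated alongside paragraphs; skip anything
        # that is not a paragraph.
        if not element.supportsService("com.sun.star.text.Paragraph"):
            continue
        if abs(element.ParaFirstLineIndent - target) <= tolerance:
            matches.append(element)
    return matches


def apply_style_by_first_line_indent(doc, target, style_name):
    """Apply `style_name` to every matching paragraph; return the count."""
    matches = paragraphs_with_first_line_indent(doc, target)
    for para in matches:
        para.ParaStyleName = style_name
    return len(matches)
```

Inside LibreOffice, `doc` would typically come from `XSCRIPTCONTEXT.getDocument()`. Whether the direct-formatting overrides survive the style change is something to check on a copy of the file first.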

Reorganising a converted PDF is a real nightmare. The first thing to do is to restore paragraph logic: in a PDF, every line is its own paragraph; there is no link or relation to surrounding lines. Unfortunately there is no automated way to do this, because only semantics (significance) can tell whether two consecutive lines are part of the same paragraph.

One possible way to avoid the manual work would be to submit your PDF to some OCR utility, which would certainly do a substantial part of the task.

I don’t think relying on “geometric indents” is a workable idea. IMHO, it is better to treat your text as unformatted: remove all attributes so that it really is. Once the paragraph structure has been correctly rebuilt in the previous step, apply paragraph styles. I know, this means reading the 400+ pages, but you can probably select large areas in pages and apply Body Text. Headings are easily identified and they can be styled Heading n (don’t forget to remove any existing numbering so that automatic, consistent numbering takes over).

This assumes that you have thought carefully about the paragraph styles you’ll need in your final document. This is a very important step if you want to minimise your pain. You’ll also need character styles for intra-paragraph formatting.

The difficulty of the task comes from the fact that PDF is a “visual” format meant for display, while Writer works at the semantic level, where styles “tag” the significance and importance of text. Formatting comes last, as a consequence of the markup, and can be tuned independently of the text once it has been “tagged”.

The PDF was originally exported from InDesign and converted using Adobe’s cloud-based PDF conversion engine, so paragraphs are at least mostly correctly connected, rather than being chopped up line-by-line (sadly the new PDF-to-InDesign conversion doesn’t work on the file, probably because it was exported from InDesign CS3, which I’m guessing is just too old to be supported).

Headings are actually correctly identified, so those are no problem. The trouble is that nearly all the remaining text paragraphs already have Body Text applied, which of course they shouldn’t have. Paragraphs immediately following headers, figures, captions, tables, blockquotes or spaces should be Normal, but other plain text paragraphs should have Normal Indent applied, and of course blockquotes should have Blockquote applied. These three types of paragraphs are in fact (mostly) consistently distinguished in the file, but only by indentation (first-line indent with no left margin = Normal Indent; left margin = Blockquote).
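
The rule just described is simple enough to write down as a function. The following is only a sketch under assumptions: indents are in 1/100 mm, as LibreOffice's paragraph properties `ParaFirstLineIndent` and `ParaLeftMargin` report them; the `tolerance` of 0.05 mm is an arbitrary guess to absorb conversion noise; and the function names are made up. Only paragraphs currently styled Body Text are touched, and the right indent is never read.

```python
from collections import Counter


def classify_by_indent(first_line, left_margin, tolerance=5):
    """Map indents (1/100 mm) to one of the three style names.

    left margin                     -> Blockquote
    first-line indent, no margin    -> Normal Indent
    neither (or a hanging indent)   -> Normal
    """
    if left_margin > tolerance:
        return "Blockquote"
    if first_line > tolerance:
        return "Normal Indent"
    return "Normal"


def restyle(paragraphs, only_style="Body Text"):
    """Restyle paragraph-like objects in place; return a count per style.

    Each object needs ParaStyleName, ParaFirstLineIndent and ParaLeftMargin
    attributes. Paragraphs not styled `only_style` are left alone, so
    headings and other already-correct styles are not overwritten.
    """
    applied = Counter()
    for para in paragraphs:
        if para.ParaStyleName != only_style:
            continue
        new_style = classify_by_indent(para.ParaFirstLineIndent,
                                       para.ParaLeftMargin)
        para.ParaStyleName = new_style
        applied[new_style] += 1
    return applied
```

Checking the left margin first means an indented paragraph inside a blockquote still counts as Blockquote; the returned counts give a quick sanity check against the rough proportions expected for the book.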

Removing all attributes and starting from scratch is not an option in this case – there are only about six or seven different paragraph styles required for the whole document, but an enormous amount of varying character-level overrides that I would have to manually reapply by visual comparison with the PDF. These are already a headache because they’re inconsistent (including in the PDF – the original layouter clearly didn’t have their character styles properly worked out), and I’ll have to go through them and apply styles manually; if I stripped formatting, I’d have to compare page by page with the PDF in order to even identify them, which would be an absolute nightmare.

Additionally, being an academic book, the ratio is about 60% Normal Indent, 20% Normal and 20% Blockquote (not counting a few rarer styles that only show up in a few places), all interlaced; so bulk selection would rarely go beyond half a page, which is what I’m trying to avoid.

Oh – a lucky break!

The first thing I did was to simply import the file into InDesign to do the clean-up there (because the search options there are far superior to Word or LibreOffice), but the text was completely messed up, with half the word spaces just disappearing and entire paragraphs being squished into single words.

After doing some initial clean-up in Word and LibreOffice, I just tried importing the file into InDesign again now, and this time the spaces were all correctly imported, so something must have helped.

That thankfully means I can do the rest of the preparation in InDesign, where searching for a single indent option is easy as pie.