How to convert lots of old text with old capitalization

Hi, would anyone know if LibreOffice is the right tool for the task?

I have to convert lots of old text with old capitalization of words that are no longer capitalised.

I have tried converting all text to lowercase and then making every sentence start with a capital letter.

So far so good!

But it still leaves me with lots of instances where a sentence starts with dialogue and the first letter after the opening quotation mark (") stays lowercase.

Furthermore, I have been wondering whether it is possible to exclude a list of words from being changed in the first conversion to lowercase, so that I don’t have to capitalise them manually afterwards.

Is LibreOffice the right tool for the task?

Thanks in advance

Sóren

For a thorough answer, we need more information about the documents.

  • are they “plain” text or do they also contain some formatting which is important for you (such as bold words, numbered lists, chapter headings identified in some form or other, …)?

  • are they stored in files under some conventional format (such as *.doc(x) M$ Word, .odt ODF, some legacy word processor, …)?

  • are you able to characterize the various contexts requiring different actions? are these contexts “orthogonal” to each other?

  • is there ambiguity in your specification (e.g. one of the contexts is a perfect prefix for another one)?

If formatting (first point) is not important, I feel that a syntax-driven macro-generator could do the job better. By “syntax-driven”, I mean something being able to select transformation rules based on local context (identification with pattern matching, then a set of rules with or without regular expressions).
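
To give a rough idea of what “rules selected by local context” could look like, here is a minimal sketch in Python (my choice for illustration only; any scripting language would do). The two patterns and their order are assumptions, not a complete grammar: everything is lowercased first, then each rule re-capitalises the context it recognises, including a sentence that opens with a quotation mark.

```python
import re

# Each rule is (pattern, action). Patterns and order are illustrative
# assumptions, not a full description of the language.
RULES = [
    # Context 1: very first letter of the text, optionally after an opening quote.
    (re.compile(r'^(["“]?)(\w)'),
     lambda m: m.group(1) + m.group(2).upper()),
    # Context 2: first letter after . ! or ? (optional closing quote),
    # whitespace, and an optional opening quote.
    (re.compile(r'([.!?]["”]?\s+["“]?)(\w)'),
     lambda m: m.group(1) + m.group(2).upper()),
]

def apply_rules(text):
    """Lowercase everything, then let each rule re-capitalise its context."""
    text = text.lower()
    for pattern, action in RULES:
        text = pattern.sub(action, text)
    return text

sample = 'HE LEFT. "give me that," SHE SAID. "NO," he replied.'
print(apply_rules(sample))
# He left. "Give me that," she said. "No," he replied.
```

Note that rule 2 would also capitalise “he” in a sequence like `"No!" he replied.`, because it cannot tell whether the exclamation mark ends the whole sentence. That kind of case is where more context, or proofreading, is needed.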

Hello @Sóren, maybe a title like “How to convert lots of old text with old capitalization” would be better and could help other users with the same problem.

There probably isn’t a foolproof way to do this in an automated process. You won’t get around a lot of proofreading.

The old texts are plain text from a .txt document, but opened and converted in Writer.
I will have to proofread in the end, but it would be super nice if I didn’t have to correct too much…
I’m on Ubuntu, so which syntax-driven program would you recommend?

Thanks

Sóren

.txt documents are the most favorable case. The idea is to work on .txt first and convert in the end.

GNU/Linux offers awk and m4. To some extent, sed could also be used (with pain). But I am not satisfied with these and I am working on a more versatile solution. If you’re interested, contact me on ajlittoz (at) users (dot) sourceforge (dot) net and I’ll send you a specification.

An example OLD text could be:
… “Give me X, Y and Z,” the Computer demanded from Henry, “or I will do a,b and c!” Bla bla Bla. Bla Bla bla.

And the NEW result should be something like:
… “Give me x, y and z,” the computer demanded from Henry, “or I will do a,b and c!” Bla bla bla. Bla bla bla.

What I did so far was convert all text to lowercase and then use the “automatic capitalisation after full stop” in Writer, but that leaves me stranded with a lot of errors to correct when proofreading…

Sincerely
Sóren

And you probably want “a, b and c” instead of “a,b and c”.

My idea is to be more specific, trying to decompose a sequence into a grammar-based structure. Then, if a word is capitalized where grammar tells it should not be, lowercase it. Of course, proper nouns should not be lowercased. This can be done by looking up a table of exceptions, as you mentioned.
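
A very reduced version of that idea, sketched in Python purely as an illustration: the only “grammar” here is “a word is sentence-initial if it is the first word of the text or follows . ! or ?”, and the exception table (`EXCEPTIONS`) is a hypothetical two-entry example. Words that are capitalized anywhere else and are not in the table get lowercased; everything else is left untouched.

```python
import re

# Hypothetical exception table: words that keep their capital letter.
EXCEPTIONS = {"Henry", "I"}

WORD = re.compile(r"\w+")
SENTENCE_END = re.compile(r"[.!?]")

def fix_case(text, exceptions=EXCEPTIONS):
    """Lowercase words that are neither sentence-initial nor in the table."""
    out = []
    pos = 0
    sentence_start = True              # the first word of the text starts a sentence
    for m in WORD.finditer(text):
        gap = text[pos:m.start()]      # punctuation/whitespace before this word
        if SENTENCE_END.search(gap):
            sentence_start = True
        word = m.group(0)
        if not sentence_start and word not in exceptions:
            word = word.lower()
        out.append(gap + word)
        pos = m.end()
        sentence_start = False
    out.append(text[pos:])             # trailing punctuation
    return "".join(out)

old = ('… “Give me X, Y and Z,” the Computer demanded from Henry, '
       '“or I will do a,b and c!” Bla bla Bla. Bla Bla bla.')
print(fix_case(old))
```

On the OLD sample from this thread, this should reproduce the NEW line. It does not capitalise anything by itself, and it will be fooled by abbreviations ending in a full stop, so it is a starting point rather than a finished tool.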

A difficulty may arise with phrase-inserts like “the computer demanded”. Does it end “Give me … Z” and start a new sentence with “or I will … c!”? Or is this a continuation of “Give me …”?

Apart from that, this seems quite straightforward (at least for the specification) as long as the language does not use “exotic” Unicode characters. If your text is pure ASCII, no problem. If it is in Latin-1, a little more work but still easy. But if it is unlimited, you must define what a capitalization is.
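
For what it is worth, a language with Unicode-aware strings takes care of most of this. The lines below are a small Python illustration, assuming the text is decoded into a proper string first; `is_capitalised` is a hypothetical helper showing one possible definition of “capitalized”, not the only one.

```python
def is_capitalised(word):
    """One possible definition: the first character is an uppercase letter."""
    return word[:1].isupper()

# Bytes from a Latin-1 file must be decoded before any case handling.
raw = b"\xc6ble og \xd8l"            # "Æble og Øl" encoded as Latin-1
text = raw.decode("latin-1")
print(text.lower())                  # æble og øl

print(is_capitalised("Ærø"), is_capitalised("ærø"))   # True False

# Case mappings are not always one-to-one: German ß uppercases to two letters.
print("straße".upper())              # STRASSE
```

The last line is the kind of case where you have to decide for yourself what a capitalization is.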

The problem is a change in spelling.

Old books in Scandinavian languages had all nouns capitalized!

But a negative list of all nouns would be quite a compilation.

I thought it would be easier to start from all lowercase letters and make rules for what should be capitalized?

Proper nouns are easily searched and replaced!
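
That search-and-replace step could be sketched like this in Python (an illustration only; the list `PROPER_NOUNS` is a hypothetical example and would have to be compiled for the actual books). It assumes the text has already been lowercased and restores each listed name wherever it appears as a whole word.

```python
import re

# Hypothetical list of proper nouns to restore after lowercasing.
PROPER_NOUNS = ["Henry", "København"]

def restore_proper_nouns(text, names=PROPER_NOUNS):
    """Replace each lowercased name, matched as a whole word, by its proper form."""
    for name in names:
        pattern = r"\b" + re.escape(name.lower()) + r"\b"
        text = re.sub(pattern, name, text)
    return text

print(restore_proper_nouns("henry went to københavn."))
# Henry went to København.
```

One caveat: a name that is also a common word (Bjørn, for instance, also means “bear”) would be capitalised everywhere, so such names still need a look during proofreading.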


I think this is becoming a bit too specific for this site, and I feel more and more that Writer is not the right tool. Contact me on ajlittoz (at) users (dot) sourceforge (dot) net (replace the descriptions in parentheses with the real characters; this is a poor anti-spam measure) so that we can continue in private.