Preserve paragraph/line spacing when converting rich text to plain text

MBSquared · May 13, 2012, 10:41pm

Is there a way to convert a document with rich text (.doc, .odt, .rtf) to plain text (.txt) AND keep the original paragraph alignment?

It doesn’t have to be perfect. Keeping only the alignment of each paragraph’s first line by means of spaces or tabs would be OK. But, if using spaces/tabs, is there a way to distinguish between the tabs/spaces that are part of the original document and those that are merely used to preserve the original alignment?

For example, assume a rich text document has indentations created NOT using spaces or tabs, but only the indentation settings on the ruler, like this:

[image not preserved]

As alluded to above, one possible solution would be to write a script that identifies the indentation of each line (i.e. 0.5", 1", 1.5"…) and inserts the number of spaces needed to fill that length.

Assuming this is the only method, it is probably not possible to completely preserve the formatting (i.e. only the first line would have the correct spacing, but upon word wrap, spacing would be thrown off).
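
To make the idea concrete, here is a minimal Python sketch of the space-filling step. It assumes the indent of each paragraph has already been read from the document somehow (that part is not shown), and that 10 monospace characters correspond to one inch; `CHARS_PER_INCH` and `indent_paragraph` are made-up names for illustration. Hard-wrapping the text at a fixed width is one way to keep the continuation lines aligned, as long as the viewer does not re-wrap the lines itself.

```python
import textwrap

CHARS_PER_INCH = 10  # assumption: 10 monospace characters per inch


def indent_paragraph(text, left_in=0.0, first_line_in=0.0, width=72):
    """Hard-wrap one paragraph, emulating ruler indents with spaces.

    left_in is the left indent of the whole paragraph and first_line_in
    is the extra indent of its first line, both in inches.
    """
    body = " " * round(left_in * CHARS_PER_INCH)
    first = " " * round((left_in + first_line_in) * CHARS_PER_INCH)
    # textwrap.fill applies one prefix to the first line and another
    # prefix to every wrapped continuation line.
    return textwrap.fill(text, width=width,
                         initial_indent=first, subsequent_indent=body)


# A 0.5" ruler indent becomes five leading spaces.
print(indent_paragraph("A. America as a Nation of Immigrants", left_in=0.5))
```

This does not answer the question of telling original spaces apart from added ones; it only shows the spacing arithmetic.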

Any ideas?

Thanks,
Matthew B.

SveinníFelli · May 21, 2012, 2:16pm

I’m unaware of an existing/direct method of saving such a text document from any word processor. As you seem to know, no fancy formatting is retained in .txt files. You’d have to use some kind of Search/Replace. I don’t know the functions of programs like Antiword, which may support such replacement features during import.

The first method that comes to mind is to save the document as HTML, which gives code like this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<HTML>
<HEAD>

<STYLE TYPE="text/css">
<!--
	@page { margin: 2cm }
	P { margin-bottom: 0.21cm }
	H1 { margin-bottom: 0.21cm }
	H1.western { font-family: "Liberation Serif", serif }
	H1.cjk { font-family: "SimSun" }
	H1.ctl { font-family: "FreeSans" }
	P.indent-first line { text-indent: 0.4cm }
	P.indent-2 { margin-left: 0.6cm }
-->
</STYLE>
</HEAD>
<BODY LANG="en-EN" DIR="LTR">

<H1 CLASS="western">Immigration Law Service, 2d </H1>
<P STYLE="margin-left: 0.40cm">Chapter 1.</P>
<P STYLE="margin-left: 0.60cm">Summary</P>
<!-- above: indentation via rulers; below: indentation via styles -->
<P CLASS="indent-first line">I. Background History and Records</P>
<P CLASS="indent-2">Research References</P>
<P CLASS="indent-2">A. America as a Nation of Immigrants</P>


</BODY>
</HTML>

After this you can use Search/Replace in a text editor to swap each opening tag for the matching number of spaces, for example:

replace <P CLASS="indent-first line"> with "  " (2 spaces)

If you have defined a simple set of 2-6 indentation types, this should not be too tedious using a “Replace All” command. The result then needs some cleaning (replacing end tags like </P> with a newline character, etc.). Take care not to mangle the encoding and special characters; I think LibreOffice does not convert them to entities IF the HTML file is UTF-8 encoded.
The replacement process can surely be scripted for longer documents with a shell script, and such a script probably exists somewhere on the web (the problem is finding one which does exactly what you want).
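
As a rough illustration of such a script, here is a minimal sketch in Python (rather than shell) using only the standard library. It assumes HTML shaped like the sample above: indents appear either as an inline `margin-left` value in cm or as a class name. The scale `CM_PER_SPACE`, the `CLASS_INDENT` table, and the function names are all made up for this example and would need adjusting to the actual document.

```python
import re
from html.parser import HTMLParser

CM_PER_SPACE = 0.2  # assumption: one space per 0.2 cm of indentation
# Hypothetical mapping from exported class names to a number of spaces.
CLASS_INDENT = {"indent-first line": 2, "indent-2": 3}
BLOCK_TAGS = ("p", "h1")


class IndentedText(HTMLParser):
    """Collect <P>/<H1> blocks as plain-text lines with leading spaces."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.lines = []
        self._indent = None  # None while outside a block element
        self._buf = []

    def _flush(self):
        # Emit the block collected so far, with whitespace normalised.
        if self._indent is not None:
            text = " ".join("".join(self._buf).split())
            self.lines.append(" " * self._indent + text)
        self._indent, self._buf = None, []

    def handle_starttag(self, tag, attrs):
        if tag not in BLOCK_TAGS:
            return
        self._flush()  # copes with a missing </P>
        a = dict(attrs)  # HTMLParser lowercases tag and attribute names
        spaces = CLASS_INDENT.get(a.get("class") or "", 0)
        m = re.search(r"margin-left:\s*([\d.]+)cm", a.get("style") or "")
        if m:  # an inline ruler indent overrides the class default
            spaces = round(float(m.group(1)) / CM_PER_SPACE)
        self._indent = spaces

    def handle_data(self, data):
        if self._indent is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self._flush()


def html_to_indented_text(html):
    parser = IndentedText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


sample = ('<P STYLE="margin-left: 0.40cm">Chapter 1.</P>\n'
          '<P CLASS="indent-2">Research References</P>')
print(html_to_indented_text(sample))
```

Text outside of paragraph and heading tags (the stylesheet, stray notes between paragraphs) is dropped, and each paragraph ends up on a single line, so any wrapping would be a separate step.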

MBSquared · April 10, 2013, 4:40am

Thank you! I had gotten busy and forgotten all about this post. Great response. I wish I could vote it up, but it says I need 15+ points to vote up.

mariosv · May 14, 2012, 12:26am

It may be easier to simply save as a .txt file:

Menu: File > Save As, with Type = Text (.txt) (*.txt)
or Text Encoded (.txt) (*.txt)
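
For many files, the same export can probably be done without the menus. The sketch below wraps LibreOffice's headless `--convert-to` option in Python; it assumes a `soffice` binary is on the PATH, and the function names are made up for this example. Like the menu route, this produces plain text, so ruler-based indents are not expected to survive.

```python
import shutil
import subprocess


def build_convert_command(path, outdir="."):
    """Return the argv for a headless LibreOffice export to plain text."""
    return ["soffice", "--headless", "--convert-to", "txt:Text",
            "--outdir", outdir, path]


def convert_to_txt(path, outdir="."):
    # Only attempt the conversion when a soffice binary is available.
    if shutil.which("soffice") is None:
        raise RuntimeError("soffice not found on PATH")
    subprocess.run(build_convert_command(path, outdir), check=True)
```

Calling `convert_to_txt("chapter1.odt", "out")` would then write `out/chapter1.txt`, where `chapter1.odt` is just a placeholder file name.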

AlexKemp · October 16, 2015, 1:45am