
Preserve paragraph/line spacing when converting rich text to plain text [closed]

asked 2012-05-14 00:41:00 +0200

MB Squared

updated 2012-05-14 00:44:54 +0200

Is there a way to convert a document with rich text (.doc, .odt, .rtf) to plain text (.txt) AND keep the original paragraph alignment?

It doesn't have to be perfect. Keeping only the alignment of each paragraph's first line by means of spaces or tabs would be OK. But if spaces or tabs are used, is there a way to distinguish the ones that are part of the original document from those that are merely there to preserve the original alignment?

For example, assume a rich text document has indentations created NOT using spaces or tabs, but only the indentation settings on the ruler, like this:

[three images showing the indented paragraphs; not preserved]

As alluded to above, one possible solution would be to write a script that identifies the indentation of each line (i.e., .5", 1", 1.5"...) and inserts the number of spaces needed to fill that length.

Assuming this is the only method, it is probably not possible to completely preserve the formatting (i.e., only the first line would have the correct spacing; once the text word-wraps, the spacing would be thrown off).
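The script idea above can be sketched in Python. This is only a rough illustration, not a tested converter: the function name `indent_paragraph`, the constant `SPACES_PER_INCH`, and the 10-characters-per-inch ratio are all assumptions made for the example. It hard-wraps each paragraph so that wrapped lines keep the same indent as the first line, and the optional `pad_char` lets the padding use a character other than a normal space, which is one possible way to tell added padding apart from spaces already in the document.

```python
import textwrap

# Assumption: a monospaced layout at 10 characters per inch.
SPACES_PER_INCH = 10


def indent_paragraph(text, indent_inches, width=72, pad_char=" "):
    """Hard-wrap a paragraph and prefix every line with ruler-equivalent padding.

    pad_char defaults to a plain space; passing a distinct character
    (for example "\u00a0", a no-break space) marks the padding as added.
    """
    pad = pad_char * round(indent_inches * SPACES_PER_INCH)
    # The same prefix goes on the first line and on every wrapped line.
    return textwrap.fill(
        text, width=width, initial_indent=pad, subsequent_indent=pad
    )


if __name__ == "__main__":
    print(indent_paragraph("A. America as a Nation of Immigrants", 1.0, width=40))
```

Because the wrapping is done by the script rather than by the viewer, the result only stays aligned when the text file is displayed in a monospaced font and is at least `width` columns wide.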

Any ideas?

Thanks, Matthew B.


Closed by Alex Kemp for the following reason: the question is answered, right answer was accepted
close date 2015-10-16 03:45:54.743516

2 Answers

1

answered 2012-05-21 16:16:06 +0200

Sveinn í Felli

I'm unaware of an existing/direct method of saving such a text document from any word processor. As you seem to know, no fancy formatting is retained in .txt files. You'd have to use some kind of Search/Replace. I don't know the functions of programs like Antiword, which may support such replacement features during import.

The first method that comes to mind is to save the document as HTML, which gives code like this:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

<HTML>
<HEAD>

<STYLE TYPE="text/css">
<!--
    @page { margin: 2cm }
    P { margin-bottom: 0.21cm }
    H1 { margin-bottom: 0.21cm }
    H1.western { font-family: "Liberation Serif", serif }
    H1.cjk { font-family: "SimSun" }
    H1.ctl { font-family: "FreeSans" }
    P.indent-first line { text-indent: 0.4cm }
    P.indent-2 { margin-left: 0.6cm }
-->
</STYLE>
</HEAD>
<BODY LANG="en-EN" DIR="LTR">

<H1 CLASS="western">Immigration Law Service, 2d </H1>
<P STYLE="margin-left: 0.40cm">Chapter 1.</P>
<P STYLE="margin-left: 0.60cm">Summary</P>
<!-- above: indentation set via the ruler; below: indentation set via styles -->
<P CLASS="indent-first line">I. Background History and Records</P>
<P CLASS="indent-2">Research References</P>
<P CLASS="indent-2">A. America as a Nation of Immigrants</P>


</BODY>
</HTML>

After this you can use Search/Replace in a text editor to replace the paragraph tags with spaces, for example:

replace <P CLASS="indent-first line"> with "  " (2 spaces)

If you have defined a simple set of 2-6 indentation types, this should not be too tedious using a "Replace All" command. The result then needs some cleaning (replacing end tags such as </P> with a newline character, and so on). Take care not to mess up the encoding and special characters; I think LibreOffice does not convert them to entities IF the HTML file is UTF-8 encoded. For longer documents the replacement process can surely be scripted, for example with a shell script, and such a script probably exists somewhere on the web (the problem is finding one that does exactly what you want).
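As a rough sketch of how that replacement could be scripted, here is a small Python version built on the standard library's `html.parser`. It is an illustration under stated assumptions, not an existing tool: the names `IndentedText`, `html_to_indented_text`, `CLASS_INDENT_CM`, and `CM_PER_SPACE` are made up for this example, the class table is copied by hand from the stylesheet in the sample above, and the ratio of one space per 0.2 cm is chosen only so that 0.4 cm becomes the 2 spaces suggested earlier.

```python
import re
from html.parser import HTMLParser

# Assumption: one space stands in for 0.2 cm of indentation.
CM_PER_SPACE = 0.2

# Indents copied by hand from the sample stylesheet (hypothetical table).
CLASS_INDENT_CM = {"indent-first line": 0.4, "indent-2": 0.6}

# Matches inline indents such as STYLE="margin-left: 0.40cm".
INLINE_INDENT = re.compile(
    r"(?:margin-left|text-indent)\s*:\s*([\d.]+)\s*cm", re.IGNORECASE
)


class IndentedText(HTMLParser):
    """Collect <h1>/<p> text, prefixing each block with spaces for its indent."""

    def __init__(self):
        super().__init__()
        self.lines = []
        self._indent = None  # None means "not inside a block we keep"
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("p", "h1"):
            return
        attrs = dict(attrs)
        cm = CLASS_INDENT_CM.get(attrs.get("class") or "", 0.0)
        match = INLINE_INDENT.search(attrs.get("style") or "")
        if match:  # an inline style overrides the class-based indent
            cm = float(match.group(1))
        self._indent = round(cm / CM_PER_SPACE)
        self._parts = []

    def handle_data(self, data):
        if self._indent is not None:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if tag in ("p", "h1") and self._indent is not None:
            text = " ".join("".join(self._parts).split())
            self.lines.append(" " * self._indent + text)
            self._indent = None


def html_to_indented_text(html):
    parser = IndentedText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)
```

Calling `html_to_indented_text(open("export.html", encoding="utf-8").read())` on a saved export (the file name is hypothetical) returns one line per heading or paragraph. Text outside `<P>` and `<H1>` elements, including the stylesheet, is dropped.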


Comments

Thank you! I had gotten busy and forgotten all about this post. Great response. Wish I could vote it up, but it says I need 15+ points to vote up.

MB Squared (2013-04-10 06:40:55 +0200)
0

answered 2012-05-14 02:26:47 +0200

m.a.riosv

updated 2012-05-14 02:31:21 +0200

An easier way may be to save the document as a .txt file:

Menu: File > Save As, with the file type set to "Text (.txt)" or "Text Encoded (.txt)"

