UTF-8 file loses page break

mxav1111 · December 30, 2023, 8:34am

Hi there,
I was wondering whether anyone has faced this issue, or whether it should be filed as a bug.

A UTF-8 file contains a page break character (0x0C), i.e. a form feed.

If we simply open the file as UTF-8 and save it again as UTF-8, the page break is removed and pretty much replaced with CRLF (I think).

I am using LibreOffice version 7.6.4.1 on Windows 10 (English).

Can someone please look into this? Thanks for your help.

mikekaganski · December 30, 2023, 8:42am

Note that Writer is not a plain text editor. It imports any file into a Writer document model (think of it as if it converts into “ODF”, which is an approximation); and then when you save, it converts back from its data into the target document format. With this double conversion, you can easily end up with data that differs from what you had initially. Working with any non-native file format, you should be prepared for such changes. E.g., here you likely get U+000C imported into Writer as a paragraph with a page break property.

Having said that, it might be a reasonable enhancement request to implement exporting a paragraph with an explicit page break as a dedicated character, when the encoding supports it.

ajlittoz · December 30, 2023, 8:45am

What is the format of your original file? .txt?

An office document is conventionally encoded in such a way as to describe typography and layout, which goes far beyond just text. Consequently, U+000C FORM FEED is considered a control character, not a page boundary.

What are you trying to do? If you want to format a document, don’t try to outwit Writer. Layout instructions are not trivial. Follow the rules and use the program as it is intended. To stuff specific characters into a file, without consideration for character semantics or usage, use a text editor.

mxav1111 · December 30, 2023, 12:04pm

Thank you and Mike for the responses.
Here’s how I am writing the string. Just as a trial I added an additional page break. However, BOTH page breaks are lost.

I am really confused. Why would the page break not be preserved when CR LF are maintained?
Is there a way to confirm this?

Here’s how I am creating/opening and appending to the document using Python 3.8. The document is expected to contain multilingual UTF-8 strings as it continues to grow. You will notice two page breaks… the second page break, written in a different notation, is just for testing.

with open(output_fl, "a+", encoding="utf-8") as f:
    # "\f" and "\u000C" are the same character (FORM FEED), so two page breaks are appended
    f.write(annotation['text'] + "\f" + "\u000C")
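One way to confirm whether the form feeds are really in the file, before and after Writer has saved it, is to look at the raw bytes instead of trusting what an editor displays. The following is only a minimal sketch; `count_form_feeds` and `count_form_feeds_in_file` are hypothetical helper names, not part of any library.

```python
def count_form_feeds(data: bytes) -> int:
    """Return the number of FORM FEED bytes (0x0C) in raw file content."""
    # In UTF-8, U+000C is encoded as the single byte 0x0C, and that byte
    # never occurs inside a multi-byte sequence, so counting bytes is safe.
    return data.count(b"\x0c")


def count_form_feeds_in_file(path: str) -> int:
    """Read a file in binary mode and count its FORM FEED bytes."""
    with open(path, "rb") as f:
        return count_form_feeds(f.read())
```

Running `count_form_feeds_in_file` on the file written by the Python script and again on the copy saved from Writer shows directly whether the characters were dropped.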

Wait… one more thing. I added a page break using the simple menu command (Alt+I for Insert, then Page Break, i.e. Ctrl+Enter), and then saved the file as .txt (using the “choose encoding” file type and choosing UTF-8). Then I closed LibreOffice completely and reopened the same file… the page break that I had inserted explicitly in LibreOffice Writer was NOT there. Instead there are more CR LFs (appearing as empty blank lines).

Thanks again for all your help.

ajlittoz · December 30, 2023, 12:45pm

I think you’re a bit confused about what kind of file Writer creates. Its native format is not plain text. A document is represented as a sophisticated XML structured description. Formatting, including breaks, is encoded by XML tags with all necessary information to lay out the text (which is only a small amount of the total description for the final result) in a page, be it a sheet of paper or a screen.

You then understand easily that a page break is not simply a U+000C FORM FEED, but a more complex object that also gives directives about what happens with the break (such as a change of page geometry = page style, page numbering restart and other actions).

Even CR LF are not recorded in the internal representation because they have no meaning in the ODF semantics. Only a paragraph break is significant. Paragraph contents may span several lines and line boundaries are dynamic because they depend on paper size, indents, font size and weight, … Therefore a line break may occur anywhere and these break positions (within paragraph text) vary according to typographical attributes.

Internally, Writer does not use FF, CR or LF.

Apparently, you want to use Writer as a text editor (to create plain text = UTF-8 files). Writer is not intended for this. As @mikekaganski mentioned, a plain text file undergoes one conversion on open, to translate it into the internal format with internal semantics where possible (so there are approximations), and another on save, to translate it to the external format. In your assumed case (plain text file creation), all formatting information is stripped off, keeping only text and likely only printable characters. Paragraph breaks will be translated to CR, LF or both (may depend on the platform). All remaining formatting is omitted, including page breaks, whether manually added or not.
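To make the double conversion more concrete, here is a toy model in Python. It is emphatically not Writer's actual import/export code; the `Paragraph` class and the two functions are hypothetical and only illustrate the idea that a form feed can become a paragraph property on import, and that a property with no plain-text counterpart is dropped on export.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Paragraph:
    text: str
    page_break_before: bool = False  # a property of the paragraph, not a character


def import_plain_text(text: str) -> List[Paragraph]:
    """Toy import: line ends delimit paragraphs, a form feed sets a property."""
    paragraphs = []
    for page_index, page in enumerate(text.split("\f")):
        for line_index, line in enumerate(re.split(r"\r\n|\r|\n", page)):
            # The first paragraph after a form feed carries the page break.
            starts_page = page_index > 0 and line_index == 0
            paragraphs.append(Paragraph(line, page_break_before=starts_page))
    return paragraphs


def export_plain_text(paragraphs: List[Paragraph], newline: str = "\r\n") -> str:
    """Toy export: only the text is written; page_break_before is ignored."""
    return newline.join(p.text for p in paragraphs)


original = "page one\r\n\fpage two\r\n"
model = import_plain_text(original)
round_tripped = export_plain_text(model)
```

In this model the break information exists while the document is loaded (`model[2].page_break_before` is `True`), but `round_tripped` no longer contains `"\f"`, because the export step has no rule for writing that property back as a character.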

If your ultimate goal is to study ODF, retrieve the standard describing the normalised XML encoding. But, anyway, don’t try to fiddle with any file encoding. It is really complex. Instead use Writer and learn styles which allow you to drive and control the formatting process in a rigorous, systematic and “professional” way.

Otherwise, if you want to stuff any Unicode code point into a text file, use a text editor.

Villeroy · December 30, 2023, 4:35pm

menu:File>Open…
Select the file regardless of name suffix.
File type: “Text - choose encoding (*.txt)”
Before the file is opened, a dialog pops up where you can choose encoding and line feed.

mxav1111 · December 30, 2023, 4:51pm

Yes, that’s exactly what I am doing.
It is reproducible at will. Opening the file with the “choose encoding” filter as UTF-8, and even explicitly saving it the same way as UTF-8, removes or converts the form feed/page break.
It quietly changes UTF-8 text files without any warning, error or notification. It is broken.

mikekaganski · December 30, 2023, 4:56pm

It is correct and expected. I already explained that. But …

It warns you about formatting / data that cannot be exported to the external format, unless you have disabled that dialog, which is an “I know what I’m doing” act.

No. But as mentioned, enhancement requests are welcome.

mxav1111 · December 30, 2023, 7:44pm

Thank you, Mike. I may have missed it, probably because I was using Save As…

Another thing that I haven’t investigated: saving increased the file size by 50 to 100 bytes… I don’t know what is causing it, but it happens consistently, even though the requested input and output formats are the same and literally nothing in the file was changed. If that is the right thing to do, then I guess we as users should be given more information about exactly what is being put into the file that didn’t exist before, maybe through some kind of diagnostic menu option or similar.
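The size difference can be examined outside Writer by profiling the raw bytes of the file before and after saving. This is only a sketch: `byte_profile` is a hypothetical helper, and the things it counts (a UTF-8 byte order mark, line-ending styles, form feeds) are guesses at plausible causes, not something confirmed in this thread.

```python
import codecs


def byte_profile(data: bytes) -> dict:
    """Summarise the bytes that most often differ between two text saves."""
    crlf = data.count(b"\r\n")
    return {
        "size": len(data),
        "has_utf8_bom": data.startswith(codecs.BOM_UTF8),  # EF BB BF, 3 bytes
        "crlf": crlf,
        "bare_lf": data.count(b"\n") - crlf,  # LF not preceded by CR
        "bare_cr": data.count(b"\r") - crlf,  # CR not followed by LF
        "form_feed": data.count(b"\x0c"),
    }


def profile_file(path: str) -> dict:
    """Read a file in binary mode and return its byte profile."""
    with open(path, "rb") as f:
        return byte_profile(f.read())
```

Comparing `profile_file` for the original and for the saved copy shows, for example, whether every LF became CR LF (one extra byte per line) or whether a three-byte BOM was added at the start.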

As far as I can tell, page breaks are the only thing that makes it easy to compare text converted from a scanned PDF side by side with the original, in order to fix text that was messed up during scanning and OCR. Without page breaks, it is just so hard.

Another side question if you don’t mind…

Does Writer have the ability to visually compare two files, say one PDF and one text file, with the help of synchronized scrolling? (This is where a page break can provide great assistance, because it could potentially be the common link between the text file and the PDF file.)

I will soon proceed with the enhancement request.
Thanks for getting involved Mike and helping me out.

mxav1111 · December 30, 2023, 8:41pm

Enhancement request logged:
https://bugs.documentfoundation.org/show_bug.cgi?id=158938
Here are similar issues logged for RTF and Excel file formats:
116753 – When saving a Writer file as RTF, page break is lost; see also bug 155722.