Many common text editors don’t seem to handle UTF-8 (without BOM), but MS Word at least recognizes the encoding and offers conversion. LO should have the ambition to be better and handle this without conversion, I think. Don’t you?
I found the following question in a superuser.com forum:
“Do you have any idea why LibreOffice is going against the Unicode Standard by adding the BOM for UTF-8? Quote: ‘Use of a BOM is neither required nor recommended for UTF-8’. This makes it extremely hard to use such files with other programs (e.g. pdfLaTeX in my case).”
Thanks, /ML
The person at superuser.com misinterprets the standard. The standard doesn’t say that using a BOM is prohibited. “Not recommended” doesn’t mean “wrong”; it means that the standard makes no recommendation. That a BOM is explicitly allowed is evident from Table 2-4 of the standard (the same in v.5 and v.9), and the reason to use a BOM is given in the next part of the same sentence that was so selectively cited: “… but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature”. Further information is given in “Special Areas and Format Characters” (16.8 in v.5 of the standard, 23.8 in v.9).
It’s absolutely meaningless to use a BOM in a context where the data is known to be encoded in UTF-8, because byte order in this encoding is unambiguous. This is the case if you use, say, a protocol that only uses UTF-8, like XMPP. That’s why the standard, which covers the implementation details of UTF-8, doesn’t give explicit recommendations on when to use a BOM: the reasons to use one are outside the scope of the standard.
But LibreOffice Writer’s main use is to create and open documents. The most general situation it aims to serve is one where a document is created in one place (using one application), then somehow reaches another user and is opened with another application, in a heterogeneous environment and without any accompanying information about the encoding used except what is embedded in the document itself. LibreOffice may be the creating application, the opening one, or both; and nothing can be assumed about either user’s experience with encodings and other computing matters.
In this situation, it’s entirely logical that LO uses the feature that the standard explicitly mentions as suitable for use as a signature, to allow unambiguous detection of the very fact that the document uses UTF-8. This is the direct answer to the part of your question implying that LO “doesn’t follow recommended standards”, which is clearly false. By default, LibreOffice must use the most fail-safe method for data exchange, and treat the power-user case as a non-default special variant.
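To illustrate what “BOM as a UTF-8 signature” means in practice, here is a minimal Python sketch. It is not how LibreOffice is implemented; the function names and the `cp1252` fallback are assumptions chosen only for the example. It checks whether a byte string starts with the three signature bytes EF BB BF and only then treats the content as UTF-8.

```python
import codecs


def has_utf8_signature(data: bytes) -> bool:
    """Return True if the bytes start with the UTF-8 BOM (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def decode_text(data: bytes, fallback: str = "cp1252") -> str:
    """Decode as UTF-8 when the signature is present.

    Without the signature there is nothing in the file that names the
    encoding, so this sketch falls back to a legacy code page
    (hypothetical default, chosen for illustration only).
    """
    if has_utf8_signature(data):
        # Drop the signature itself before decoding the payload.
        return data[len(codecs.BOM_UTF8):].decode("utf-8")
    return data.decode(fallback)
```

With the signature, the bytes `EF BB BF C3 A5` decode to “å”; the same payload without the signature is misread by the fallback as “Ã¥”, which is the kind of failure the signature is meant to prevent.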
Of course, the other part of the question is about the possibility of optionally not using a BOM. This is addressed in the bug mentioned by @Regina, and here patches from interested parties are welcome. The part that touches on what ambition LibreOffice “should have” is clearly a subjective point of view, and isn’t something that should be expected from a community-driven open-source project, where no one person defines the overall policy and goals.
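For anyone who needs BOM-less files right now (for example for pdfLaTeX, as in the superuser.com question), the signature can be removed after saving with a small script. The following Python sketch is only a workaround under stated assumptions: `strip_utf8_bom` is a hypothetical helper name, and it rewrites the file in place, so try it on a copy first.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path: Path) -> bool:
    """Remove a leading UTF-8 BOM from the file, if there is one.

    Returns True when the file was changed, False when it had no BOM.
    The file is rewritten in place, so keep a backup of anything important.
    """
    data = path.read_bytes()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    path.write_bytes(data[len(codecs.BOM_UTF8):])
    return True
```

Usage would be something like `strip_utf8_bom(Path("chapter1.txt"))`, where the file name is just a placeholder.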
“The person at superuser.com is wrong with interpretation of standard. The standard doesn’t say that using BOM is prohibited. ‘Not recommended’ doesn’t mean ‘wrong’, it means that there’s no recommendation in the standard.” No, this is misleading and incorrect. “Not recommended” is an idiomatic way to say in English that something should not be done. Until we have a more oppressive form of government installed in the UK or the US, it is considered impolite to forbid people from doing anything, so outright prohibitions are not used. The BOM is a leftover from ancient times that should have been buried LONG ago. The Unicode Consortium and academia are certainly against using it.
Thanks Regina and Mike for the bug reference and the clarification of the UTF-8, BOM and standards issue!
The difference between my entry into this and the bug entry (and the superuser.com entry) is that I encountered the problem when opening a document encoded in UTF-8 without BOM. Other discussions seem to focus on the ability to export in that format. Nor am I just scratching my own itch; I can handle the problem by converting the encoding with the help of other specialized text editors. My concern is that less technical people than I should preferably also be able to access this kind of document. I hate having to recommend that they turn to MS Word to do it. Couldn’t avoiding this be expected from a community-driven open-source project?
So: Would it be a big issue to allow Writer to cope with documents encoded in UTF-8 (without BOM)?
What can I best do to forward this?
There is no problem with opening UTF-8-encoded text without a BOM. After you have selected the file in the file picker, set the file type to “Text - Choose Encoding” and select UTF-8. In Word (here Word 2010) you get a similar ‘Converting’ dialog both in the ‘with BOM’ case and in the ‘without BOM’ case. In LibreOffice you have to force the dialog only in the ‘without BOM’ case.
Thanks Regina! This is what I was looking for, but didn’t find, and your reply solves my problem.
The only remaining thing for the wish list would then be for LO Writer to identify this kind of file and prompt the user to open it with the relevant filter for UTF-8 without BOM.
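As a rough idea of how a file without a BOM could be identified as UTF-8, one common heuristic is to try a strict UTF-8 decode of the content: text in legacy 8-bit code pages that contains non-ASCII characters very rarely happens to form valid UTF-8 sequences. The Python sketch below illustrates that heuristic only; it is not a description of what LibreOffice or Word actually do, and `looks_like_utf8` is a made-up name.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Heuristic: True if the whole byte string is valid UTF-8.

    Pure ASCII also passes, which is harmless because ASCII is a
    subset of UTF-8. Pass the complete file content: a sample cut in
    the middle of a multi-byte character would be rejected wrongly.
    """
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return False
    return True
```

For example, “åäö” saved as UTF-8 passes the check, while the same text saved in Windows-1252 does not, so an application could use such a test to decide whether to suggest the UTF-8 filter.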
This, unfortunately, doesn’t answer my question. I open a text file through the Windows Explorer context menu. I spent several minutes in vain trying to find how to set the encoding of the file I opened to UTF-8. Nor was I able to find a LibreOffice setting that would make UTF-8 the default encoding of files I open. I have a million such files on my computer, all in UTF-8, and I need a convenient way to open them, view them, and save them, as UTF-8, without unnecessary waste of time or hassle.