Character set when opening a CSV file without a BOM in Calc.

asked 2021-01-13 17:03:21 +0100

jmullen

updated 2021-01-13 17:24:19 +0100

Why isn't confusion of the file associated with character set avoided by either making it clear that the character set as currently initially highlighted is the one confirmed on opening of the previous file or else if that is too complex then simply making the initial text 'Select a character set' implying the current file?


Comments

I am not certain what you mean by “file associated with character set” in a .csv file. A .csv file is a simple text file and contains no formatting information; the character set, field delimiter, etc. need to be “guessed”. That is why a “TEXT IMPORT” or “TEXT EXPORT” panel is shown when you import or export the file.

Perhaps you could clarify what you mean.

petermau ( 2021-01-14 11:39:54 +0100 )

The initial character set highlighted is the one that was previously confirmed when the last file was opened. This is not made explicit either on the interface or in the docs. Consequently I think most users quite naturally assume this char. set is instead the one for the file currently being opened.

I am aware that without a BOM there is no guaranteed way of establishing the correct char. set for the current file. I can therefore see why the developers were attracted to using, as an alternative, the previous file's confirmed char. set. But I don't think this is very intuitive for the average user. It will very likely lead to confirming the wrong char. set when the current file's actual char. set differs from the one confirmed for the previous file.

To avoid mistakes like this would it not be clearer to either explicitly ...(more)

jmullen ( 2021-01-14 13:43:15 +0100 )

A third alternative could be to do as Excel does when there is no BOM: on Windows, use the system's default char. set, which for Western European users is likely to be CP 1252.

jmullen ( 2021-01-14 14:01:20 +0100 )
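The "BOM if present, otherwise the system default" rule from the comment above can be sketched in a few lines of Python. This is only an illustration of the idea, not Excel's or LibreOffice's actual code; the function name `guess_encoding` and its `fallback` parameter are made up for this example.

```python
import codecs
import locale
from typing import Optional

# BOM signatures. The UTF-32 entries come first because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(head: bytes, fallback: Optional[str] = None) -> str:
    """Hypothetical helper: return a codec name for the first bytes of a file.

    A BOM decides if one is present; otherwise the caller's fallback or the
    platform's preferred encoding is returned.
    """
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return fallback or locale.getpreferredencoding(False)
```

For example, `guess_encoding(open(path, "rb").read(4))` would return `"utf-16"` for a file that starts with a UTF-16 BOM, and the locale's preferred encoding for a file with no BOM.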

Excel would not have the problem illustrated in the file seq. below. The files' char. sets are deliberately limited to mostly the system's default non-Unicode one, but occasionally one indicated by a BOM. It is assumed that, after opening each file, the user neither notices nor changes the displayed char. set.

File_1 is opened. It uses the common Windows ANSI non-Unicode char. set and has no BOM. The ignored char. set shows Western Europe (Windows-1252/WinLatin1) because, for the sake of this example, it happens to be the one confirmed for the previous file. The data is fine.

File_2 is opened. It has the less common BOM, and its char. set is UTF16. The ignored char. set shows Unicode (UTF16). The data is fine.

File_1 is reopened. The ignored char. set still shows UTF16 because the current file has no BOM. The previously unaffected data has become unreadable.

jmullen ( 2021-01-14 14:10:20 +0100 )
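The three-file sequence above can be reproduced with a toy model in Python. It assumes the behaviour exactly as described in this thread (a BOM wins; otherwise the last confirmed char. set is reused) and is not LibreOffice's actual import code. The function `open_with_sticky_charset` and the sample data are invented for the illustration.

```python
import codecs

SAMPLE = "Café, 10 €\n"
file_1 = SAMPLE.encode("cp1252")                            # Windows ANSI, no BOM
file_2 = codecs.BOM_UTF16_LE + SAMPLE.encode("utf-16-le")   # UTF-16 with BOM


def open_with_sticky_charset(data: bytes, last_confirmed: str):
    """Toy model: a UTF-16 BOM decides; otherwise reuse the last confirmed charset."""
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        charset = "utf-16"
    else:
        charset = last_confirmed
    # errors="replace" keeps the demo from raising on undecodable bytes.
    return data.decode(charset, errors="replace"), charset


# The user accepts whatever charset is shown each time.
text_1, charset = open_with_sticky_charset(file_1, "cp1252")  # data is fine
text_2, charset = open_with_sticky_charset(file_2, charset)   # BOM -> utf-16, fine
text_3, charset = open_with_sticky_charset(file_1, charset)   # no BOM -> utf-16 reused

print(repr(text_1))
print(repr(text_3))  # no longer the original text
```

In the third call nothing about File_1 has changed; only the remembered char. set has, which is enough to turn the same bytes into unreadable text.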

It doesn't matter whether the user realizes that the value comes from the last attempt or thinks it comes from this file. The field is there to allow choosing in case the user sees a problem (characters are shown wrong in the preview); if they are shown correctly, and the user does not know for sure what the file's actual encoding is, the source of the present value makes no difference. If they know the actual encoding, it is also not a problem, because they will see that the value differs from the correct one. So what actual problem do you intend to solve? Which real-life scenario would be improved by the changes you propose?

Mike Kaganski ( 2021-01-15 10:12:41 +0100 )

In my previous comment I described a real-life problem that happened to my colleagues. The problematic data was way beyond the small preview area. I had to fix the database into which the data had been imported when the problem was eventually exposed weeks later.

I think Excel's usage of the system default char. set when there is no BOM is less likely to give a problem for the average user who doesn't really understand char. sets and just accepts the default shown.

jmullen ( 2021-01-15 10:51:35 +0100 )
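One way to see why a small preview can hide this kind of problem: in ASCII-compatible encodings such as Windows-1252 or UTF-8, only bytes of 0x80 and above depend on the chosen char. set, so a file whose first rows are pure ASCII looks identical under either choice. The Python sketch below finds the first line where the choice starts to matter. The helper name `first_non_ascii_line` is invented for this example, and it does not apply to UTF-16 files.

```python
from typing import Optional


def first_non_ascii_line(data: bytes) -> Optional[int]:
    """Return the 1-based number of the first line containing a byte >= 0x80.

    Returns None if the whole input is plain ASCII. Only meaningful for
    ASCII-compatible encodings (e.g. Windows-1252, UTF-8), not for UTF-16.
    """
    for number, line in enumerate(data.splitlines(), start=1):
        if not line.isascii():
            return number
    return None


# Invented sample: a header, 50 plain-ASCII rows, then one accented name.
sample = b"id,name\n" + b"1,Smith\n" * 50 + "52,José\n".encode("cp1252")
print(first_non_ascii_line(sample))
```

If the returned line number is larger than the number of rows the preview shows, the preview alone cannot reveal a wrong char. set.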

What are you talking about? Your question is "Why isn't confusion of the file associated with character set avoided by either making it clear that the character set as currently initially highlighted is the one confirmed on opening of the previous file or else if that is too complex then simply making the initial text 'Select a character set' implying the current file?". It is worded that way, and I ask you again: how would the "real-life problem" your colleagues had be any different if the text were changed as you suggest? They would see some encoding there with the changed wording; they would see normal text in the preview; and they would happily proceed in exactly the same way.

Mike Kaganski ( 2021-01-15 14:22:10 +0100 )

Thank you for your consideration and continued response, but I think that, after a reasonable but completely unsuccessful attempt to persuade you, I had better bring my efforts to an end.

jmullen ( 2021-01-15 14:32:50 +0100 )