# Misformatted Calc sheet from tab delimited text import?

Hello all,

I've written a C program to extract XML data from the Discogs.com database and write it to a tab-delimited text file intended to create a LibreOffice Calc spreadsheet. However, when I attempt to open the text file in Calc, it doesn't seem to get imported correctly when there are an odd number of quotation mark characters in the text.

The format of the text file is: "field1"TAB"field2"TAB... where TAB = 0x09 and fields can contain the doublequote character and other special characters excluding TAB.

I'm running this Calc version on Windows 7 Home Premium, Service Pack 1: Version: 6.1.0.3 (x64) Build ID: efb621ed25068d70781dc026f7e9c5187a4decd1 CPU threads: 4; OS: Windows 6.1; UI render: default; Locale: en-US (en_US); Calc: group threaded

Attached is the text file I'm opening with a fake file extension (rename it from "csv2.jpg" to "csv2" -- file restrictions here don't let me upload a text file or a no-extension file), and PNG screenshots of the Text Import dialog and its misformatted result. I've looked through the text file and don't believe I have any errors in it. Yet there are some occasions where the TAB character ends up in a cell of the spreadsheet, rather that being parsed as a delimiter to go to the next cell. The reason I like Calc for this project is it handles multiline data properly, putting it one cell when it's doublequoted. Competing products do not; they produce multiple misformatted lines when \n is found in the imported data.

One can see the error in Row 5 (I selected cell E5 to show its content, which contains 3 doublequote marks in addition to the outer delimiters). Row 7 contains 10 doublequote marks in addition to the outer delimiters, and it doesn't have the misformatting error.

Does anyone have an idea how I can import this data correctly, or have an opinion on whether this is a bug? I would think that Calc should be agnostic about what is between the \0x09\" and \"\0x09 delimiters on any field that would map cell data, but clearly it isn't, and unfortunately I don't have control over how many " marks are in the fields. Perhaps I can work around this by using a different string delimiter that's less likely to be in the data, but I wanted to report it in case it's really a bug.

edit retag close merge delete

there is a practical limitation to distinguish between content and delimiters once you allow any character and any combination of them in the content,
it's something near the - proven? - theorem that no language can be described / structured using that! language, you need at least one extra sign (or similar),
(not using string/field delimiters will produce problems with CRLF and TABs inside your data fields, using string delimiters will produce problem with those already present in your data, i'm not sure if 'quoting' will hold for all combinations)
as you have written the extraction program you have your hands on a point where you have access to the content of single fields?, and can steer what's used as separator/delimiter?
either rework the fields at that point to not! contain your separator/delimiter,
or use cryptic combinations that do not ...(more)

( 2020-08-29 09:45:59 +0200 )edit

Sort by » oldest newest most voted

Import from TSV follows the same rules as for CSV, just with a Tab field delimiter instead of a Comma field delimiter, so RFC 4180 applies. Namely the rules about quoted field content with double quotes (called "string delimiter" in Calc's import dialog), 5, 6 and 7; that is if the field delimiter (here Tab) or a newline is contained in a field's data then the field must be quoted and if a quoted field contains a quote that quote must be escaped by doubling it.

Additionally Calc tries hard to accept and correct known errors that multiple bad generators produce, but it can't correct each and every possible error. Generate correct data and it will work.

more

Hi erAck, thank you for this explanation. I was not aware that if the field delimiter appears in a field's data that it must be escaped by doubling it. I just read RFC 4180; again, thank you for the link, and I agree with you that my input data is nonconforming. In fact, I did work around this issue by adding a second " when the count was odd, but I'll now rewrite my code to do the proper escaping instead. Consequently, I accept your explanation as the correct answer. Cheers

( 2020-08-30 01:26:25 +0200 )edit

You have uploaded the .csv file as some .jpg image file on the link "text file I'm opening with a fake file extension ". I just downloaded it, and renamed the .jpg extension to .csv. Your .csv file _contains_ (") signs INSIDE the data and before/after data.

There is not any way to handle it appropriately. You must use other string delimiter or other quote mark inside a data in your export software. Choose a non-used ASCII or UNICODE character. Note/Tip: the curly qoute signs and the straight qoute signs are different characters with different code - in the UNICODE table. Use the straight for the delimiters, and use the curly ones inside the data.

more

Hi Zizi64, thanks for your comment. I clarify that my text file is not .CSV -- it is .TXT and trying to put it in Excel is even worse than LO. Yes, it does contain " inside the " delimited strings -- and it almost imports correctly. In fact, if the number of " marks inside the quoted strings is even, it does import correctly.

There is a way to handle it appropriately, which is to use the TABs to delimit the fields (cells) and use the very first and lastString delimiter to delimit the string regardless of whether it contains another String delimiter. If parsed from opposite ends (and the string adjacent to the tabs has String delimiter at each end), then anything between is the string whose destination is that cell. That may be the LO bug.

Finally, the reason I can't select a different delimiter is that in this data ...(plus)

( 2020-08-28 10:10:58 +0200 )edit

"There is a way to handle it appropriately, which is to use the TABs to delimit the fields (cells) and use the very first and last String delimiter to delimit the string regardless of whether it contains another String delimiter. "

Maybe there is some ways. But it better to use different characters for different functions. The .TSV (Tab Separeted Values, and the .CSV (Comma Separated Values or Character Separated Values) can not distinvish between same characters in different position. For example what about when you want to put a TAB character _into_ a string, or a comma character inside a string in a .csv file?!

(It is better to use the extensions .csv or .tsv instead of the .txt - which means unorganized pure text file type for many applications.)

( 2020-08-28 11:21:34 +0200 )edit

Agree it's better to use different characters, but the input file to my program is 40GB and there are no unused chars in its text fields that I could choose as a delimiter. All characters including TAB and comma are in the data. Best I can do is make sure all delimiters match and everything is TSV. I couldn't use .csv/.txt/etc. because this platform does not allow those to be uploaded. That's why I had to upload as jpg and readers must rename it (I suggested no extension at all). The program I wrote could output any unused character as the String delimiter, but there aren't any unique ones available. Thats why I was wanting a TSV with " as a String delimiter to import, but it does not work properly. It works properly if there is an even number of " marks in any tab-delimited ...(plus)

( 2020-08-28 11:53:43 +0200 )edit

In the import dialog you have left " as String delimiter. As in your file, the content which ends up located in E5 has a TAB between double quotes, the import dialog leaves it as text.

When importing, delete the content of the field String delimiter. Or check the code of your program.

LibreOffice Help on Opening and Saving Text CSV Files.

more

Hi LeroyG, thanks for your suggestion; I will try deleting the content of the String delimiter field.

However, my source file does not contain a TAB between double quotes that was intended to be data. (Technically, it does contain that, but I still think the import parsing is not quite working correctly).

Specifically, for the text data in my source that maps to cell E5, there is a TAB preceding the text and a TAB following the text. The text contains a doublequote (") at the beginning and end of the string (both adjacent to the TAB delimiting that cell). The text also contains a doublequote in the source string corresponding to E5.

It seems to me the problem is that the string that belongs in E5 isn't properly parsed because the doublequote within the string that should be in E5 is interfering with it. Another possibility is the string ...(plus)

( 2020-08-28 06:20:13 +0200 )edit