Hello all,
I’ve written a C program to extract XML data from the Discogs.com database and write it to a tab-delimited text file intended to create a LibreOffice Calc spreadsheet.
However, when I attempt to open the text file in Calc, it doesn’t seem to get imported correctly when there are an odd number of quotation mark characters in the text.
The format of the text file is:
"field1"TAB"field2"TAB…
where TAB = 0x09 and fields can contain the doublequote character and other special characters excluding TAB.
I’m running this Calc version on Windows 7 Home Premium, Service Pack 1:
Version: 6.1.0.3 (x64)
Build ID: efb621ed25068d70781dc026f7e9c5187a4decd1
CPU threads: 4; OS: Windows 6.1; UI render: default;
Locale: en-US (en_US); Calc: group threaded
Attached is the text file I’m opening with a fake file extension (rename it from “csv2.jpg” to “csv2” – file restrictions here don’t let me upload a text file or a no-extension file), and PNG screenshots of the Text Import dialog and its misformatted result. I’ve looked through the text file and don’t believe I have any errors in it. Yet there are some occasions where the TAB character ends up in a cell of the spreadsheet, rather that being parsed as a delimiter to go to the next cell.
The reason I like Calc for this project is it handles multiline data properly, putting it one cell when it’s doublequoted. Competing products do not; they produce multiple misformatted lines when \n is found in the imported data.
One can see the error in Row 5 (I selected cell E5 to show its content, which contains 3 doublequote marks in addition to the outer delimiters).
Row 7 contains 10 doublequote marks in addition to the outer delimiters, and it doesn’t have the misformatting error.
Does anyone have an idea how I can import this data correctly, or have an opinion on whether this is a bug?
I would think that Calc should be agnostic about what is between the \0x09" and "\0x09 delimiters on any field that would map cell data, but clearly it isn’t, and unfortunately I don’t have control over how many " marks are in the fields.
Perhaps I can work around this by using a different string delimiter that’s less likely to be in the data, but I wanted to report it in case it’s really a bug.