Linux csv file has newlines in the middle of a real line because of multi-line cell

Jpezz1 · March 1, 2023, 3:21pm

If a Libre Office Calc exports a csv file and there are cells that are multi-line, then there are newlines in the csv file at the end of every original row but also in every multi-line cell. I am trying to automate an application so manually editing the csv file is out of the question as well as being impossibly difficult on large Calc files.

This is a sample line output from a file I temporarily stored in a file called xxx for test purposes:

3580,"Lantern - Black Hand Lantern","Little Supreme #350","OR","Westchest
Light Co",,"Red Globe",100,"Y","Donated",,,2012,"Display Cabinet 1",,,"yes",

Note the newline makes this look like two lines instead of one. I decided to use sed to get rid of all newlines not immediately preceded by a comma and, in that case, replace the unwanted newline with a space. If there is a newline after a comma, then I want to keep the newline.

My test code looks like this:

cat xxx |sed 's/\([^,]\)|/\1x /'

This should find all newlines preceeded by any character but a comma and replace it with that character followed by a space but it doesn’t work. I read that sed has problems eliminating the last newline on a line but since it can’t tell what the real end of the line is, I tried this:

cat xxx | tr '\n' '|' | sed 's/\([^,]|\)|/\1x /'

This converts all newlines to vertical bars with tr and then uses the same sed but tries to replace any non-comma followed by a vertical bar to the non-comma character followed by a space. Same problem

Now, I know I must be doing something wrong but I can’t figure it out. I started using Unix nearly 40 years ago and used sed a lot but not very much in the last 20 since I retired. This is a volunteer project for a couple of non-profits.

Can someone point out the error in my code or provide an alternate fix that can be easily automated when only the csv file is available?

karolus · March 1, 2023, 3:41pm

I have done a Textimport with your sample-data (with " as TextDelimiter )

and its all ok after import:

Villeroy · March 1, 2023, 3:54pm

Why do you export csv anyway? This makes no sense. Any spreadsheet format would be a better file format for raw table data. Today’s computer users can not handle csv.

mikekaganski · March 1, 2023, 4:08pm

Why not teach that application a proper CSV then? If a cell had several lines, it should be for a purpose. CSV allows newlines in the quoted fields, so making your application CSV-compliant would avoid the need to preprocess CSVs. Or just replace the newlines in Calc before export?

Note also that a newline after a comma is an artifact of your data, where the last record (after the comma) is empty. If it is not empty, the “normal” newline will be nor after the comma.

Villeroy · March 1, 2023, 4:18pm

That application exports proper csv. The newline is within a quoted field. And Calc imports properly if the user knows about the rules.

mikekaganski · March 1, 2023, 4:43pm

Indeed; and that application is called “LibreOffice Calc”.

IIUC, the other application (that @Jpezz1 is automating) imports the CSV.

Villeroy · March 1, 2023, 4:51pm

Got that wrong. “The csv problem” used to be the other way round.

Jpezz1 · March 1, 2023, 5:18pm

I am trying to pull data out of the csv file and put it in separate folders and files. All I need to do is pull out the unwanted newlines with a simple command, not write a whole application to “learn” csv. I have no way to force force users to adapt their way of putting data into the cell for my benefit.

I am NOT trying to import it into Calc or any other existing program or anything that looks like a spreadsheet. I the one case of the specific file, the original is in xlsm format but I can only deal with the text-based csv.

I appreciate any help but asking my “why” is not the issue as I need to get rid of the unwanted newlines. I know there is a simple way in Linux to do that with a short command string. I just don’t remember the exact format I need to use.

sokol92 · March 1, 2023, 5:54pm

You can change the newline character (U+000A) in the cells to any other character before exporting to csv.

Jpezz1 · March 1, 2023, 6:07pm

Not when I don’t have the original file. I expect that a person with not a lot of understanding about exporting files using any options (I am that kind of person) would only have to put the csv file on the target computer and then boot it and the built-in user id would automatically pull out the data needed and display it in a special way all done on a touchscreen monitor. Just the fact that when exporting, they have to remember to check the quote option is enough.
The spreadsheet itself does not require advanced knowledge of spreadsheets. There is little manipulation of data when creating it as it is just rows of information that require just that the data be entered in the proper cell. The only reason for the user putting newlines in the cell is that it helps them view rows better. Asking them to change their procedure ought to be more difficult than a simple command to fix it if it exists.

mikekaganski · March 1, 2023, 6:37pm

You are mistaken if you think it’s simple with a single regex command. I linked the CSV docs for a purpose; you need to understand what to looks for, if you intend to properly remove correct newlines. You need to make sure that you only remove newlines after odd number (1, 3, …) of double quotes throughout the whole file. Note that you must start from the first character, and continue counting. It is not OK to count double quotes on one line, because that one line with 0 double quotes could be in the middle of a long double-quoted multi-line field, and you would need to remove the newline on this line, too, because the overall number of double quotes on previous lines was odd.

Basically, what I wrote still holds true, the only thing that has changed is that now we know that the “other application” that you need to teach “proper CSV” is sed.

A regex could try to count the double quotes using some regex like ^((([^"]*"[^"]*"[^"]*){2})*"[^"]*)\n, and replace first occurrence, then start over … but I am afraid, it could be very inefficient, maybe would hang. I recall some pathological regex patterns that would require infinite time…

Jpezz1 · March 1, 2023, 7:59pm

Why would I care about the quotes? The only newline I care about keeping comes immediately after the comma so if the preceding character is NOT a comma, I want to delete that newline.

The actual line in the csv file looks like this:
3580,"Lantern - Black Hand Lantern","Little Supreme #350","OR","Westchest\nLight Co",,"Red Globe",100,"Y","Donated",,,2012,"Display Cabinet 1",,,"yes",\n

although it doesn’t display the \n when viewing it.

In this case, you can see that the bad newline is after the character t (which varies), but the “good” newline (indicating a real end of line always will have a comma followed by the newline I want to keep. Whatever that character is, I want to replace it by a space which would make the output look like this:

3580,"Lantern - Black Hand Lantern","Little Supreme #350","OR","Westchest Light Co",,"Red Globe",100,"Y","Donated",,,2012,"Display Cabinet 1",,,"yes",

Please explain your need to be concerned with the quotes. Thanks.

mikekaganski · March 1, 2023, 8:34pm

Not mine, and I explained it already. Did you read?

You yourself wrote:

You made it clear that the CSV can be arbitrary. So again: the comma in the end of a record (line) is just one, rather uncommon, case. If you would use it as a condition to not remove the newline (a trivial thing - just use a negative look-behind assertion), you will soon discover what I tried to tell you.

Jpezz1 · March 2, 2023, 3:59pm

As it turned out, my second command line in my first post was almost the solution. It didn’t work for me because I had accidentally put double || between the tr and the sed instead of a single pipe symbol |. I had expected that the final string would need a second tr command because after passing through sed, the end of each line would be a | instead of a \n and this had to be done but since I couldn’t get the sed to work, I was trying to find the problem in my sed command rather than realizing the error was not in sed but between tr and sed. It is already 90% working i.e. I have to take into account some lines with multi-newlines but I know the technique works.
I was confuse as to why quotes were important since I was looking at characters no matter what they were. I was not disputing your assertions but I felt that I didn’t need to be concerned and that is the case for this type of file
I really appreciate your last comment about the trailing comma. In the spreadsheet file, the last column is a picture of the object and, as I didn’t realize, the last comma is not a terminating comma but an indication that the column exists but since it is an image, there is no text. Given that at other museums, their last column is not a picture, I would have to take that into account. Your advice will save me a lot of headaches down the road. I plan, in that case, to have the museum provide me with the number of columns so I can find where the real newline is (i.e. 13 commas for 14 columns so I would indeed have to know odd vs even number of quotes or be able to recognize commas inside the quotes (those would be part of the data) or those outside (those would be separators). For the first cut to demonstrate the feasibility, I will assume the present comma-terminated line.

The part I am working on is just a preliminary to the real application. This part breaks apart the csv file into separate files/folders only when there is a new version of the csv file (the spreadsheet has been updated). Once this is done the application will run every time a visitor wants to pull out the information to learn about the object by using a raspberry pi driven touchscreen.

Thanks to you and everyone else.

mikekaganski · March 2, 2023, 4:14pm

For your case, instead of a regex or requirement to know the number of columns, a simple Python script would be much easier and more robust: which would simply read input CSV character-by-character; count double quotes; output every character to the output copy CSV; and at each newline character would check the current double-quote count, and if it’s odd, instead of outputting this newline, output a space. That’s all logic needed.

A sample csv_remove_newlines.py could look like:

import sys

doublequotes = 0

for line in sys.stdin:
    line_out = ''
    for ch in line:
        if ch == '"':
            doublequotes += 1
        elif ch == '\n':
            if doublequotes % 2 != 0:
                ch = ' '
        line_out += ch
    print(line_out, end='')

and called like

cat path/to/input.csv | python csv_remove_newlines.py > path/to/output.csv