Parsing a csv file using Regex (almost)

verus172 · April 23, 2021, 3:16am

I am successfuly (almost) parsing a csv file using regex. This is a simple version of a more complex structure.

The (csv) text in cell b1 contains like:

c1,c2,c3;d1,d2,d3;e1,e2,e3;f1,f2,f3; (with a comma as field separator and ; as end of record)

To extract each of the data points, I use the following formula

c1 = REGEX(REGEX(b1, “[^;]+”,1), “[^,]+”,1) — first number is CSV row and second is CSV column
c2 = REGEX(REGEX(b1, “[^;]+”,1), “[^,]+”,2)
c3 = REGEX(REGEX(b1, “[^;]+”,1), “[^,]+”,3)
d1 = REGEX(REGEX(b1, “[^;]+”,2), “[^,]+”,1)
d2 = REGEX(REGEX(b1, “[^;]+”,2), “[^,]+”,2)
d3 = REGEX(REGEX(b1, “[^;]+”,2), “[^,]+”,3)
e1 = REGEX(REGEX(b1, “[^;]+”,3), “[^,]+”,1)
e2 = REGEX(REGEX(b1, “[^;]+”,3), “[^,]+”,2)
e3 = REGEX(REGEX(b1, “[^;]+”,3), “[^,]+”,3)
f1 = REGEX(REGEX(b1, “[^;]+”,4), “[^,]+”,1)
f2 = REGEX(REGEX(b1, “[^;]+”,4), “[^,]+”,2)
f3 = REGEX(REGEX(b1, “[^;]+”,4), “[^,]+”,3)

It works with a full data set. The problem occurs when a field in the csv data set is empty. This means that single commas are replaced by double commas when there is a blank csv field.

If there are empty fields in the csv, then the csv will have double commas. Regex in the current form no longer finds the (required) correct number of single commas in the csv which separates the data into the correct number of fields.

I have tried various regex versions, but can’t get it. I was hoping someone could help with a revised regex. Thank you in advance.

I need a Regex command which will find a double comma as two occurrences of a '[comma] and return an empty result.

eeigor · April 23, 2021, 4:56am

And why don’t you want to import the text using ; as the record separator?

verus172 · April 23, 2021, 5:23am

I am a newbie and I welcome being corrected. I don’t know what the protocols are for end-of-field and end-of-record.

In the example above, I am using a , (comma) (end-of-field) and a ; (semi-colon) (end-of-record). I notice csv files generally use commas (end-of-field) and I am guessing a hard return for an end of record (which I can’t create in my setup when creating the csv files).

If changing the end-of-field and end-of-record characters help solve the double comma regex problem I have identified, then I can easily change the end-of-field and end-of-record characters.

mikekaganski · April 23, 2021, 6:20am

Try replacing [^,]+ with (?<=^|,)[^,]*(?=$|,)?

But it’s interesting that you can’t create newlines in your setup when creating the csv files - I’m curious why?

(I’d use ROW()/COLUMN() instead of hardcoding “1”, “2” etc, and for simplicity, used a separate sheet with formulas in A1:Cxxx)

=REGEX(REGEX($Sheet1.$B$1;"[^;]+";;ROW());"(?<=^|,)[^,]*(?=$|,)";;COLUMN())

eeigor · April 23, 2021, 8:36am

It looks like a line separator like “;” is ignored when importing text. Why? Need a line feed code.
But we must move in this direction. What you are trying to do is very ineffective.

verus172 · April 26, 2021, 4:56am

Thanks Mike! I tried replacing [^,]+ with (?<=^|,)[^,]*(?=$|,) as you suggested. It worked! Thank you!! I will have a look at the other suggestions. Thank you everybody!

verus172 · May 5, 2021, 2:23am

I have another use case and I was hoping someone could help me modify the above. Instead of a comma for [end of field] and a ; for end of record, I wanted to change the characters for [end of field] and [end of record].

What would the Regex code be to use EOF (end of field) and EOR (end of record).

I am most grateful in advance