Manually parse CSV with escaped delimiters

From Svendsen Tech PowerShell Wiki
Jump to: navigation, search

I'll be using PowerShell, node.js/JavaScript and Python to demonstrate, but these regexes are .NET-compatible and will work in C#, VB.NET, and other regex-capable .NET languages. They should also work in several other regex implementations, except POSIX and, interestingly enough, Perl. It's likely you can adapt the regexes or use them directly in JavaScript, Ruby and other languages as well. The node.js/JavaScript example seems to work the same way as PowerShell and Python.

Usually, you will want a CSV parser whenever possible, but I believe people will face this problem from time to time, so I decided to write this up. It's a sort of best-effort, half-assed attempt with flaws related to how it handles empty fields (,, or ""). If there are no empty fields, it should work reliably.

I haven't googled the problem before starting on the article - I often avoid that if I can, so I don't know how the official nerd doctrine way of doing it might go, but the method I "developed" works, and I can't think of any situations in which it will not work, off the top of my head. Empty fields are potentially the main problem here. Presumably the nerd doctrine way uses a properly built parser with tokens.

I'll walk through a few situations you might come across, but will start with a quick screenshot demonstration that fast-forwards to the end. This might serve to draw your attention.

Here I pretend to have a need to be doing PowerShell's Import-Csv cmdlet's job, and parse the CSV format PowerShell uses when you Export-Csv, and also built in a check for backslash-escaped commas in the data which you're likely to run across using C#, VB.NET, Python and other languages, in various situations.

Manually-parse-CSV-with-dot-NET-regex-demonstration2.png




Step by step in PowerShell

I start by building a CSV file to see how PowerShell's format handles double quotes in strings - since I already know they use the format:

"Field1","Field2","Field3","FieldN"

It turns out it doubles the double quotes up. This means you can construct a regex following the pattern "double quote, and then either two double quotes in a row or one non-double quote character - repeated greedily until we hit a single double quote". Then repeat that.

However, I start by looking at comma-delimited stuff, to support other formats than the double-quoted PowerShell format.

The regex and results will look something like this:

PS C:\temp> [regex]::Matches($String, '((?:\\,|[^,])*)(?:,(\\,|[^,]*))*?') | `
    foreach { $_.Groups[1].Value }
"CSV field 1"

 "field2\, with\, backslash-escaped delimiters and a little ""surprise"" quote"

 "field3"

But this produces three blank lines/elements. I filter out the empty elements by checking that the line contains non-whitespace. Using a Where-Object and $_.Trim() is a clever way of doing that, since an empty string is false and will then be filtered out by Where-Object.

PS C:\temp> [regex]::Matches($String, '((?:\\,|[^,])*)(?:,(\\,|[^,]*))*?') | `
    foreach { $_.Groups[1].Value } | Where { $_.Trim() }
"CSV field 1"
 "field2\, with\, backslash-escaped delimiters and a little ""surprise"" quote"
 "field3"

In retrospect, it occurred to me that using the quantifier "+" instead of "*" should help clean up results directly, and the old method doesn't handle empty elements anyway.

PS C:\temp> $String = '"CSV field 1", "field2\, with\, backslash-escaped delimiters and a little ""surprise"" quote", "", "field3"'
PS C:\temp> [regex]::Matches($String, '((?:\\,|[^,])+)(?:,(\\,|[^,])+)*?') | `
    foreach { $_.Groups[1].Value }
"CSV field 1"
 "field2\, with\, backslash-escaped delimiters and a little ""surprise"" quote"
"" 
"field4"
 

Using the new regex, and checking how this method copes with empty fields, I get results that seem satisfactory for this use case; it copes with the empty double-quoted element. You can see the count of fields is 4.

PS C:\temp> $String = '"CSV field 1", "field2\, with\, backslash-escaped delimiters and a little ""surprise"" quote", "", "field4"'
PS C:\temp> [regex]::Matches($String, '((?:\\,|[^,])+)(?:,(\\,|[^,])+)*?') | `
    foreach { $_.Groups[1].Value } | `
    foreach { if ($_ -match '"((?:""|[^"])*)"') { $Matches[1] } }

CSV field 1
field2\, with\, backslash-escaped delimiters and a little ""surprise"" quote

field4

PS C:\temp> ([regex]::Matches($String, '((?:\\,|[^,])+)(?:,(\\,|[^,])+)*?') | `
    foreach { $_.Groups[1].Value } | `
    foreach { if ($_ -match '"((?:""|[^"])*)"') { $Matches[1] } }).Count

4

Finally, I tack on an example of replacing double double quotes with single double quotes.

PS C:\temp> ([regex]::Matches($String, '((?:\\,|[^,])+)(?:,(\\,|[^,])+)*?') | `
    foreach { $_.Groups[1].Value } | `
    foreach { if ($_ -match '"((?:""|[^"])*)"') { $Matches[1] } }) -replace '""', '"'
CSV field 1
field2\, with\, backslash-escaped delimiters and a little "surprise" quote

field4

PS C:\temp> 

Python example

Here is my attempt at demonstrating this technique using Python.

>>> string = '"CSV field 1", "field2\, with\, backslash-escaped delimiters and a little ""surprise"" quote", "field3"'
>>> import re
>>> m = re.findall(r'((?:\\,|[^,])*)(?:,(\\,|[^,]*))*?', string); m

[('"CSV field 1"', ''), ('', ''), (' "field2\\, with\\, backslash-escaped delimiters and a little ""surprise"" quote"',
''), ('', ''), (' "field3"', ''), ('', '')]
>>> fields = [field for field in fields if field.__len__() > 0]; fields

['"CSV field 1"', ' "field2\\, with\\, backslash-escaped delimiters and a little ""surprise"" quote"', ' "field3"']

In the end you have a list of the fields, with a leading whitepace on all but the first. Might want to handle that.

To do as I do in PowerShell, and extract the stuff between double quotes with support for doubled-up double quotes in the data, you could do something like this:

>>> for field in fields:
...     m = re.match(r'\s*"((?:""|[^"])*)"\s*', field)
...     if m:
...         print m.groups(1)
...
('CSV field 1',)
('field2\\, with\\, backslash-escaped delimiters and a little ""surprise"" quote',)
('field3',)

JavaScript / Node.js example

Test JavaScript code:

var string = '"CSV field 1", "field2\\, with\\, backslash-escaped delimiters and a little ""surprise"" quote", "", "field4"'
console.log("Before parsing:\n" + string + "\n\n");
var regex = /((?:\\,|[^,])+)(?:,(\\,|[^,])+)*?/g;
//var regex = /([^,]+),?/g;
console.log("Listing fields:\n\n");
while ((r = regex.exec(string))) {
    console.log(r[1]);
}

Output:

PS C:\temp> node .\test.js
Before parsing:
"CSV field 1", "field2\, with\, backslash-escaped delimiters and a little ""surprise"" quote", "", "field4"


Listing fields:


"CSV field 1"
 "field2\, with\, backslash-escaped delimiters and a little ""surprise"" quote"
 ""
 "field4"