In this challenge, given a CSV file as a string, you'll return the data contained as a 2D array of strings.
Spec:
- The input consists of one or more records, delimited with \r\n (CRLF), \n (line feed), or some other reasonable newline sequence of your choice.
- A record consists of one or more fields, delimited with , (commas), ; (semicolons), \t (tabs), or some other reasonable delimiter of your choice. You may assume each record contains the same number of fields.
- A field can either be quoted or un-quoted. An un-quoted field will not contain a field delimiter, newline sequence, or quote character, but may otherwise be an arbitrary non-empty string (within reason; you can assume things like invalid UTF-8, null characters, or unpaired surrogates won't appear if doing so is justifiable).
- A quoted string will be delimited on both sides by a quote character, which will be " (double quotes), ' (single quotes), ` (backticks), or some other reasonable delimiter of your choice. This character is escaped within the string by being repeated twice (the unquoted string a"bc would be represented as "a""bc"). Quoted strings may contain field delimiters, newline sequences, escaped quote characters, and any other sequences of characters which may appear in an unquoted field. Quoted strings will always be non-empty.
- You may choose to ignore the first record (as if it were a header record).
Output:
You may represent a 2D array in any reasonable format (excluding CSV or any of its variants, for obvious reasons).
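As a point of reference (not part of the challenge text or the scoring): with , as the field delimiter and " as the quote character, this spec is what Python's csv module parses by default. A minimal sketch under that assumption:
import csv
import io

# Reference sketch, assuming ',' as the field delimiter and '"' as the quote
# character. csv.reader's defaults (delimiter=',', quotechar='"',
# doublequote=True) already handle quoted fields, embedded newlines, and the
# doubled-quote escape described in the spec; the other allowed choices map to
# keyword arguments such as delimiter=';' or quotechar="'".
def to_grid(text):
 return [row for row in csv.reader(io.StringIO(text))]

print(to_grid('"abc",xyz\n"a""bc",""""')) # [['abc', 'xyz'], ['a"bc', '"']]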
Scoring:
This is code-golf, so shortest answer (in bytes) per language wins.
Test cases:
In:
1,2
abc,xyz
Out:
[["1","2"],["abc","xyz"]]
In:
"abc",xyz
"a""bc",""""
",",";"
"
"," "
Out:
[["abc","xyz"],["a\"bc","\""],[",",";"],["\n"," "]]
In:
"\
,"""
Out:
[["\\\n,\""]]
10 Answers
JavaScript (Node.js), 113 bytes
s=>s.match(/"([^"]|"")*"|\n|[^,\n]+/g).map(v=>v<' '?t.push(r=[]):r.push(v.replace(/^"|"("?)/g,'1ドル')),t=[r=[]])&&t
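For readers who don't want to decode the golf, here is a rough transcription of the same idea in Python (an illustration of the technique, not the answer itself): one regex pulls out quoted fields, bare newlines, and unquoted fields in order; a newline token starts a new record, and everything else is unescaped and appended to the current one.
import re

def parse_csv(s):
 # One pattern matches quoted fields, newlines, and unquoted fields;
 # commas fall between matches and are skipped automatically.
 rows = [[]]
 for tok in re.findall(r'"(?:[^"]|"")*"|\n|[^,\n]+', s):
 if tok == '\n': # a newline token starts the next record
 rows.append([])
 elif tok.startswith('"'): # quoted: strip outer quotes, collapse ""
 rows[-1].append(tok[1:-1].replace('""', '"'))
 else: # unquoted field, used as-is
 rows[-1].append(tok)
 return rows

print(parse_csv('1,2\nabc,xyz')) # [['1', '2'], ['abc', 'xyz']]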
PowerShell, 15 bytes
ConvertFrom-Csv
Finally, something where a rather verbose language like PowerShell can make use of a built-in function: ConvertFrom-Csv covers all the tricky cases.
The test cases in the TIO link have an added header record, based on "You may choose to ignore the first record (as if it is a header record)".
Output in the TIO is shown both as a table and as JSON to demonstrate that the input was correctly parsed.
- And they say Factor isn't competitive, love it (south, Apr 13, 2023)
Retina 0.8.2, 48 bytes
mM!`"([^"]|"")*"|[^,¶]+|$
"(([^"]|"")*)"
1ドル
""
"
Try it online! Retina doesn't have an array type, so I'm just listing each string on its own line with records double-spaced from each other. Explanation:
mM!`"([^"]|"")*"|[^,¶]+|$
Extract all of the fields, plus insert a blank line at the end of each record.
"(([^"]|"")*)"
1ドル
Remove the outer quotes from quoted fields.
""
"
Collapse the doubled inner quotes back to single quotes.
39 bytes in Retina 1:
mL$`"(([^"]|"")*)"|([^,¶]+)|$
1ドル3ドル
""
"
Try it online! Explanation: Retina 1's $ flag makes a substitution of the listed matches, collapsing the first two stages of my Retina 0.8.2 program into one.
- @tsh Thanks, fixed. (Neil, Apr 14, 2023)
APL (Dyalog APL), 10 bytes
{⎕CSV⍵'S'}
Builtins ftw. If a filename were a valid input format, just ⎕CSV would do. The 'S' indicates that the input comes from a [S]tring.
Bash + ShellShoccar-jpn/Parsrs, 9 bytes
parsrc.sh
Trivial answer.
Test runs
~/wkdir $ parsrc.sh i.1|cat -A
1 1 1$
1 2 2$
2 1 abc$
2 2 xyz$
~/wkdir $ parsrc.sh i.2|cat -A
1 1 abc$
1 2 xyz$
2 1 a"bc$
2 2 ""$
3 1 ,$
3 2 ;$
4 1 \n$
4 2 $
~/wkdir $ parsrc.sh i.3|cat -A
1 1 \\\n,"$
~/wkdir $
Go, 112 bytes
import(."encoding/csv";S"strings")
func f(s string)[][]string{o,_:=NewReader(S.NewReader(s)).ReadAll()
return o}
Uses the CSV reader from the standard library.
- 112 bytes (The Thonnu, Apr 13, 2023)
Haskell + hgl, 66 bytes
g=χ '\''
gk$(g*>my((g<*g)#|nχ '\'')<*g)#|rX"[^,\n]*"$|$ʃ","$|$ʃ NL
Uses ' as a quote delimiter.
Haskell + hgl, 68 bytes (down from 71)
g=χ '"'
gk$(g*>my((g<*g)#|nχ '"')<*g)#|rX"[^,\n]*"$|$ʃ","$|$ʃ NL
Uses " as a quote delimiter.
Explanation
First we set up a parser that matches a single quote character; we are going to use it a bunch of times:
g=χ '"'
Now we build a parser for unquoted fields:
rX"[^,\n]*"
This uses a regex to parse any number of characters that are neither a comma nor a newline.
Next we have quoted fields, which are parsed with:
g*>my((g<*g)#|nχ '"')<*g
g*>...<*g means the field has to start and end with a quote, and that those quotes do not contribute to the value. g<*g parses two quotes but returns a single one. nχ '"' parses anything other than a quote character. We combine them to get a parser which either parses two quotes or a single non-quote character, but never a single quote. We use my to repeat this an arbitrary number of times.
Next we combine these two parsers with #|. #| means that it will try the first one (quoted field) and then try the second if and only if the first failed. That means if it finds a legal quoted field it won't bother with unquoted fields. This makes the field parser.
To make a record parser we just interleave ,s by adding $|$ʃ",". To make the CSV parser we just interleave newlines into our record parser with $|$ʃ NL.
Finally, once we have a parser, we use gk to request the first complete parse.
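To make the grammar concrete, here is a hedged hand-rolled sketch of the same structure in Python. It mirrors the grammar described above; it is not hgl code, and it assumes , as the field delimiter, " as the quote character, bare \n newlines, and well-formed input.
def parse(s):
 # Grammar mirrored from the explanation:
 # quoted = '"' ('""' -> '"' | non-'"')* '"'
 # field = quoted | unquoted; record = field (',' field)*
 # csv = record ('\n' record)*
 rows = [[]]
 i = 0
 while i < len(s):
 c = s[i]
 if c == '\n': # newline separates records
 rows.append([])
 i += 1
 elif c == ',': # comma separates fields
 i += 1
 elif c == '"': # quoted field
 i += 1
 out = []
 while True:
 if s[i] == '"' and i + 1 < len(s) and s[i + 1] == '"':
 out.append('"') # escaped quote: "" yields a single "
 i += 2
 elif s[i] == '"':
 i += 1 # closing quote
 break
 else:
 out.append(s[i])
 i += 1
 rows[-1].append(''.join(out))
 else: # unquoted field runs to the next comma or newline
 j = i
 while j < len(s) and s[j] not in ',\n':
 j += 1
 rows[-1].append(s[i:j])
 i = j
 return rows
For example, parse('"a""bc",""""') gives [['a"bc', '"']], matching the second record of the second test case.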
Reflection
This is very long. There's lots to improve here:
- χ '"' and ʃ"\"" and χ Nl and ʃ NL should all exist. These are common characters for code golf and it's ridiculous that I have to build these parsers manually.
- In an earlier version I used kB<fE to parse any character not in a blacklist. It ended up being shorter to use a regex for this. That means this should definitely exist! And to be honest, I'm a bit shocked it doesn't.
- There should be a version of cM which uses (#|). I really feel like there might be, but I can't find it.
- There should probably be something like sqs x y = x *> y <* x. This comes up pretty frequently.
- Regex does help save a couple of bytes here, but it could save way more. The unquoted field parser is a good deal bloated. The suggestions above would help trim things down, but it would be nice if we could have replaced it with a regex. But we can't, because hgl-regex currently can't do what that parser needs to do. The issue is that an hgl regex can only really take a chunk off the front of a string. It can't manipulate that chunk at all. That means g<*g, a parser which consumes 2 "s but returns only 1, is impossible to express in a regex. I have two solutions to this:
  - Add a _ modifier to the regex which causes the group to be omitted from the output. So (ab)_cd parses abcd but only returns the cd. This could be extended to include not just silencing a particular part but also replacing it with a new value. With this the parser could be:
    rX"\"_(\"_\"|^\")*\"_"
    or, if we chose to use ' as the quote delimiter:
    rX"'_('_'|^')*'_"
    Both options are shorter than the parser we have right now.
  - Add a way to pass a custom parser to be referenced inside a regex:
    rXw(i$>g)"/p(/p\"|^\")*/p"
    or, if we chose to use ' as the quote delimiter:
    rXw(g$>i)"/p(/p'|^')*/p"
    Both would save bytes, but not compared to the last solution; however, this method is way more flexible than the previous solution, since non-regex parsers can be made to do basically anything.
- Is this a custom library? (xnor, Apr 16, 2023)
- @xnor Yes. The auto-formatting doesn't include the library, so I accidentally lost it when I golfed the answer. (Apr 16, 2023)
Python 3, 59 bytes (down from 69)
lambda d:print([*csv.reader(io.StringIO(d))])
import csv,io
-10 with thanks to @Neil
Not sure if the output format is valid, as it does not exactly match the test cases (it uses single quotes around each element). Please comment if not and I can fix it with json.dumps().
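For reference, a hedged usage sketch of the lambda above (the output uses Python's single-quoted repr, as noted):
import csv, io
f = lambda d: print([*csv.reader(io.StringIO(d))])

f('1,2\nabc,xyz') # prints [['1', '2'], ['abc', 'xyz']]
f('"a""bc",""""') # prints [['a"bc', '"']]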
- [*csv.reader(io.StringIO(d))] saves you 10 bytes. (Neil, Apr 13, 2023)
",","
I think that one of them should be parsed as an unquoted ", but I'm not sure which.