In this challenge, given a CSV file as a string, you'll return the data contained as a 2D array of strings.
Spec:
- The input consists of one or more records, delimited with \r\n (CRLF), \n (line feed), or some other reasonable newline sequence of your choice.
- A record consists of one or more fields, delimited with , (commas), ; (semicolons), \t (tabs), or some other reasonable delimiter of your choice. You may assume each record contains the same number of fields.
- A field can either be quoted or un-quoted. An un-quoted field will not contain a field delimiter, newline sequence, or quote character, but may otherwise be an arbitrary non-empty string (within reason; you can assume things like invalid UTF-8, null characters, or unpaired surrogates won't appear if doing so is justifiable).
- A quoted string will be delimited on both sides by a quote character, which will be " (double quotes), ' (single quotes), ` (backticks), or some other reasonable delimiter of your choice. This character is escaped within the string by being repeated twice (the unquoted string a"bc would be represented as "a""bc"). Quoted strings may contain field delimiters, newline sequences, escaped quote characters, and any other sequences of characters which may appear in an unquoted field. Quoted strings will always be non-empty.
- You may choose to ignore the first record (as if it were a header record).
Output:
You may represent a 2D array in any reasonable format (excluding CSV or any of its variants, for obvious reasons).
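As a point of reference (not part of the challenge text or the scoring): with , as the field delimiter and " as the quote character, this spec is what Python's csv module parses by default. A minimal sketch under that assumption:
import csv
import io

# Reference sketch, assuming ',' as the field delimiter and '"' as the quote
# character. csv.reader's defaults (delimiter=',', quotechar='"',
# doublequote=True) already handle quoted fields, embedded newlines, and the
# doubled-quote escape described in the spec; the other allowed choices map to
# keyword arguments such as delimiter=';' or quotechar="'".
def to_grid(text):
 return [row for row in csv.reader(io.StringIO(text))]

print(to_grid('"abc",xyz\n"a""bc",""""')) # [['abc', 'xyz'], ['a"bc', '"']]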
Scoring:
This is code-golf, so shortest answer (in bytes) per language wins.
Test cases:
In:
1,2
abc,xyz
Out:
[["1","2"],["abc","xyz"]]
In:
"abc",xyz
"a""bc",""""
",",";"
"
"," "
Out:
[["abc","xyz"],["a\"bc","\""],[",",";"],["\n"," "]]
In:
"\
,"""
Out:
[["\\\n,\""]]
10 Answers
JavaScript (Node.js), 113 bytes
s=>s.match(/"([^"]|"")*"|\n|[^,\n]+/g).map(v=>v<' '?t.push(r=[]):r.push(v.replace(/^"|"("?)/g,'1ドル')),t=[r=[]])&&t
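For readers who don't want to decode the golf, here is a rough transcription of the same idea in Python (an illustration of the technique, not the answer itself): one regex pulls out quoted fields, bare newlines, and unquoted fields in order; a newline token starts a new record, and everything else is unescaped and appended to the current one.
import re

def parse_csv(s):
 # One pattern matches quoted fields, newlines, and unquoted fields;
 # commas fall between matches and are skipped automatically.
 rows = [[]]
 for tok in re.findall(r'"(?:[^"]|"")*"|\n|[^,\n]+', s):
 if tok == '\n': # a newline token starts the next record
 rows.append([])
 elif tok.startswith('"'): # quoted: strip outer quotes, collapse ""
 rows[-1].append(tok[1:-1].replace('""', '"'))
 else: # unquoted field, used as-is
 rows[-1].append(tok)
 return rows

print(parse_csv('1,2\nabc,xyz')) # [['1', '2'], ['abc', 'xyz']]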
PowerShell, 15 bytes
ConvertFrom-Csv
Finally, something where a rather verbose language like PowerShell can make use of a built-in function: ConvertFrom-Csv covers all the tricky cases.
The test cases in the TIO link have an added header record, based on "You may choose to ignore the first record (as if it is a header record)".
Output in the TIO is shown both as a table and as JSON to demonstrate that the input was correctly parsed.
- And they say Factor isn't competitive, love it (south, Apr 13, 2023)
Retina 0.8.2, 48 bytes
mM!`"([^"]|"")*"|[^,¶]+|$
"(([^"]|"")*)"
1ドル
""
"
Try it online! Retina doesn't have an array type, so I'm just listing each string on its own line with records double-spaced from each other. Explanation:
mM!`"([^"]|"")*"|[^,¶]+|$
Extract all of the fields, plus insert a blank line at the end of each record.
"(([^"]|"")*)"
1ドル
Remove the outer quotes from quoted fields.
""
"
Collapse the doubled inner quotes back to single quotes.
39 bytes in Retina 1:
mL$`"(([^"]|"")*)"|([^,¶]+)|$
1ドル3ドル
""
"
Try it online! Explanation: Retina 1's $ flag makes a substitution of the listed matches, collapsing the first two stages of my Retina 0.8.2 program into one.
- @tsh Thanks, fixed. (Neil, Apr 14, 2023)
APL (Dyalog APL), 10 bytes
{⎕CSV⍵'S'}
Builtins ftw. If a filename were a valid input format, just ⎕CSV would do. The 'S' indicates that the input comes from a [S]tring.
Bash + ShellShoccar-jpn/Parsrs, 9 bytes
parsrc.sh
Trivial answer.
Test runs
~/wkdir $ parsrc.sh i.1|cat -A
1 1 1$
1 2 2$
2 1 abc$
2 2 xyz$
~/wkdir $ parsrc.sh i.2|cat -A
1 1 abc$
1 2 xyz$
2 1 a"bc$
2 2 ""$
3 1 ,$
3 2 ;$
4 1 \n$
4 2 $
~/wkdir $ parsrc.sh i.3|cat -A
1 1 \\\n,"$
~/wkdir $
Go, 112 bytes
import(."encoding/csv";S"strings")
func f(s string)[][]string{o,_:=NewReader(S.NewReader(s)).ReadAll()
return o}
Uses the CSV reader from the standard library.
- 112 bytes (The Thonnu, Apr 13, 2023)
Haskell + hgl, 66 bytes
g=χ '\''
gk$(g*>my((g<*g)#|nχ '\'')<*g)#|rX"[^,\n]*"$|$ʃ","$|$ʃ NL
Uses ' as a quote delimiter.
Haskell + hgl, 68 bytes (down from 71)
g=χ '"'
gk$(g*>my((g<*g)#|nχ '"')<*g)#|rX"[^,\n]*"$|$ʃ","$|$ʃ NL
Uses " as a quote delimiter.
Explanation
First we set up a parser that matches a single quote character; we are going to use it a bunch of times:
g=χ '"'
Now we build a parser for unquoted fields:
rX"[^,\n]*"
This uses a regex to parse any number of characters that are neither a comma nor a newline.
Next we have quoted fields, which are parsed with:
g*>my((g<*g)#|nχ '"')<*g
g*>...<*g means the field has to start and end with a quote, and that those quotes do not contribute to the value. g<*g parses two quotes but returns a single one. nχ '"' parses anything other than a quote character. We combine them to get a parser which either parses two quotes or a single non-quote character, but never a single quote. We use my to repeat this an arbitrary number of times.
Next we combine these two parsers with #|. #| means that it will try the first one (quoted field) and then try the second if and only if the first failed. That means if it finds a legal quoted field it won't bother with unquoted fields. This makes the field parser.
To make a record parser we just interleave ,s by adding $|$ʃ",". To make the CSV parser we just interleave newlines into our record parser with $|$ʃ NL.
Finally, once we have a parser, we use gk to request the first complete parse.
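To make the grammar concrete, here is a hedged hand-rolled sketch of the same structure in Python. It mirrors the grammar described above; it is not hgl code, and it assumes , as the field delimiter, " as the quote character, bare \n newlines, and well-formed input.
def parse(s):
 # Grammar mirrored from the explanation:
 # quoted = '"' ('""' -> '"' | non-'"')* '"'
 # field = quoted | unquoted; record = field (',' field)*
 # csv = record ('\n' record)*
 rows = [[]]
 i = 0
 while i < len(s):
 c = s[i]
 if c == '\n': # newline separates records
 rows.append([])
 i += 1
 elif c == ',': # comma separates fields
 i += 1
 elif c == '"': # quoted field
 i += 1
 out = []
 while True:
 if s[i] == '"' and i + 1 < len(s) and s[i + 1] == '"':
 out.append('"') # escaped quote: "" yields a single "
 i += 2
 elif s[i] == '"':
 i += 1 # closing quote
 break
 else:
 out.append(s[i])
 i += 1
 rows[-1].append(''.join(out))
 else: # unquoted field runs to the next comma or newline
 j = i
 while j < len(s) and s[j] not in ',\n':
 j += 1
 rows[-1].append(s[i:j])
 i = j
 return rows
For example, parse('"a""bc",""""') gives [['a"bc', '"']], matching the second record of the second test case.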
Reflection
This is very long. There's lots to improve here:
- χ '"' and ʃ"\"" and χ Nl and ʃ NL should all exist. These are common characters for code golf and it's ridiculous that I have to build these parsers manually.
- In an earlier version I used kB<fE to parse any character not in a blacklist. It ended up being shorter to use a regex for this. That means this should definitely exist! And to be honest, I'm a bit shocked it doesn't.
- There should be a version of cM which uses (#|). I really feel like there might be, but I can't find it.
- There should probably be something like sqs x y = x *> y <* x. This comes up pretty frequently.
- Regex does help save a couple of bytes here, but it could save way more. The unquoted field parser is a good deal bloated. The suggestions above would help trim things down, but it would be nice if we could have replaced it with a regex. But we can't, because hgl-regex currently can't do what that parser needs to do. The issue is that an hgl regex can only really take a chunk off the front of a string. It can't manipulate that chunk at all. That means g<*g, a parser which consumes 2 "s but returns only 1, is impossible to express in a regex. I have two solutions to this:
  - Add a _ modifier to the regex which causes the group to be omitted from the output. So (ab)_cd parses abcd but only returns the cd. This could be extended to include not just silencing a particular part but also replacing it with a new value. With this the parser could be:
    rX"\"_(\"_\"|^\")*\"_"
    or, if we chose to use ' as the quote delimiter:
    rX"'_('_'|^')*'_"
    Both options are shorter than the parser we have right now.
  - Add a way to pass a custom parser to be referenced inside a regex:
    rXw(i$>g)"/p(/p\"|^\")*/p"
    or, if we chose to use ' as the quote delimiter:
    rXw(g$>i)"/p(/p'|^')*/p"
    Both would save bytes, but not compared to the last solution; however, this method is way more flexible than the previous solution, since non-regex parsers can be made to do basically anything.
- Is this a custom library? (xnor, Apr 16, 2023)
- @xnor Yes. The auto-formatting doesn't include the library, so I accidentally lost it when I golfed the answer. (Apr 16, 2023)
Python 3, 59 bytes (down from 69)
lambda d:print([*csv.reader(io.StringIO(d))])
import csv,io
-10 with thanks to @Neil
Not sure if the output format is valid, as it does not exactly match the test cases (it uses single quotes around each element). Please comment if not and I can fix it with json.dumps().
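For reference, a hedged usage sketch of the lambda above (the output uses Python's single-quoted repr, as noted):
import csv, io
f = lambda d: print([*csv.reader(io.StringIO(d))])

f('1,2\nabc,xyz') # prints [['1', '2'], ['abc', 'xyz']]
f('"a""bc",""""') # prints [['a"bc', '"']]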
- [*csv.reader(io.StringIO(d))] saves you 10 bytes. (Neil, Apr 13, 2023)
",","
I think that one of them should be parsed as an unquoted ", but I'm not sure which.