I've created a CSV parser that tries to build a string table out of a CSV file. The goal is to handle CSV files as well as Excel does.
Input CSV file:
First field of first row,"This field is multiline but that's OK because it's enclosed in double qoutes, and this is an escaped "" double qoute" but this one "" is not "This is second field of second row, but it is not multiline because it doesn't start with an immediate double quote"
What Excel shows:
[Image: screenshot of how Excel displays the file above]
My code:
#include <cwchar>
#include <string>
#include <vector>

// Returns a pointer to the start of the next field, or zero if this is the
// last field in the CSV.
//   p          - the start position of the field
//   sep        - the separator used, e.g. comma or semicolon
//   newline    - set to whether the field ends with a newline rather than
//                with a separator
//   escapedEnd - set to the closing double-quote of a quoted field, or to
//                zero if the field is not quoted
const wchar_t* nextCsvField(const wchar_t *p, wchar_t sep, bool *newline, const wchar_t **escapedEnd)
{
    *escapedEnd = 0;
    *newline = false;
    // Parse quoted sequences
    if ('"' == p[0]) {
        p++;
        while (1) {
            // Find next double-quote
            p = wcschr(p, L'"');
            // If we don't find it, the quoted part is never closed, so
            // treat this as the last field. This check must come before
            // p is dereferenced.
            if (!p)
                return 0;
            // Check for "", it is an escaped double-quote. Anything else
            // means this double-quote closes the quoted part.
            if (p[1] != '"') {
                *escapedEnd = p;
                break;
            }
            // Skip the escaped double-quote
            p += 2;
        }
    }
    // Find next newline or separator.
    wchar_t newline_or_sep[4] = L"\n\r ";
    newline_or_sep[2] = sep;
    p = wcspbrk(p, newline_or_sep);
    // If no newline or separator, this is the last field.
    if (!p)
        return 0;
    // Check if we had newline.
    *newline = (p[0] == '\r' || p[0] == '\n');
    // Handle "\r\n", otherwise just increment
    if (p[0] == '\r' && p[1] == '\n')
        p += 2;
    else
        p++;
    return p;
}

typedef std::vector<std::vector<std::wstring> > StringTable;

// Parses the CSV data and constructs a StringTable
// from the fields in it.
StringTable parseCsv(const wchar_t *csvData, wchar_t sep)
{
    StringTable v;
    // Return immediately if the CSV data is empty.
    if (!csvData || !csvData[0])
        return v;
    v.resize(1);
    // Here we parse the CSV fields and fill the output StringTable.
    while (csvData) {
        // Call nextCsvField.
        bool newline;
        const wchar_t *escapedEnd;
        const wchar_t *next = nextCsvField(csvData, sep, &newline, &escapedEnd);
        // Add new field to the current row.
        v.back().resize(v.back().size() + 1);
        std::wstring &field = v.back().back();
        // If there is a part that is escaped with double-quotes, add it
        // (without the opening and closing double-quote, and with any ""
        // unescaped to "). After that csvData is set to the part immediately
        // after the closing double-quote, so anything after the closing
        // double-quote is added as well (but unescaped).
        if (escapedEnd) {
            for (const wchar_t *ii = csvData + 1; ii != escapedEnd; ii++) {
                field += *ii;
                if ('"' == ii[0] && '"' == ii[1])
                    ii++;
            }
            csvData = escapedEnd + 1;
        }
        // If there was no escaped part, or the CSV is malformed, add anything
        // else "as is" (i.e. unescaped). Keep in mind that next might be NULL.
        if (next) {
            // next points just past the terminator. For "\r\n" the terminator
            // is two characters long, so step back over the '\r' as well.
            const wchar_t *end = next - 1;
            if (end > csvData && end[-1] == L'\r' && end[0] == L'\n')
                --end;
            field.append(csvData, end);
        } else {
            field += csvData;
        }
        // If the field ends with a newline, add next row to the StringTable.
        if (newline) {
            if (field.empty())
                v.back().pop_back();
            v.resize(v.size() + 1);
        }
        // Set csvData to point to the start of the next field for the next
        // cycle.
        csvData = next;
    }
    // If the CSV ends with a newline, then there is an empty row added
    // (actually it's a row with a single empty string). We trim that empty
    // row here.
    if (v.back().empty() || (v.back().size() == 1 && v.back().front().empty()))
        v.pop_back();
    return v;
}
What I'm interested in:
Whether my code handles all the corner cases, especially empty elements, lines, etc., and any mistakes I might have made.
1 Answer
I'm not sure it's best to return 0 from nextCsvField(). Since the function is to return a pointer, consider returning NULL (or nullptr if you have C++11). This function also shouldn't return a const pointer if the return value (a valid wchar_t) will be modified.
You're using a "Yoda condition" here:
if ('"' == p[0])
This isn't too common, and it may still be prone to error. Either way, you should have your compiler warnings up high so that any accidental assignments in conditions will be reported.
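As a rough illustration of the conventional order, the same check might look like the sketch below. The name startsQuoted is hypothetical and exists only for this example. With GCC or Clang, -Wall includes a warning for an assignment used as a condition, which is the typo a Yoda condition guards against; here the pointee is also const, so an accidental assignment would not compile at all.

```cpp
#include <cassert>

// Hypothetical helper, not part of the original code: reports whether a
// field starts with a double-quote, written in the conventional order.
bool startsQuoted(const wchar_t *p)
{
    assert(p != nullptr);  // precondition assumed for this sketch
    // L'"' is a wide character literal, so both sides are wchar_t.
    if (p[0] == L'"') {
        return true;
    }
    return false;
}
```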
You should still use curly braces for single-line statements, as it could benefit maintenance.
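To make the points about the return value and the braces concrete, here is a minimal sketch, assuming C++11 is available. findSeparator is a hypothetical helper, not a replacement for nextCsvField(); it returns nullptr instead of 0 when nothing is found and puts braces around a single-statement branch.

```cpp
#include <cassert>
#include <cwchar>

// Hypothetical helper for illustration only (assumes C++11 for nullptr).
// Returns a pointer to the first occurrence of sep in p, or nullptr if
// there is none.
const wchar_t *findSeparator(const wchar_t *p, wchar_t sep)
{
    assert(p != nullptr);  // precondition assumed for this sketch
    const wchar_t *found = std::wcschr(p, sep);
    if (!found) {
        return nullptr;  // reads as "no pointer" more clearly than 0
    }
    return found;
}
```

The braces cost one extra line per branch, but a statement added to the branch later cannot silently end up outside the if.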