I thought this will be trivial but I can't get this to work.
Assume a line in a CSV file:
"Barack Obama", 48, "President", "1600 Penn Ave, Washington DC"
string[] tokens = line.split(',')
I expect this:
"Barack Obama"
48
"President"
"1600 Penn Ave, Washington DC"
but the last token is
'Washington DC' not
"1600 Penn Ave, Washington DC".
Is there an easy way to get the split function to ignore the comma within quotes?
I have no control over the CSV file and it doesn;t get sent to me. Customer A will be using the app to read files provided by an external individual.
-
1One option might be to use a different separator, like tabDimitar– Dimitar2010年05月11日 01:41:08 +00:00Commented May 11, 2010 at 1:41
-
The leader of the free world can't get his name spelled correctly?MusiGenesis– MusiGenesis2010年05月11日 01:53:39 +00:00Commented May 11, 2010 at 1:53
-
Im not suggesting that he have to USE it. But its a free approach. Many people dont have license or dont want to use a paid software.Erup– Erup2010年05月11日 01:59:59 +00:00Commented May 11, 2010 at 1:59
-
Benchmarks included in my answer. If anyone else wants me to benchmark a different solution, I'm happy to...Damovisa– Damovisa2010年05月11日 02:34:12 +00:00Commented May 11, 2010 at 2:34
-
@Damovisa - Please see my comments for your post.ritu– ritu2010年05月11日 02:54:45 +00:00Commented May 11, 2010 at 2:54
9 Answers 9
You might have to write your own split function.
- Iterate through each char in the string
- When you hit a
"character, toggle a boolean - When you hit a comma, if the bool is true, ignore it, else, you have your token
Here's an example:
public static class StringExtensions
{
public static string[] SplitQuoted(this string input, char separator, char quotechar)
{
List<string> tokens = new List<string>();
StringBuilder sb = new StringBuilder();
bool escaped = false;
foreach (char c in input)
{
if (c.Equals(separator) && !escaped)
{
// we have a token
tokens.Add(sb.ToString().Trim());
sb.Clear();
}
else if (c.Equals(separator) && escaped)
{
// ignore but add to string
sb.Append(c);
}
else if (c.Equals(quotechar))
{
escaped = !escaped;
sb.Append(c);
}
else
{
sb.Append(c);
}
}
tokens.Add(sb.ToString().Trim());
return tokens.ToArray();
}
}
Then just call:
string[] tokens = line.SplitQuoted(',','\"');
Benchmarks
Results of benchmarking my code and Dan Tao's code are below. I'm happy to benchmark any other solutions if people want them?
Code:
string input = "\"Barak Obama\", 48, \"President\", \"1600 Penn Ave, Washington DC\""; // Console.ReadLine()
string[] tokens = null;
// run tests
DateTime start = DateTime.Now;
for (int i = 0; i < 1000000; i++)
tokens = input.SplitWithQualifier(',', '\"', false);
Console.WriteLine("1,000,000 x SplitWithQualifier = {0}ms", DateTime.Now.Subtract(start).TotalMilliseconds);
start = DateTime.Now;
for (int i = 0; i<1000000;i++)
tokens = input.SplitQuoted(',', '\"');
Console.WriteLine("1,000,000 x SplitQuoted = {0}ms", DateTime.Now.Subtract(start).TotalMilliseconds);
Output:
1,000,000 x SplitWithQualifier = 8156.25ms
1,000,000 x SplitQuoted = 2406.25ms
14 Comments
sb.Append(c) from inside the else if (c.Equals(quotechar)) block then your output will be stripped of the text qualifiers as well. Great string extension, thanks for sharing!I have a SplitWithQualifier extension method that I use here and there, which utilizes Regex.
I make no claim as to the robustness of this code, but it has worked all right for me for a while.
// mangled code horribly to fit without scrolling
public static class CsvSplitter
{
public static string[] SplitWithQualifier(this string text,
char delimiter,
char qualifier,
bool stripQualifierFromResult)
{
string pattern = string.Format(
@"{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))",
Regex.Escape(delimiter.ToString()),
Regex.Escape(qualifier.ToString())
);
string[] split = Regex.Split(text, pattern);
if (stripQualifierFromResult)
return split.Select(s => s.Trim().Trim(qualifier)).ToArray();
else
return split;
}
}
Usage:
string csv = "\"Barak Obama\", 48, \"President\", \"1600 Penn Ave, Washington DC\"";
string[] values = csv.SplitWithQualifier(',', '\"', true);
foreach (string value in values)
Console.WriteLine(value);
Output:
Barak Obama
48
President
1600 Penn Ave, Washington DC
I see from the bigger picture that you are actually trying to parse CSV input. So instead of advising on how to split the string properly, I would instead recommend you to use a CSV parser to do this kind of thing.
A Fast CSV Reader
One that I would recommend is the library (source code available) that you can get from this CodeProject page: http://www.codeproject.com/KB/database/CsvReader.aspx
I personally use it myself and like it. It's a .NET native code and a lot faster than using OLEDB (which also can do the CSV parsing for you, but believe me, it's slow).
3 Comments
You should be using Microsoft.VisualBasic.FileIO.TextFieldParser for that. It will handle all the CSV stuff correctly for you, see: A similar question with example using the TextFieldParser
PS: Do not fear using the Microsoft.VisualBasic dll in a C# project, it's all .NET :-)
Comments
You can't parse a CSV line with a simple Split on commas, because some cell contents will contain commas that aren't meant to delineate data but are actually part of the cell contents themselves.
Here is a link to a simple regex-based C# method that will convert your CSV into a handly DataTable:
http://www.hotblue.com/article0000.aspx?a=0006
Working with DataTables is very easy - let me know if you need a code sample for that.
1 Comment
That would be the expected behavior as quotes are just another string character in C#. Looks like what you are after is the quoted tokens or numeric tokens.
I think you might need to use Regex to split the strings unless some one else knows a better way.
Or you could just loop through the string one character at a time building up the string as you go and build the tokens that way. It's old school but may be the most reliable way in your case.
Comments
I would recommend using a regular expression instead. It will allow you to extract more complicated substrings in a much more versatile manner (precisely as you want).
http://www.c-sharpcorner.com/uploadfile/prasad_1/regexppsd12062005021717am/regexppsd.aspx
http://oreilly.com/windows/archive/csharp-regular-expressions.html
Comments
Can't you change how the CSV is generated? Using OpenOffice, you can set the char separator (use ;) and how the string is delimited (using " or ').
It would be like this: 'President';'1600 Penn Ave, Washington DC'
4 Comments
string temp = line.Replace( "\"", "" );
string[] tokens = temp.Split(',')