I thought this would be trivial, but I can't get it to work.

Assume a line in a CSV file: "Barack Obama", 48, "President", "1600 Penn Ave, Washington DC"

string[] tokens = line.Split(',');

I expect this:

 "Barack Obama"
 48
 "President"
 "1600 Penn Ave, Washington DC"

but the last token is 'Washington DC' not "1600 Penn Ave, Washington DC".

Is there an easy way to get the split function to ignore the comma within quotes?

I have no control over the CSV file, and it doesn't get sent to me. Customer A will be using the app to read files provided by an external individual.

Neil Knight
asked May 11, 2010 at 1:37
  • 1
    One option might be to use a different separator, like tab Commented May 11, 2010 at 1:41
  • The leader of the free world can't get his name spelled correctly? Commented May 11, 2010 at 1:53
  • I'm not suggesting that he has to USE it, but it's a free approach. Many people don't have a license or don't want to use paid software. Commented May 11, 2010 at 1:59
  • Benchmarks included in my answer. If anyone else wants me to benchmark a different solution, I'm happy to... Commented May 11, 2010 at 2:34
  • @Damovisa - Please see my comments for your post. Commented May 11, 2010 at 2:54

9 Answers

15

You might have to write your own split function.

  • Iterate through each char in the string
  • When you hit a " character, toggle a boolean
  • When you hit a comma: if the boolean is true (you are inside quotes), keep the comma as part of the current token; otherwise the current token is complete

Here's an example:

using System.Collections.Generic;
using System.Text;

public static class StringExtensions
{
    public static string[] SplitQuoted(this string input, char separator, char quotechar)
    {
        List<string> tokens = new List<string>();
        StringBuilder sb = new StringBuilder();
        bool escaped = false; // true while we are inside a quoted section
        foreach (char c in input)
        {
            if (c.Equals(separator) && !escaped)
            {
                // we have a token
                tokens.Add(sb.ToString().Trim());
                sb.Clear();
            }
            else if (c.Equals(separator) && escaped)
            {
                // separator inside quotes: not a split point, but add to string
                sb.Append(c);
            }
            else if (c.Equals(quotechar))
            {
                // entering or leaving a quoted section; the quote itself is kept
                escaped = !escaped;
                sb.Append(c);
            }
            else
            {
                sb.Append(c);
            }
        }
        // whatever is left after the last separator is the final token
        tokens.Add(sb.ToString().Trim());
        return tokens.ToArray();
    }
}

Then just call:

string[] tokens = line.SplitQuoted(',','\"');

Benchmarks

Results of benchmarking my code and Dan Tao's code are below. I'm happy to benchmark any other solutions if people want me to.

Code:

string input = "\"Barak Obama\", 48, \"President\", \"1600 Penn Ave, Washington DC\""; // Console.ReadLine()
string[] tokens = null;

// run tests
DateTime start = DateTime.Now;
for (int i = 0; i < 1000000; i++)
    tokens = input.SplitWithQualifier(',', '\"', false);
Console.WriteLine("1,000,000 x SplitWithQualifier = {0}ms", DateTime.Now.Subtract(start).TotalMilliseconds);

start = DateTime.Now;
for (int i = 0; i < 1000000; i++)
    tokens = input.SplitQuoted(',', '\"');
Console.WriteLine("1,000,000 x SplitQuoted = {0}ms", DateTime.Now.Subtract(start).TotalMilliseconds);

Output:

1,000,000 x SplitWithQualifier = 8156.25ms
1,000,000 x SplitQuoted = 2406.25ms
answered May 11, 2010 at 1:50

14 Comments

Right, because no one else has ever run into this problem before. :)
Imagine that you have a CSV file with at least 5,000 lines. You would iterate through every char of every line? A simple function becomes an expensive operation!
Well yes, a google search will give him obvious answers. And @Erup, of course you have to iterate through every character... how do you parse an entire string otherwise?!
You don't have to do it. Look at Amry's answer (if you don't like mine).
If you remove sb.Append(c) from inside the else if (c.Equals(quotechar)) block then your output will be stripped of the text qualifiers as well. Great string extension, thanks for sharing!
14

I have a SplitWithQualifier extension method that I use here and there, which utilizes Regex.

I make no claim as to the robustness of this code, but it has worked all right for me for a while.

// mangled code horribly to fit without scrolling
using System.Linq;
using System.Text.RegularExpressions;

public static class CsvSplitter
{
    public static string[] SplitWithQualifier(this string text,
                                              char delimiter,
                                              char qualifier,
                                              bool stripQualifierFromResult)
    {
        // Match a delimiter only when an even number of qualifiers follows it,
        // i.e. when the delimiter is not inside a qualified (quoted) section.
        string pattern = string.Format(
            @"{0}(?=(?:[^{1}]*{1}[^{1}]*{1})*(?![^{1}]*{1}))",
            Regex.Escape(delimiter.ToString()),
            Regex.Escape(qualifier.ToString())
        );
        string[] split = Regex.Split(text, pattern);
        if (stripQualifierFromResult)
            return split.Select(s => s.Trim().Trim(qualifier)).ToArray();
        else
            return split;
    }
}

Usage:

string csv = "\"Barak Obama\", 48, \"President\", \"1600 Penn Ave, Washington DC\"";
string[] values = csv.SplitWithQualifier(',', '\"', true);
foreach (string value in values)
 Console.WriteLine(value);

Output:

Barak Obama
48
President
1600 Penn Ave, Washington DC
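
For anyone wondering how the pattern works, here is a minimal sketch with ',' and '"' already substituted into it. The class name QuoteAwareSplitDemo is made up for this illustration. The idea: a comma is a split point only if the rest of the line contains an even number of quotes. The lookahead first consumes zero or more complete pairs of quotes, and the negative lookahead then insists that no further quote follows. A comma inside a quoted field always has an odd number of quotes after it, so it is never matched.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuoteAwareSplitDemo
{
    // ,                      the delimiter itself
    // (?=                    ...but only if what follows is:
    //   (?:[^"]*"[^"]*")*    zero or more complete pairs of quotes
    //   (?![^"]*")           and then no further quote at all
    // )
    public const string Pattern = ",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))";

    public static string[] Split(string line)
    {
        // No capturing groups are used, so Regex.Split returns only the pieces.
        return Regex.Split(line, Pattern);
    }
}
```

Calling QuoteAwareSplitDemo.Split on the sample line yields four pieces. The quotes and the leading spaces are still attached, which is why the extension method above trims them when stripQualifierFromResult is true.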
answered May 11, 2010 at 2:08

2 Comments

I like it - probably better than mine!
This is awesome man! Given that so many people have so many options for attacking this problem - and this is one of the best, do you mind expounding on how your regex works? Thanks so much.
5

Looking at the bigger picture, you are actually trying to parse CSV input. So instead of advising on how to split the string properly, I would recommend using a CSV parser for this kind of thing.

A Fast CSV Reader

One that I would recommend is the library (source code available) that you can get from this CodeProject page: http://www.codeproject.com/KB/database/CsvReader.aspx

I use it myself and like it. It's native .NET code and a lot faster than using OLEDB (which can also do the CSV parsing for you, but believe me, it's slow).

answered May 11, 2010 at 1:54

3 Comments

I'll look into this. Thanks. I like the idea of a CsvReader, so this is definitely something I'll add to my toolbox.
I use this library as well. It's pretty good, although I've had some minor issues with extreme edge cases.
@Emil Lerch - can you give some examples? I'm sure the problems can be fixed if you point them out.
1

You should be using Microsoft.VisualBasic.FileIO.TextFieldParser for that. It will handle all the CSV stuff correctly for you, see: A similar question with example using the TextFieldParser

PS: Do not fear using the Microsoft.VisualBasic dll in a C# project, it's all .NET :-)
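
As a rough sketch of what that could look like for a single line (the class and method names CsvLineParser and ParseLine are made up for this illustration, and on .NET Framework the project needs a reference to Microsoft.VisualBasic.dll):

```csharp
using System;
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class CsvLineParser
{
    // Parses one CSV line into its fields using TextFieldParser.
    public static string[] ParseLine(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            // Treat commas inside double quotes as part of the field.
            parser.HasFieldsEnclosedInQuotes = true;
            parser.TrimWhiteSpace = true;
            return parser.ReadFields();
        }
    }
}
```

For a whole file, you would instead pass the file path to the TextFieldParser constructor and call ReadFields() in a loop until EndOfData is true. Note that the parser returns the field contents without the surrounding quotes.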

answered May 11, 2010 at 2:28

Comments

0

You can't parse a CSV line with a simple Split on commas, because some cell contents will contain commas that aren't meant to delineate data but are actually part of the cell contents themselves.

Here is a link to a simple regex-based C# method that will convert your CSV into a handy DataTable:

http://www.hotblue.com/article0000.aspx?a=0006

Working with DataTables is very easy - let me know if you need a code sample for that.

answered May 11, 2010 at 1:51

1 Comment

I'm going with Dan Tao for now. Thanks for your time.
0

That would be the expected behavior as quotes are just another string character in C#. Looks like what you are after is the quoted tokens or numeric tokens.

I think you might need to use Regex to split the strings, unless someone else knows a better way.

Or you could just loop through the string one character at a time building up the string as you go and build the tokens that way. It's old school but may be the most reliable way in your case.
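
A minimal sketch of that loop, under a few assumptions of mine: the separator is a comma, the quote character is a double quote, a doubled quote ("") inside a quoted field stands for one literal quote, and the surrounding quotes are dropped from the result. The names ManualCsvSplitter and SplitLine are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ManualCsvSplitter
{
    public static List<string> SplitLine(string line)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();
        bool inQuotes = false;

        for (int i = 0; i < line.Length; i++)
        {
            char c = line[i];
            if (c == '"')
            {
                if (inQuotes && i + 1 < line.Length && line[i + 1] == '"')
                {
                    // Doubled quote inside a quoted field: emit one literal quote.
                    current.Append('"');
                    i++;
                }
                else
                {
                    // Opening or closing quote: toggle state, do not emit it.
                    inQuotes = !inQuotes;
                }
            }
            else if (c == ',' && !inQuotes)
            {
                // Separator outside quotes ends the current token.
                tokens.Add(current.ToString().Trim());
                current.Length = 0;
            }
            else
            {
                current.Append(c);
            }
        }

        // The text after the last separator is the final token.
        tokens.Add(current.ToString().Trim());
        return tokens;
    }
}
```

One limitation of this sketch: Trim() also removes leading and trailing whitespace that was inside the quotes, so it is not suitable if that whitespace matters.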

answered May 11, 2010 at 1:48

Comments

0

I would recommend using a regular expression instead. It will allow you to extract more complicated substrings in a much more versatile manner (precisely as you want).
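
For instance, rather than splitting on commas you can match the tokens themselves. The following is only a sketch with a pattern I made up for the sample line in the question (RegexTokenExtractor and Extract are illustrative names): a token is either a complete quoted string, or a run of characters that starts with a non-space and contains no comma or quote.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexTokenExtractor
{
    // "[^"]*"          a quoted token, commas inside it included
    // [^,"\s][^,"]*    an unquoted token, starting at its first non-space char
    private static readonly Regex TokenPattern =
        new Regex("\"[^\"]*\"|[^,\"\\s][^,\"]*");

    public static List<string> Extract(string line)
    {
        var tokens = new List<string>();
        foreach (Match m in TokenPattern.Matches(line))
        {
            // Trim removes trailing spaces from unquoted tokens.
            tokens.Add(m.Value.Trim());
        }
        return tokens;
    }
}
```

The quoted tokens keep their quotes, as in the expected output in the question. Be aware that this pattern silently skips empty fields and does not understand doubled quotes inside a quoted field, so it is not a general CSV parser.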

http://www.c-sharpcorner.com/uploadfile/prasad_1/regexppsd12062005021717am/regexppsd.aspx

http://oreilly.com/windows/archive/csharp-regular-expressions.html

answered May 11, 2010 at 2:02

Comments

-1

Can't you change how the CSV is generated? Using OpenOffice, you can set the char separator (use ;) and how the string is delimited (using " or ').

It would be like this: 'President';'1600 Penn Ave, Washington DC'

answered May 11, 2010 at 1:47

4 Comments

Just perform it on the "Save As" operation!
No matter what delimiter you choose, you still have to worry about it being in the data somewhere. Besides, he might get the CSV files from some external source he has no control over.
Yes. But it's a quick alternative if he can open and change it.
No, it's too late to change the format of the CSV file (as in seven years too late)
-2

string temp = line.Replace( "\"", "" );

string[] tokens = temp.Split(',');

answered May 11, 2010 at 1:57

2 Comments

That will definitely not do what he wants - he wants to keep the quoted strings intact - this code will remove quotes and then split them anyway
I can't modify the original strings.
