Mostly for readability's sake, I am using the following code to:
- find a specific CSV file on a drive
- read that CSV file using a StreamReader
- parse it into a List<string[]> so I can use LINQ and/or PLINQ later on

However, the process takes about 4-5 seconds, which is simply too long. Any suggestions on how to improve the following (or maybe even replace it)?
var query = (from x in Directory.GetFiles(_path, "*.csv").AsParallel()
             where x.Contains(dateTime)
             select x).First();
#region Read CSV File
List<string[]> _tempList = new List<string[]>();
FileStream stream = new FileStream(query, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
using (StreamReader reader = new StreamReader(stream, Encoding.Default, true, 1024))
{
    string currentLine;
    while ((currentLine = reader.ReadLine()) != null)
    {
        string[] temp = currentLine.Split(new char[] { ',' }, StringSplitOptions.None);
        _tempList.Add(temp);
    }
}
#endregion
The order inside the CSV file is important: the file will contain between 2,000 and 7,000 rows of 25 entries each.
The CSV file has several fields which I am not interested in, and I don't necessarily need them in my List<string[]> (or string[][]). I have tried to filter them out with a LINQ statement similar to the following pseudo code:
var query = (from x in MyListOfStringArray
             select new { Col1 = x[1], Col2 = x[4] }).ToArray();
This approach results in a strange type for query, and for that matter it was pretty slow (I would have an array with about 9-11 properties). Any ideas on that second issue?
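(The "strange type" here is an array of an anonymous type; a projection that keeps string[] as the element type, with the same example indices, would look like this:)

// sketch: project only the wanted columns, keeping string[] as the element type
var filtered = MyListOfStringArray.Select(x => new[] { x[1], x[4] })
                                  .ToList();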
2 Answers
var query = (from x in Directory.GetFiles(_path, "*.csv").AsParallel()
             where x.Contains(dateTime)
             select x).First();
If Directory.GetFiles(_path, "*.csv") doesn't return any files where x.Contains(dateTime), this line will blow up with an InvalidOperationException, which might be a bit off-putting / surprising.
I find the query would be clearer in 100% method syntax; this would be equivalent:
var query = Directory.GetFiles(_path, "*.csv")
.AsParallel()
.Where(name => name.Contains(dateTime))
.First();
The name query is wrong/misleading. It would be a query if it were still a deferred query, a ParallelQuery<string> (i.e. up to right after the .Where() call)... but it's not: var actually stands for string, and what you call query is actually a fileName. .First() materializes the query and returns the first element... or throws an exception.
It would be better to call .FirstOrDefault(), which would make your result null if no files matched your search criteria.
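For example (a sketch using the same names as above), the predicate overload of FirstOrDefault collapses the Where call and makes the "no match" case explicit:

string fileName = Directory.GetFiles(_path, "*.csv")
                           .AsParallel()
                           .FirstOrDefault(name => name.Contains(dateTime));

if (fileName == null)
{
    // no matching file: handle this case instead of letting First() throw
}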
I don't understand why you're using var here, given that you're using explicit types everywhere else... especially since this var is just a string. Don't get me wrong though: I prefer seeing var here... and everywhere else.
#region Read CSV File
Using #region inside a method clearly says your method is doing more than one thing. Whenever you do that, you're missing a refactoring opportunity. Consider extracting a method here:
private string GetFileName(string formattedDateTime)
{
    return Directory.GetFiles(_path, "*.csv")
                    .AsParallel()
                    .Where(name => name.Contains(formattedDateTime))
                    .FirstOrDefault();
}
(lacking context to recommend passing in an actual DateTime)
That leaves you with a method that (hopefully) now only has this #region block, which is now completely redundant and can be discarded (if there's still more to it, extract methods as needed).
That said...
FileStream stream = new FileStream(query, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
using (StreamReader reader = new StreamReader(stream, Encoding.Default, true, 1024))
Disposing the StreamReader will also cleanly close and dispose the FileStream, but that is far from obvious unless you know it; and if Jon Skeet says it's best practice to put the FileStream in its own using block, then I can only agree with him.
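For reference, a sketch of that pattern with the original arguments:

using (var stream = new FileStream(query, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (var reader = new StreamReader(stream, Encoding.Default, true, 1024))
{
    // read from reader as before; both objects are disposed, in reverse order
}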
Your code can use a variable (declared outside the loop's body) for the value separator: you're creating a new character array for every single row you're reading!
char[] separator = new char[] { ',' };
Or more succinctly:
var separator = new[] { ',' };
This makes the temp assignment even clearer/cleaner:
string[] temp = currentLine.Split(separator, StringSplitOptions.None);
Looking at the currentLine assignment, it's far from obvious what's going on here:
while ((currentLine = reader.ReadLine()) != null)
Using the result of an assignment rarely makes for easily readable code. I like how it cleverly combines reading the next line into currentLine and verifying whether that's a null value, though... so I'm not sure I'd recommend rewriting that part after all.
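For comparison, a possible rewrite without the assignment-in-condition, at the cost of calling ReadLine in two places:

string currentLine = reader.ReadLine();
while (currentLine != null)
{
    // ... process currentLine ...
    currentLine = reader.ReadLine();
}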
It's not clear what happens to your List<string[]> when you're done reading the file, so I'll assume you're just returning it. In that case, you could consider yielding the results instead:
private IEnumerable<string[]> ReadCsv(string fileName)
{
    char[] separator = new[] { ',' };
    string currentLine;
    using (var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    using (var reader = new StreamReader(stream, Encoding.Default, true, 1024))
    {
        while ((currentLine = reader.ReadLine()) != null)
        {
            yield return currentLine.Split(separator, StringSplitOptions.None);
        }
    }
}
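A hypothetical call site, for illustration: enumeration is lazy, so the file stays open until the caller finishes iterating, and calling ToList() reads everything up front.

// forces the whole file to be read and the reader to be disposed immediately
List<string[]> rows = ReadCsv(fileName).ToList();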
- Amazing! Thank you so much! I was really not prepared for both the quality and dedication of your post. – Aeolus, Sep 5, 2014 at 20:39
- Always a pleasure! Make sure you take @mjolka's answer into account as well, he's got very good points too :) – Mathieu Guindon, Sep 5, 2014 at 20:47
Encoding.Default is possibly not the correct encoding for the file (and may change over time). The detectEncodingFromByteOrderMarks parameter you pass to the StreamReader constructor only helps if the file starts with a byte-order mark. You should know how your files are encoded, and pass the appropriate encoding, e.g. Encoding.UTF8.
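For example, if you know the files are UTF-8 (an assumption here), the reader could be constructed like this:

using (var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (var reader = new StreamReader(stream, Encoding.UTF8))
{
    // read as before; no byte-order-mark detection is needed when the encoding is known
}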
A more concise way to read the lines of a file is File.ReadLines, though it doesn't specify ReadWrite share access (which is something I'm not sure you want, either).
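A sketch of that more concise version, assuming UTF-8 content and that losing the ReadWrite share is acceptable:

// File.ReadLines streams the file lazily; Split uses the params char[] overload
List<string[]> rows = File.ReadLines(fileName, Encoding.UTF8)
                          .Select(line => line.Split(','))
                          .ToList();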
As for timing, I generated a test file of 7,000 lines, 25 entries per line, and your code runs in < 0.02 s (not including query). So my guess is that it's query that is taking most of your time. Try it without AsParallel() and see if that improves performance.
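That is, something along these lines (a sketch of the sequential version):

// sequential search; for a directory listing this small, the overhead of
// partitioning work across threads likely outweighs any gain
string fileName = Directory.GetFiles(_path, "*.csv")
                           .FirstOrDefault(name => name.Contains(dateTime));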
- Thank you for your advice. I must admit that I don't know anything about encoding or what a byte-order mark is. I require the share access: the file is being written to by someone else. – Aeolus, Sep 5, 2014 at 20:41
- @Aeolus Suppose you're writing code to read a video file. You would need to know the format of the file: is it AVI, MP4, Ogg, etc.? Similarly, when reading a text file you need to know the file's encoding: is it UTF-8, UTF-16, US-ASCII, etc.? – mjolka, Sep 8, 2014 at 4:01
- Is there an easy way to find out? My very short analysis leads to the conclusion that in 80% of all cases UTF-8 is a pretty solid guess. – Aeolus, Sep 13, 2014 at 17:03