
Mostly due to readability, I am using the following code to

  1. find a specific csv file on a drive
  2. read that CSV file using a StreamReader
  3. parse it into a List<string[]> to be able to use LINQ and/or PLINQ later on

However, the process takes about 4-5 seconds, which is, simply put, too long. Any suggestions on how to improve the following (or maybe even replace it)?

var query = (from x in Directory.GetFiles(_path, "*.csv").AsParallel()
             where x.Contains(dateTime)
             select x).First();

#region Read CSV File
List<string[]> _tempList = new List<string[]>();
FileStream stream = new FileStream(query, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
using (StreamReader reader = new StreamReader(stream, Encoding.Default, true, 1024))
{
    string currentLine;
    while ((currentLine = reader.ReadLine()) != null)
    {
        string[] temp = currentLine.Split(new char[] { ',' }, StringSplitOptions.None);
        _tempList.Add(temp);
    }
}
#endregion

The order inside the CSV file is important - the file will contain between 2,000 and 7,000 rows of 25 entries each.

The CSV file has several fields which I am not interested in - and I don't necessarily need them in my List<string[]> (or string[][]). I have tried to filter them out with a LINQ statement similar to the following pseudo code:

var query = from x in MyListOfStringArray
            select new { Col1 = x[1], Col2 = x[4] }.ToArray();

This approach results in a strange type for query - and, for that matter, it was pretty slow (in practice I would select about 9-11 properties per row). Any ideas on that second issue?

Heslacher
asked Sep 4, 2014 at 20:28
  • Are you sure you're timing the correct part of the code (i.e. not including query)? I generated a test file of 7,000 lines, 25 entries per line, and your code runs in < 0.02 s. - Commented Sep 5, 2014 at 1:49

2 Answers

var query = (from x in Directory.GetFiles(_path, "*.csv").AsParallel()
             where x.Contains(dateTime)
             select x).First();

If Directory.GetFiles(_path, "*.csv") doesn't return any files where x.Contains(dateTime), this line will blow up with an InvalidOperationException that might be a bit off-putting / surprising.

I find the query would be clearer in pure method syntax - this would be equivalent:

var query = Directory.GetFiles(_path, "*.csv")
    .AsParallel()
    .Where(name => name.Contains(dateTime))
    .First();

The name query is wrong/misleading. It would be a query if it were an IQueryable<string> (i.e. up to right after the .Where() call)... but it's not, var actually stands for string and what you call query is actually a fileName: .First() materializes the query, and returns the first element... or throws an exception.

It would be better to call .FirstOrDefault(), which would make your result null if no files matched your search criteria.
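
A null-safe version of the lookup might then look like this sketch (_path and dateTime are the variables from the question; the exception message is illustrative):

```csharp
// Sketch: FirstOrDefault() returns null instead of throwing when nothing matches,
// so the "no file found" case can be handled explicitly.
var fileName = Directory.GetFiles(_path, "*.csv")
    .AsParallel()
    .Where(name => name.Contains(dateTime))
    .FirstOrDefault();

if (fileName == null)
{
    // react however makes sense here - throw, log, return early...
    throw new FileNotFoundException("No CSV file matching the given date was found.");
}
```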

I don't understand why you're using var here, given that you're using explicit types everywhere else... especially since var here is just a string. Don't get me wrong though: I prefer seeing var here... and everywhere else.


#region Read CSV File

Using #region inside a method clearly says your method is doing more than one thing. Whenever you do that, you're missing a refactoring opportunity. Consider extracting a method here:

private string GetFileName(string formattedDateTime)
{
    return Directory.GetFiles(_path, "*.csv")
        .AsParallel()
        .Where(name => name.Contains(formattedDateTime))
        .FirstOrDefault();
}

(lacking context to recommend passing in an actual DateTime)

That leaves you with a method that (hopefully) now only has this #region block, which is now completely redundant and can be discarded (if there's still more to it, extract methods as needed).

That said...

FileStream stream = new FileStream(query, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
using (StreamReader reader = new StreamReader(stream, Encoding.Default, true, 1024))

Disposing the StreamReader will also cleanly close and dispose the FileStream, but that is far from obvious unless you know it, and if Jon Skeet says it's best practice to put it in its own using block, then I can only agree with him.


Your code can use a variable (declared outside the loop's body) for the value separator: you're creating a new character array for every single row you're reading!

char[] separator = new char[] { ',' };

Or more succinctly:

var separator = new[] { ',' };

This makes the temp assignment even clearer/cleaner:

string[] temp = currentLine.Split(separator, StringSplitOptions.None);

Looking at the currentLine assignment, it's far from obvious what's going on here:

while ((currentLine = reader.ReadLine()) != null)

Using the result of an assignment rarely results in easily readable code. I like how it cleverly combines reading the next line into currentLine and verifying whether that's a null value though... so I'm not sure I'd recommend rewriting that part after all.


It's not clear what happens to your List<string[]> when you're done reading the file, so I'll assume you're just returning it. In that case, you could consider yielding the results instead:

private IEnumerable<string[]> ReadCsv(string fileName)
{
    char[] separator = new[] { ',' };
    string currentLine;
    using (var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    using (var reader = new StreamReader(stream, Encoding.Default, true, 1024))
    {
        while ((currentLine = reader.ReadLine()) != null)
        {
            yield return currentLine.Split(separator, StringSplitOptions.None);
        }
    }
}
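
Because ReadCsv is lazy, the column filtering from the question composes naturally on top of it. A sketch (column indices 1 and 4 are taken from the question's pseudo code):

```csharp
// Sketch: project only the columns of interest; nothing is read or split
// until the query is enumerated, and ToList() materializes the result once.
var selectedColumns = ReadCsv(fileName)
    .Select(fields => new[] { fields[1], fields[4] })
    .ToList();
```

Selecting into a string[] (rather than an anonymous type) keeps the result a plain List<string[]>, which sidesteps the "strange type" issue from the question.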
answered Sep 5, 2014 at 1:53
  • Amazing! Thank you so much! I was really not prepared for both the quality and dedication of your post. - Commented Sep 5, 2014 at 20:39
  • Always a pleasure! Make sure you take @mjolka's answer into account as well, he's got very good points too :) - Commented Sep 5, 2014 at 20:47

Encoding.Default is possibly not the correct encoding for the file (and may change over time). The detectEncodingFromByteOrderMarks parameter you pass to the StreamReader constructor only helps if the file starts with a byte-order mark. You should know how your files are encoded, and pass the appropriate encoding, e.g. Encoding.UTF8.
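
If the files are known to be UTF-8, the reader construction might look like this sketch (fileName is assumed):

```csharp
// Sketch: an explicit encoding removes the dependency on the machine's
// Encoding.Default, which can differ from system to system.
using (var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (var reader = new StreamReader(stream, Encoding.UTF8, true, 1024))
{
    // read as before; byte-order-mark detection still kicks in if a BOM is present
}
```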

A more concise way to read the lines of a file is File.ReadLines, though it doesn't specify ReadWrite share access (which is something I'm not sure you want, either).
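
If the sharing mode turns out not to matter, the whole read-and-split loop collapses to something like this sketch:

```csharp
// Sketch: File.ReadLines streams the file lazily (unlike File.ReadAllLines,
// which loads everything up front). Note: no FileShare.ReadWrite here.
var rows = File.ReadLines(fileName)
    .Select(line => line.Split(','))
    .ToList();
```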

As for timing, I generated a test file of 7,000 lines, 25 entries per line, and your code runs in < 0.02s (not including query). So my guess is that it's query that is taking most of your time. Try it without AsParallel() and see if that improves performance.
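
To confirm where the time actually goes, the two phases can be timed separately, e.g. with a Stopwatch (a sketch reusing the question's variables):

```csharp
// Sketch: measure the directory search and the file read independently.
var sw = Stopwatch.StartNew();
var fileName = Directory.GetFiles(_path, "*.csv")
    .First(name => name.Contains(dateTime)); // also try this without AsParallel()
Console.WriteLine("search: {0} ms", sw.ElapsedMilliseconds);

sw.Restart();
var rows = File.ReadLines(fileName)
    .Select(line => line.Split(','))
    .ToList();
Console.WriteLine("read:   {0} ms", sw.ElapsedMilliseconds);
```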

answered Sep 5, 2014 at 2:23
  • Thank you for your advice; I must admit that I don't know anything about encoding or what a byte-order mark is. I require the share access - the file is being written to by someone else. - Commented Sep 5, 2014 at 20:41
  • @Aeolus Suppose you're writing code to read a video file. You would need to know the format of the file: is it AVI, MP4, Ogg, etc.? Similarly, when reading a text file you need to know the file's encoding: is it UTF-8, UTF-16, US-ASCII, etc.? - Commented Sep 8, 2014 at 4:01
  • Is there an easy way to find out? My very short analysis leads to the conclusion that in 80% of all cases UTF-8 is a pretty solid guess. - Commented Sep 13, 2014 at 17:03
