Mostly for readability's sake, I am using the following code to:
- find a specific CSV file on a drive
- read that CSV file using a StreamReader
- parse it into a List<string[]> so I can use LINQ and/or PLINQ later on

However, the process takes about 4-5 seconds, which is simply too long. Any suggestions on how to improve the following (or maybe even replace it)?
var query = (from x in Directory.GetFiles(_path, "*.csv").AsParallel()
             where x.Contains(dateTime)
             select x).First();
#region Read CSV File
List<string[]> _tempList = new List<string[]>();
FileStream stream = new FileStream(query, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
using (StreamReader reader = new StreamReader(stream, Encoding.Default, true, 1024))
{
    string currentLine;
    while ((currentLine = reader.ReadLine()) != null)
    {
        string[] temp = currentLine.Split(new char[] { ',' }, StringSplitOptions.None);
        _tempList.Add(temp);
    }
}
#endregion
The order inside the CSV file is important: the file will contain between 2,000 and 7,000 rows of 25 entries each.
The CSV file has several fields which I am not interested in, and I don't necessarily need them in my List<string[]> (or string[][]). I have tried to filter them out with a LINQ statement similar to the following pseudo code:
var query = (from x in MyListOfStringArray
             select new { Col1 = x[1], Col2 = x[4] }).ToArray();
This approach results in a strange type for query, and for that matter it was pretty slow (I would have an array with about 9-11 properties). Any ideas on that second issue?
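(The "strange type" here is an array of an anonymous type; a projection that keeps string[] as the element type, with the same example indices, would look like this:)

// sketch: project only the wanted columns, keeping string[] as the element type
var filtered = MyListOfStringArray.Select(x => new[] { x[1], x[4] })
                                  .ToList();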
2 Answers
var query = (from x in Directory.GetFiles(_path, "*.csv").AsParallel()
             where x.Contains(dateTime)
             select x).First();
If Directory.GetFiles(_path, "*.csv") doesn't return any files where x.Contains(dateTime), this line will blow up with an InvalidOperationException, which might be a bit off-putting / surprising.
I find the query would be clearer in 100% method syntax; this would be equivalent:
var query = Directory.GetFiles(_path, "*.csv")
.AsParallel()
.Where(name => name.Contains(dateTime))
.First();
The name query is wrong/misleading. It would be a query if it were still a deferred query, a ParallelQuery<string> (i.e. up to right after the .Where() call)... but it's not: var actually stands for string, and what you call query is actually a fileName. .First() materializes the query and returns the first element... or throws an exception.
It would be better to call .FirstOrDefault(), which would make your result null if no files matched your search criteria.
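For example (a sketch using the same names as above), the predicate overload of FirstOrDefault collapses the Where call and makes the "no match" case explicit:

string fileName = Directory.GetFiles(_path, "*.csv")
                           .AsParallel()
                           .FirstOrDefault(name => name.Contains(dateTime));

if (fileName == null)
{
    // no matching file: handle this case instead of letting First() throw
}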
I don't understand why you're using var here, given that you're using explicit types everywhere else... especially since this var is just a string. Don't get me wrong though: I prefer seeing var here... and everywhere else.
#region Read CSV File
Using #region inside a method clearly says your method is doing more than one thing. Whenever you do that, you're missing a refactoring opportunity. Consider extracting a method here:
private string GetFileName(string formattedDateTime)
{
    return Directory.GetFiles(_path, "*.csv")
                    .AsParallel()
                    .Where(name => name.Contains(formattedDateTime))
                    .FirstOrDefault();
}
(lacking context to recommend passing in an actual DateTime)
That leaves you with a method that (hopefully) now only has this #region block, which is now completely redundant and can be discarded (if there's still more to it, extract methods as needed).
That said...
FileStream stream = new FileStream(query, FileMode.Open, FileAccess.Read, FileShare.ReadWrite);
using (StreamReader reader = new StreamReader(stream, Encoding.Default, true, 1024))
Disposing the StreamReader will also cleanly close and dispose the FileStream, but that is far from obvious unless you know it; and if Jon Skeet says it's best practice to put the FileStream in its own using block, then I can only agree with him.
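For reference, a sketch of that pattern with the original arguments:

using (var stream = new FileStream(query, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (var reader = new StreamReader(stream, Encoding.Default, true, 1024))
{
    // read from reader as before; both objects are disposed, in reverse order
}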
Your code can use a variable (declared outside the loop's body) for the value separator: you're creating a new character array for every single row you're reading!
char[] separator = new char[] { ',' };
Or more succinctly:
var separator = new[] { ',' };
This makes the temp assignment even clearer/cleaner:
string[] temp = currentLine.Split(separator, StringSplitOptions.None);
Looking at the currentLine assignment, it's far from obvious what's going on here:
while ((currentLine = reader.ReadLine()) != null)
Using the result of an assignment rarely makes for easily readable code. I like how it cleverly combines reading the next line into currentLine and verifying whether that's a null value, though... so I'm not sure I'd recommend rewriting that part after all.
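For comparison, a possible rewrite without the assignment-in-condition, at the cost of calling ReadLine in two places:

string currentLine = reader.ReadLine();
while (currentLine != null)
{
    // ... process currentLine ...
    currentLine = reader.ReadLine();
}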
It's not clear what happens to your List<string[]> when you're done reading the file, so I'll assume you're just returning it. In that case, you could consider yielding the results instead:
private IEnumerable<string[]> ReadCsv(string fileName)
{
    char[] separator = new[] { ',' };
    string currentLine;
    using (var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
    using (var reader = new StreamReader(stream, Encoding.Default, true, 1024))
    {
        while ((currentLine = reader.ReadLine()) != null)
        {
            yield return currentLine.Split(separator, StringSplitOptions.None);
        }
    }
}
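A hypothetical call site, for illustration: enumeration is lazy, so the file stays open until the caller finishes iterating, and calling ToList() reads everything up front.

// forces the whole file to be read and the reader to be disposed immediately
List<string[]> rows = ReadCsv(fileName).ToList();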
- Amazing! Thank you so much! I was really not prepared for both the quality and dedication of your post. – Aeolus, Sep 5, 2014 at 20:39
- Always a pleasure! Make sure you take @mjolka's answer into account as well, he's got very good points too :) – Mathieu Guindon, Sep 5, 2014 at 20:47
Encoding.Default is possibly not the correct encoding for the file (and may change over time). The detectEncodingFromByteOrderMarks parameter you pass to the StreamReader constructor only helps if the file starts with a byte-order mark. You should know how your files are encoded, and pass the appropriate encoding, e.g. Encoding.UTF8.
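For example, if you know the files are UTF-8 (an assumption here), the reader could be constructed like this:

using (var stream = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite))
using (var reader = new StreamReader(stream, Encoding.UTF8))
{
    // read as before; no byte-order-mark detection is needed when the encoding is known
}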
A more concise way to read the lines of a file is File.ReadLines, though it doesn't specify ReadWrite share access (which is something I'm not sure you want, either).
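A sketch of that more concise version, assuming UTF-8 content and that losing the ReadWrite share is acceptable:

// File.ReadLines streams the file lazily; Split uses the params char[] overload
List<string[]> rows = File.ReadLines(fileName, Encoding.UTF8)
                          .Select(line => line.Split(','))
                          .ToList();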
As for timing, I generated a test file of 7,000 lines, 25 entries per line, and your code runs in < 0.02 s (not including query). So my guess is that it's query that is taking most of your time. Try it without AsParallel() and see if that improves performance.
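That is, something along these lines (a sketch of the sequential version):

// sequential search; for a directory listing this small, the overhead of
// partitioning work across threads likely outweighs any gain
string fileName = Directory.GetFiles(_path, "*.csv")
                           .FirstOrDefault(name => name.Contains(dateTime));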
- Thank you for your advice. I must admit that I don't know anything about encoding or what a byte-order mark is. I require the share access: the file is being written to by someone else. – Aeolus, Sep 5, 2014 at 20:41
- @Aeolus Suppose you're writing code to read a video file. You would need to know the format of the file: is it AVI, MP4, Ogg, etc.? Similarly, when reading a text file you need to know the file's encoding: is it UTF-8, UTF-16, US-ASCII, etc.? – mjolka, Sep 8, 2014 at 4:01
- Is there an easy way to find out? My very short analysis leads to the conclusion that in 80% of all cases UTF-8 is a pretty solid guess. – Aeolus, Sep 13, 2014 at 17:03