EDIT: Thanks everyone for your answers; this is moving along quickly now, and it's always worthwhile to get better at design and best practices!
I am loading an Excel file into a dataset using ExcelDataReader. I then need to create a *.csv file with the data.
Currently the Excel sheet I'm working with has about 40 columns and 30k rows of data, and the approach I'm taking takes about 2 minutes to complete.
The second snippet is the offender.
public static Stream SaveAsCsv(Stream excelFile, string filename)
{
    MemoryStream newCSV = new MemoryStream();
    string test = excelFile.ToString();
    IExcelDataReader reader = null;
    if (filename.EndsWith(".xls"))
    {
        reader = ExcelReaderFactory.CreateBinaryReader(excelFile);
    }
    else if (filename.EndsWith(".xlsx"))
    {
        reader = ExcelReaderFactory.CreateOpenXmlReader(excelFile);
    }

    // Reader was null, so return the empty stream
    if (reader == null)
        return newCSV;

    var headers = new List<string>();
    var ds = reader.AsDataSet(new ExcelDataSetConfiguration()
    {
        ConfigureDataTable = (tableReader) => new ExcelDataTableConfiguration()
        {
            UseHeaderRow = true,
            ReadHeaderRow = rowReader =>
            {
                for (var i = 0; i < rowReader.FieldCount; i++)
                    headers.Add(Convert.ToString(rowReader.GetValue(i)));
            },
            FilterColumn = (columnReader, columnIndex) =>
                !headers[columnIndex].ToString().ToUpper().Contains("SKIP")
        }
    });

    var csvContent = string.Empty;
    int colCount = ds.Tables[0].Columns.Count;
This `for` loop is what takes two minutes to complete; everything else is not a problem:
    for (int row_no = 0; row_no < ds.Tables[0].Rows.Count; row_no++)
    {
        var arr = new List<string>();
        if (row_no == 0)
        {
            for (int i = 0; i < colCount; i++)
            {
                arr.Add(ds.Tables[0].Columns[i].ColumnName);
            }
        }
        else
        {
            object[] objarr = ds.Tables[0].Rows[row_no].ItemArray;
            arr = ((IEnumerable)objarr).Cast<object>()
                                       .Select(x => x.ToString())
                                       .ToList();
        }
        csvContent += string.Join("|", arr) + "\n";
    }

    StreamWriter csv = new StreamWriter(newCSV);
    csv.Write(csvContent);
    csv.Flush();
    csv.BaseStream.Seek(0, SeekOrigin.Begin);
    return csv.BaseStream;
}
3 Answers
You should definitely follow @Henrik Hansen's suggestion about separating your method into multiple APIs, with one exception: the save method should just save the stream. There is no need for it to return any streams; you should already know this stream from the `CreateCsv` method.
Other things that you can improve are...
String concatenation
csvContent += string.Join("|", arr) + "\n";
I'm pretty sure this is the bottleneck of your application. Strings are immutable, so in order to concatenate strings the runtime has to copy the old one and append something new to it. Usually this is not a big issue, but since you have about 40 columns and 30k rows of data, this is really a lot of copying. And the string keeps growing, so each copy operation has to handle larger and larger strings.
For frequent string manipulation, especially in loops, you should use the `StringBuilder`, which does not require copying. There are numerous pages about this topic, so if you are interested in its internals, just google for it.
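A minimal sketch of the difference (`lines` stands in for any sequence of strings; not code from the question):

// O(n^2): every += copies the entire string built so far.
var csvContent = string.Empty;
foreach (var line in lines)
{
    csvContent += line + "\n";
}

// Roughly O(n): StringBuilder appends into an internal buffer instead.
var csvBuilder = new StringBuilder();
foreach (var line in lines)
{
    csvBuilder.AppendLine(line);
}
var csvResult = csvBuilder.ToString();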
As far as strings are concerned, there is one more piece of code that isn't efficient.
arr = ((IEnumerable)objarr).Cast<object>()
                           .Select(x => x.ToString())
                           .ToList();
You're materializing `objarr` by calling `ToList`. This is not necessary at this point; `string.Join("|", arr)` can take care of enumerating it later, only once. Currently the data is iterated twice: once by `ToList` to build the list, and once more by `Join`.
The new code could look like this:
var csvBuilder = new StringBuilder();
for (int row_no = 0; row_no < ds.Tables[0].Rows.Count; row_no++)
{
    if (row_no == 0)
    {
        var headers = new List<string>();
        for (int i = 0; i < colCount; i++)
        {
            headers.Add(ds.Tables[0].Columns[i].ColumnName);
        }
        csvBuilder.AppendLine(string.Join("|", headers));
    }
    else
    {
        var items = ds.Tables[0].Rows[row_no].ItemArray;
        var values = ((IEnumerable)items).Cast<object>().Select(x => x.ToString());
        csvBuilder.AppendLine(string.Join("|", values));
    }
}
Always dispose streams
Another bad habit of yours is not disposing streams. If you call this method a couple of times, you'll be wasting a lot of memory. You should always dispose them at some point, either with the `using` statement or by calling `Dispose()` in a `try/finally` block.
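A minimal sketch of both options; `Process` is just a placeholder for whatever work is done with the stream:

using System.IO;

public static class StreamExamples
{
    private static void Process(Stream stream) { /* placeholder */ }

    public static void WithUsing(string path)
    {
        // Disposed automatically, even if Process throws.
        using (var stream = File.OpenRead(path))
        {
            Process(stream);
        }
    }

    public static void WithTryFinally(string path)
    {
        // The equivalent pattern that the using statement expands to.
        Stream stream = File.OpenRead(path);
        try
        {
            Process(stream);
        }
        finally
        {
            stream.Dispose();
        }
    }
}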
Naming
You should pay more attention to your variable names. `arr` is a terrible name because it can stand for just about anything. It harms code readability. Code should document itself; this one doesn't.
You should also be more consistent. If you name most of your variables correctly with camelCase, then don't name others with snake_case like `row_no`. Always use strong names, as precise as you can: not too long (if possible), but also not too short, and avoid abbreviations. `columnIndex` is a good example; `row_no` should be `rowIndex`. Currently it looks like more than one person was writing this code, each with a different coding style.
{}
for (var i = 0; i < rowReader.FieldCount; i++)
    headers.Add(Convert.ToString(rowReader.GetValue(i)));
Always use `{}`. You can save yourself hours of debugging, because without them it's very easy to make a mistake.
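A contrived sketch of the kind of mistake this prevents; `Log` is a hypothetical helper, not part of the question:

// Without braces only the first statement belongs to the if;
// the indentation suggests otherwise:
if (reader == null)
    Log("reader was null");
    return newCSV; // executes unconditionally

// With braces, intent and behavior match:
if (reader == null)
{
    Log("reader was null");
    return newCSV;
}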
Explicit types vs var
You mix explicit types and `var`. You use them interchangeably. This makes your code look very unprofessional and messy. Pick one and stick to it. (I suggest picking `var`.)
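For illustration, a consistent `var` version of the declarations from the posted method would be:

var newCSV = new MemoryStream();
var headers = new List<string>();
var colCount = ds.Tables[0].Columns.Count;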
- I think that the code would be even more efficient if, instead of building one huge string with `StringBuilder`, OP were to write directly into the output csv stream on every iteration. At the very least it should reduce the memory consumption for larger files. (Nikita B, Apr 20, 2018 at 7:28)
- @NikitaB oh, good catch... definitely!!! (t3chb0t, Apr 20, 2018 at 7:30)
I think I would make an API with 3 overloads in order to split up the responsibility:
public enum ExcelFormat
{
    Xls = 1,
    Xlsx = 2,
}

public static Stream CreateCsv(string excelFileName)
{
    if (string.IsNullOrWhiteSpace(excelFileName))
        throw new ArgumentNullException("excelFileName");

    ExcelFormat format;
    string extension = Path.GetExtension(excelFileName).ToUpper();

    switch (extension)
    {
        case ".XLS":
            format = ExcelFormat.Xls;
            break;
        case ".XLSX":
            format = ExcelFormat.Xlsx;
            break;
        default:
            throw new FileFormatException("Invalid File Format or File Name");
    }

    using (Stream stream = File.OpenRead(excelFileName))
    {
        return CreateCsv(stream, format);
    }
}

public static Stream CreateCsv(Stream excelStream, ExcelFormat format)
{
    if (excelStream == null)
        throw new ArgumentNullException(nameof(excelStream));

    IExcelDataReader reader = null;

    switch (format)
    {
        case ExcelFormat.Xls:
            reader = ExcelReaderFactory.CreateBinaryReader(excelStream);
            break;
        case ExcelFormat.Xlsx:
            reader = ExcelReaderFactory.CreateOpenXmlReader(excelStream);
            break;
    }

    return CreateCsv(reader);
}

public static Stream CreateCsv(IExcelDataReader reader)
{
    if (reader == null)
        throw new ArgumentNullException(nameof(reader));

    DataSet dataSet = ReadData(reader);
    return WriteData(dataSet);
}
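Hypothetical usage of the three overloads could then look like this (the containing class name `ExcelConverter` and the file paths are made up for illustration):

// From a file name; the extension selects the format:
using (Stream csv = ExcelConverter.CreateCsv(@"C:\data\report.xls"))
{
    // ... consume the csv stream ...
}

// From an already open stream, stating the format explicitly:
using (Stream excel = File.OpenRead(@"C:\data\report.xlsx"))
using (Stream csv = ExcelConverter.CreateCsv(excel, ExcelFormat.Xlsx))
{
    // ... consume the csv stream ...
}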
I think this filter isn't very reliable (one day someone has a column name containing "skip" that should not be skipped):
FilterColumn = (columnReader, columnIndex) =>
    !headers[columnIndex].ToString().ToUpper().Contains("SKIP")
If the valid column names are fairly static, I would make some kind of settings file where the valid columns are defined:
<?xml version="1.0" encoding="utf-8" ?>
<settings>
  <columns>
    <column>AAA</column>
    <column>BBB</column>
    <column>CCC</column>
    <column>DDD</column>
  </columns>
</settings>
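As a sketch of how such a settings file could feed the `FilterColumn` predicate (the file name `columns.xml` and the `LoadValidColumns` helper are assumptions, not part of the answer):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ColumnSettings
{
    // Reads the <column> elements from the settings file into a case-insensitive set.
    public static HashSet<string> LoadValidColumns(string settingsPath)
    {
        var doc = XDocument.Load(settingsPath);
        return new HashSet<string>(
            doc.Descendants("column").Select(c => (string)c),
            StringComparer.OrdinalIgnoreCase);
    }
}

// Then, when configuring the reader:
// var validColumns = ColumnSettings.LoadValidColumns("columns.xml");
// FilterColumn = (columnReader, columnIndex) =>
//     validColumns.Contains(headers[columnIndex])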
If the operation is user-interactive, then it would probably be better with a dialog or wizard where the user could select the valid columns.
About performance, I think t3chb0t has said it all :-)
Edit: I've updated the file-name overload with
using (Stream stream = File.OpenRead(excelFileName))
{
    return CreateCsv(stream, format);
}
You have a bug here, because you will always skip the first row of the datatable.
var csvContent = string.Empty;
int colCount = ds.Tables[0].Columns.Count;
for (int row_no = 0; row_no < ds.Tables[0].Rows.Count; row_no++)
{
    var arr = new List<string>();
    if (row_no == 0)
    {
        for (int i = 0; i < colCount; i++)
        {
            arr.Add(ds.Tables[0].Columns[i].ColumnName);
        }
    }
    else
    {
        object[] objarr = ds.Tables[0].Rows[row_no].ItemArray;
        arr = ((IEnumerable)objarr).Cast<object>()
                                   .Select(x => x.ToString())
                                   .ToList();
    }
    csvContent += string.Join("|", arr) + "\n";
}
In addition, why would you want to check for `row_no == 0` for each datarow? That simply means performing this check 30k times.
You should extract the retrieval of the column names to a separate method and, if wanted, into an extension method like so:
public static IEnumerable<string> RetrieveColumnNames(this DataTable dataTable)
{
    if (dataTable == null) { yield break; }

    foreach (DataColumn column in dataTable.Columns)
    {
        yield return column.ColumnName;
    }
}
This
object[] objarr = ds.Tables[0].Rows[row_no].ItemArray;
arr = ((IEnumerable)objarr).Cast<object>()
                           .Select(x => x.ToString())
                           .ToList();
is somewhat strange in itself. The `ItemArray` property of a `DataRow` is an `object[]`, and if you use `string.Join()` with an `object[]`, the `ToString()` method of each object will be called by it.
If referring to a `DataRow` or `DataColumn` by using `ds.Tables[0]` more than once, you should introduce a separate variable which just holds a reference to `ds.Tables[0]`.
Like @NikitaB mentioned in a comment on @t3chb0t's answer:

> I think that the code would be even more efficient if, instead of building one huge string with `StringBuilder`, OP were to write directly into the output csv stream on every iteration. At the very least it should reduce the memory consumption for larger files.
Putting this all together, the former loop could look like so:
StreamWriter csv = new StreamWriter(newCSV);
var dataTable = ds.Tables[0];

csv.WriteLine(string.Join("|", dataTable.RetrieveColumnNames()));

for (int rowNo = 0; rowNo < dataTable.Rows.Count; rowNo++)
{
    object[] objarr = dataTable.Rows[rowNo].ItemArray;
    csv.WriteLine(string.Join("|", objarr));
}
- Mhmm... I looked at the example a couple of times but I'm still not sure where OP is skipping the first row... could you point me to the bug? (t3chb0t, Apr 20, 2018 at 7:52)
- For `if (row_no == 0)` the OP only outputs the column names. Hence `ds.Tables[0].Rows[0]` won't be written to the output. (Heslacher, Apr 20, 2018 at 7:57)
- I believe there is no bug. OP is reading an excel table... the first row is probably column names ;-) (t3chb0t, Apr 20, 2018 at 7:58)
- I know, but that's why OP sets `UseHeaderRow = true` in the configuration. And believe me, I have tested this ;-) (Heslacher, Apr 20, 2018 at 7:59)
- Take a look at github.com/ExcelDataReader/ExcelDataReader (Heslacher, Apr 20, 2018 at 8:10)