EDIT: Thanks everyone for your answers; this is moving along quickly now, and it's always worthwhile to get better at design and best practices!
I am loading an Excel file into a dataset using ExcelDataReader. I then need to create a *.csv file with the data.
Currently the Excel sheet I'm working with has about 40 columns and 30k rows of data, and the approach I'm taking takes about 2 minutes to complete.
The second snippet is the offender.
public static Stream SaveAsCsv(Stream excelFile, string filename)
{
    MemoryStream newCSV = new MemoryStream();
    string test = excelFile.ToString();
    IExcelDataReader reader = null;
    if (filename.EndsWith(".xls"))
    {
        reader = ExcelReaderFactory.CreateBinaryReader(excelFile);
    }
    else if (filename.EndsWith(".xlsx"))
    {
        reader = ExcelReaderFactory.CreateOpenXmlReader(excelFile);
    }

    // Reader was null, so return the empty stream
    if (reader == null)
        return newCSV;

    var headers = new List<string>();
    var ds = reader.AsDataSet(new ExcelDataSetConfiguration()
    {
        ConfigureDataTable = (tableReader) => new ExcelDataTableConfiguration()
        {
            UseHeaderRow = true,
            ReadHeaderRow = rowReader =>
            {
                for (var i = 0; i < rowReader.FieldCount; i++)
                    headers.Add(Convert.ToString(rowReader.GetValue(i)));
            },
            FilterColumn = (columnReader, columnIndex) =>
                !headers[columnIndex].ToString().ToUpper().Contains("SKIP")
        }
    });

    var csvContent = string.Empty;
    int colCount = ds.Tables[0].Columns.Count;
This `for` loop is what takes two minutes to complete; everything else is not a problem:
    for (int row_no = 0; row_no < ds.Tables[0].Rows.Count; row_no++)
    {
        var arr = new List<string>();
        if (row_no == 0)
        {
            for (int i = 0; i < colCount; i++)
            {
                arr.Add(ds.Tables[0].Columns[i].ColumnName);
            }
        }
        else
        {
            object[] objarr = ds.Tables[0].Rows[row_no].ItemArray;
            arr = ((IEnumerable)objarr).Cast<object>()
                                       .Select(x => x.ToString())
                                       .ToList();
        }
        csvContent += string.Join("|", arr) + "\n";
    }

    StreamWriter csv = new StreamWriter(newCSV);
    csv.Write(csvContent);
    csv.Flush();
    csv.BaseStream.Seek(0, SeekOrigin.Begin);
    return csv.BaseStream;
}
3 Answers
You should definitely follow @Henrik Hansen's suggestion about separating your method into multiple APIs, with one exception: the save method should just save the stream. There is no need for it to return any streams; you should already know this stream from the `CreateCsv` method.
Other things that you can improve are...
String concatenation
csvContent += string.Join("|", arr) + "\n";
I'm pretty sure this is the bottleneck of your application. Strings are immutable, so in order to concatenate strings the runtime has to copy the old one and append something new to it. Usually this is not a big issue, but since you have about 40 columns and 30k rows of data, this is really a lot of copying. And the string keeps growing, so each copy operation has to handle larger and larger strings.
For frequent string manipulation, especially in loops, you should use the `StringBuilder`, which does not require copying. There are numerous pages about this topic, so if you are interested in its internals, just google for it.
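A minimal sketch of the difference (`lines` stands in for any sequence of strings; not code from the question):

// O(n^2): every += copies the entire string built so far.
var csvContent = string.Empty;
foreach (var line in lines)
{
    csvContent += line + "\n";
}

// Roughly O(n): StringBuilder appends into an internal buffer instead.
var csvBuilder = new StringBuilder();
foreach (var line in lines)
{
    csvBuilder.AppendLine(line);
}
var csvResult = csvBuilder.ToString();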
As far as strings are concerned, there is one more piece of code that isn't efficient.
arr = ((IEnumerable)objarr).Cast<object>()
                           .Select(x => x.ToString())
                           .ToList();
You're materializing `objarr` by calling `ToList`. This is not necessary at this point; `string.Join("|", arr)` can take care of enumerating it later, only once. Currently the data is iterated twice: once by `ToList` to build the list, and once more by `Join`.
The new code could look like this:
var csvBuilder = new StringBuilder();
for (int row_no = 0; row_no < ds.Tables[0].Rows.Count; row_no++)
{
    if (row_no == 0)
    {
        var headers = new List<string>();
        for (int i = 0; i < colCount; i++)
        {
            headers.Add(ds.Tables[0].Columns[i].ColumnName);
        }
        csvBuilder.AppendLine(string.Join("|", headers));
    }
    else
    {
        var items = ds.Tables[0].Rows[row_no].ItemArray;
        var values = ((IEnumerable)items).Cast<object>().Select(x => x.ToString());
        csvBuilder.AppendLine(string.Join("|", values));
    }
}
Always dispose streams
Another bad habit of yours is not disposing streams. If you call this method a couple of times, you'll be wasting a lot of memory. You should always dispose them at some point, either with the `using` statement or by calling `Dispose()` in a `try/finally` block.
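A minimal sketch of both options; `Process` is just a placeholder for whatever work is done with the stream:

using System.IO;

public static class StreamExamples
{
    private static void Process(Stream stream) { /* placeholder */ }

    public static void WithUsing(string path)
    {
        // Disposed automatically, even if Process throws.
        using (var stream = File.OpenRead(path))
        {
            Process(stream);
        }
    }

    public static void WithTryFinally(string path)
    {
        // The equivalent pattern that the using statement expands to.
        Stream stream = File.OpenRead(path);
        try
        {
            Process(stream);
        }
        finally
        {
            stream.Dispose();
        }
    }
}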
Naming
You should pay more attention to your variable names. `arr` is a terrible name because it can stand for just about anything. It harms code readability. Code should document itself; this one doesn't.
You should also be more consistent. If you name most of your variables correctly with camelCase, then don't name others with snake_case like `row_no`. Always use strong names, as precise as you can: not too long (if possible), but also not too short, and avoid abbreviations. `columnIndex` is a good example; `row_no` should be `rowIndex`. Currently it looks like more than one person was writing this code, each with a different coding style.
{}
for (var i = 0; i < rowReader.FieldCount; i++)
    headers.Add(Convert.ToString(rowReader.GetValue(i)));
Always use `{}`. You can save yourself hours of debugging, because without them it's very easy to make a mistake.
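A contrived sketch of the kind of mistake this prevents; `Log` is a hypothetical helper, not part of the question:

// Without braces only the first statement belongs to the if;
// the indentation suggests otherwise:
if (reader == null)
    Log("reader was null");
    return newCSV; // executes unconditionally

// With braces, intent and behavior match:
if (reader == null)
{
    Log("reader was null");
    return newCSV;
}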
Explicit types vs var
You mix explicit types and `var`. You use them interchangeably. This makes your code look very unprofessional and messy. Pick one and stick to it. (I suggest picking `var`.)
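For illustration, a consistent `var` version of the declarations from the posted method would be:

var newCSV = new MemoryStream();
var headers = new List<string>();
var colCount = ds.Tables[0].Columns.Count;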
- I think that the code would be even more efficient if, instead of building one huge string with `StringBuilder`, OP were to write directly into the output csv stream on every iteration. At the very least it should reduce the memory consumption for larger files. (Nikita B, Apr 20, 2018 at 7:28)
- @NikitaB oh, good catch... definitely!!! (t3chb0t, Apr 20, 2018 at 7:30)
I think I would make an API with 3 overloads in order to split up the responsibility:
public enum ExcelFormat
{
    Xls = 1,
    Xlsx = 2,
}

public static Stream CreateCsv(string excelFileName)
{
    if (string.IsNullOrWhiteSpace(excelFileName))
        throw new ArgumentNullException("excelFileName");

    ExcelFormat format;
    string extension = Path.GetExtension(excelFileName).ToUpper();

    switch (extension)
    {
        case ".XLS":
            format = ExcelFormat.Xls;
            break;
        case ".XLSX":
            format = ExcelFormat.Xlsx;
            break;
        default:
            throw new FileFormatException("Invalid File Format or File Name");
    }

    using (Stream stream = File.OpenRead(excelFileName))
    {
        return CreateCsv(stream, format);
    }
}

public static Stream CreateCsv(Stream excelStream, ExcelFormat format)
{
    if (excelStream == null)
        throw new ArgumentNullException(nameof(excelStream));

    IExcelDataReader reader = null;

    switch (format)
    {
        case ExcelFormat.Xls:
            reader = ExcelReaderFactory.CreateBinaryReader(excelStream);
            break;
        case ExcelFormat.Xlsx:
            reader = ExcelReaderFactory.CreateOpenXmlReader(excelStream);
            break;
    }

    return CreateCsv(reader);
}

public static Stream CreateCsv(IExcelDataReader reader)
{
    if (reader == null)
        throw new ArgumentNullException(nameof(reader));

    DataSet dataSet = ReadData(reader);
    return WriteData(dataSet);
}
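Hypothetical usage of the three overloads could then look like this (the containing class name `ExcelConverter` and the file paths are made up for illustration):

// From a file name; the extension selects the format:
using (Stream csv = ExcelConverter.CreateCsv(@"C:\data\report.xls"))
{
    // ... consume the csv stream ...
}

// From an already open stream, stating the format explicitly:
using (Stream excel = File.OpenRead(@"C:\data\report.xlsx"))
using (Stream csv = ExcelConverter.CreateCsv(excel, ExcelFormat.Xlsx))
{
    // ... consume the csv stream ...
}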
I think this filter isn't very reliable (one day someone has a column name containing "skip" that should not be skipped):
FilterColumn = (columnReader, columnIndex) =>
    !headers[columnIndex].ToString().ToUpper().Contains("SKIP")
If the valid column names are fairly static, I would make some kind of settings file where the valid columns are defined:
<?xml version="1.0" encoding="utf-8" ?>
<settings>
  <columns>
    <column>AAA</column>
    <column>BBB</column>
    <column>CCC</column>
    <column>DDD</column>
  </columns>
</settings>
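As a sketch of how such a settings file could feed the `FilterColumn` predicate (the file name `columns.xml` and the `LoadValidColumns` helper are assumptions, not part of the answer):

using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class ColumnSettings
{
    // Reads the <column> elements from the settings file into a case-insensitive set.
    public static HashSet<string> LoadValidColumns(string settingsPath)
    {
        var doc = XDocument.Load(settingsPath);
        return new HashSet<string>(
            doc.Descendants("column").Select(c => (string)c),
            StringComparer.OrdinalIgnoreCase);
    }
}

// Then, when configuring the reader:
// var validColumns = ColumnSettings.LoadValidColumns("columns.xml");
// FilterColumn = (columnReader, columnIndex) =>
//     validColumns.Contains(headers[columnIndex])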
If the operation is user-interactive, then it would probably be better with a dialog or wizard where the user could select the valid columns.
About performance, I think t3chb0t has said it all :-)
Edit: I've updated the file-name overload with
using (Stream stream = File.OpenRead(excelFileName))
{
    return CreateCsv(stream, format);
}
You have a bug here, because you will always skip the first row of the datatable.
var csvContent = string.Empty;
int colCount = ds.Tables[0].Columns.Count;
for (int row_no = 0; row_no < ds.Tables[0].Rows.Count; row_no++)
{
    var arr = new List<string>();
    if (row_no == 0)
    {
        for (int i = 0; i < colCount; i++)
        {
            arr.Add(ds.Tables[0].Columns[i].ColumnName);
        }
    }
    else
    {
        object[] objarr = ds.Tables[0].Rows[row_no].ItemArray;
        arr = ((IEnumerable)objarr).Cast<object>()
                                   .Select(x => x.ToString())
                                   .ToList();
    }
    csvContent += string.Join("|", arr) + "\n";
}
In addition, why would you want to check for `row_no == 0` for each datarow? That simply means performing this check 30k times.
You should extract the retrieval of the column names to a separate method and, if wanted, into an extension method like so:
public static IEnumerable<string> RetrieveColumnNames(this DataTable dataTable)
{
    if (dataTable == null) { yield break; }

    foreach (DataColumn column in dataTable.Columns)
    {
        yield return column.ColumnName;
    }
}
This
object[] objarr = ds.Tables[0].Rows[row_no].ItemArray;
arr = ((IEnumerable)objarr).Cast<object>()
                           .Select(x => x.ToString())
                           .ToList();
is somewhat strange in itself. The `ItemArray` property of a `DataRow` is an `object[]`, and if you use `string.Join()` with an `object[]`, the `ToString()` method of each object will be called by it.
If referring to a `DataRow` or `DataColumn` by using `ds.Tables[0]` more than once, you should introduce a separate variable which just holds a reference to `ds.Tables[0]`.
Like @NikitaB mentioned in a comment on @t3chb0t's answer:

> I think that the code would be even more efficient if, instead of building one huge string with `StringBuilder`, OP were to write directly into the output csv stream on every iteration. At the very least it should reduce the memory consumption for larger files.
Putting this all together, the former loop could look like so:
StreamWriter csv = new StreamWriter(newCSV);
var dataTable = ds.Tables[0];

csv.WriteLine(string.Join("|", dataTable.RetrieveColumnNames()));

for (int rowNo = 0; rowNo < dataTable.Rows.Count; rowNo++)
{
    object[] objarr = dataTable.Rows[rowNo].ItemArray;
    csv.WriteLine(string.Join("|", objarr));
}
- Mhmm... I looked at the example a couple of times but I'm still not sure where OP is skipping the first row... could you point me to the bug? (t3chb0t, Apr 20, 2018 at 7:52)
- For `if (row_no == 0)` the OP only outputs the column names. Hence `ds.Tables[0].Rows[0]` won't be written to the output. (Heslacher, Apr 20, 2018 at 7:57)
- I believe there is no bug. OP is reading an excel table... the first row is probably column names ;-) (t3chb0t, Apr 20, 2018 at 7:58)
- I know, but that's why OP sets `UseHeaderRow = true` in the configuration. And believe me, I have tested this ;-) (Heslacher, Apr 20, 2018 at 7:59)
- Take a look at github.com/ExcelDataReader/ExcelDataReader (Heslacher, Apr 20, 2018 at 8:10)