Parsing information from Excel files

Question 1

I have to parse information from several Excel files (.xls, .xlsx). The structure of the files is nearly identical except for the columns order. In this case it is # Number Name but in other cases it might be # Name Number.

Sample .xlsx file can be found here (Google Spreadsheets).

Assumptions:

Industry name: ^Industry: (.*)$. I'm fine with the current regex expression.
List of the people:
- First of all, it should be List<Person> instead of List<string>, because Person consists of "Number" and "Name", not just the name.
- my assumptions here are wrong because it is looking for the 6-digit length number and then it relies on the column order because it actually makes the assumption that "Name" is the next column to the "Number". Perhaps this should be replaced with simple regex expressions such as Number = ^[0-9]{6}$ (always 6 digits), Name = ^([a-zA-Z]+\s?\b){2,}$ (at least 2 words separated by spaces because there are some people with FirstName LastName and there are others with FirstName MiddleName LastName; however, the names can also be in cyrillic; there are names such as Anna-Maria i.e. including hypens)

I would like to get a review because the code looks so bad, it doesn't meet any principles such as Keep it Simple, Stupid (KISS), Do not repeat yourself (DRY), etc. and most importantly, it is not testable. Once I put the logic for exporting to Excel files, I would be making unit tests for the import and the parsing, which is not possible with the current code base.

var excel = new ExcelParser();
var sheet1 = excel.Import(@"test.xlsx");
var result = ParseSheet(sheet1);
Console.ReadLine();
static (string industryName, List<string> peopleNames) ParseSheet(DataTable sheet1)
{
 Console.OutputEncoding = Encoding.UTF8; // needed for the cyrillic names
 var industryRegex = new Regex("^Industry: (.*)$");
 // 1. Get Indices of industry cell and first Name in people names..
 var industryCellIndex = (-1, -1, false);
 var peopleFirstCellIndex = (-1, -1, false);
 for (var i = 0; i < sheet1.Rows.Count; i++)
 {
 for (var j = 0; j < sheet1.Columns.Count; j++)
 {
 var cell = sheet1.Rows[i][j].ToString()?.Trim();
 // match Industry
 var matches = industryRegex.Match(cell);
 if (matches.Success)
 {
 var industryName2 = matches.Groups[1];
 industryCellIndex = (i, j, true);
 break;
 }
 // the name after the first 6-digits number cell will be the first name in people records
 if (cell.Length == 6 && int.TryParse(cell, out _))
 {
 peopleFirstCellIndex = (i, j + 1, true);
 break;
 }
 }
 if (industryCellIndex.Item3 && peopleFirstCellIndex.Item3)
 break;
 }
 if (!industryCellIndex.Item3 || !peopleFirstCellIndex.Item3)
 {
 throw new Exception("Excel file is not normalized!");
 }
 // 2. retrieve the desired data
 var industryName = industryRegex.Match(sheet1.Rows[industryCellIndex.Item1][industryCellIndex.Item2].ToString()?.Trim()).Groups[1].Value;
 var peopleNames = new List<string>();
 var colIndex = peopleFirstCellIndex.Item2;
 for (var rowIndex = peopleFirstCellIndex.Item1;
 rowIndex < sheet1.Rows.Count;
 rowIndex++)
 {
 peopleNames.Add(sheet1.Rows[rowIndex][colIndex].ToString()?.Trim());
 }
 return (industryName, peopleNames);
}
public class Person
{
 public string Number { get; init; } = default!;
 public string Name { get; init; } = default!;
}
public sealed class ExcelParser
{
 public ExcelParser()
 {
 Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
 }
 public DataTable Import(string filePath)
 {
 // does file exist?
 if (!File.Exists(filePath))
 {
 throw new FileNotFoundException();
 }
 // .xls or .xlsx allowed
 var extension = new FileInfo(filePath).Extension.ToLowerInvariant();
 if (extension is not (".xls" or ".xlsx"))
 {
 throw new NotSupportedException();
 }
 // read .xls or .xlsx
 using var stream = File.Open(filePath, FileMode.Open, FileAccess.Read);
 using var reader = ExcelReaderFactory.CreateReader(stream);
 var dataSet = reader.AsDataSet(new ExcelDataSetConfiguration
 {
 ConfigureDataTable = _ => new ExcelDataTableConfiguration
 {
 UseHeaderRow = false
 }
 });
 // Sheet1
 return dataSet.Tables[0];
 }
}

Question 2

Which nuget package are you using for ExcelReaderFactory? ExcelDataReader and ExcelDataReader.DataSet?

Question 3

@PeterCsala, yup. These are the packages.

Question 4

`ExcelParser`

`Import`

Please do not use the comments to echo what you are about to do
Use the comments to capture the why or why not which can't be read from the code

// does file exist?
if (!File.Exists(filePath))
{
 throw new FileNotFoundException();
}

I would suggest a different name for this method like LoadSheet1IntoDataTable or GetSheet1AsDataTable or ...
Please try to avoid magic numbers, please prefer constants

const int SheetOneIndex = 0;
...
return dataSet.Tables[SheetOneIndex]

const int SheetOneIndex = 0;
const string Xls = ".xls", Xlsx = ".xlsx";
public DataTable GetSheet1AsDataTable(string filePath)
{
 if (!File.Exists(filePath))
 throw new FileNotFoundException();
 var extension = new FileInfo(filePath).Extension.ToLowerInvariant();
 if (extension is not (Xls or Xlsx))
 throw new NotSupportedException();
 using var stream = File.Open(filePath, FileMode.Open, FileAccess.Read);
 using var reader = ExcelReaderFactory.CreateReader(stream);
 var dataSet = reader.AsDataSet(new () { ConfigureDataTable = _ => new () { UseHeaderRow = false } });
 return dataSet.Tables[SheetOneIndex];
}

`ParseSheet`

I would suggest to create several smaller functions to make your code more legible
First, I would introduce the following two tester-doer helpers

static Regex industryRegex = new Regex("^Industry: (.*)$", RegexOptions.Compiled);
const int CaptureGroupIndex = 1;
static bool TryGetIndustryName(string cell, out string industryName)
{
 var matches = industryRegex.Match(cell);
 industryName = matches.Success ? matches.Groups[CaptureGroupIndex].ToString() : default;
 return matches.Success;
}

const int SixDigitLong = 6;
static bool TryGetFirstPersonIndex(string cell, int row, int column, out (int Row, int Column) firstPersonIndex)
{
 var isASixDigitLongInteger = cell.Length == SixDigitLong && int.TryParse(cell, out _);
 //+1 is needed to point to "Name" column rather than "Number"
 firstPersonIndex = isASixDigitLongInteger ? (row, column + 1) : default; 
 return isASixDigitLongInteger;
}

With these in our hand the first part of ParseSheet can be achieved like this
- Instead of using two triple tuple I have declared two more simple variable
- I have used two foreach loops inside of two fors
- I also made the cell checks lazy (evaluate only if we haven't found yet)

static (string IndustryName, (int Row, int Column) FirstPersonIndex) GetIndustryAndFirstPersonIndex(DataTable sheet)
{
 string industryName = default;
 (int Row, int Column) firstPersonIndex = default;
 foreach (DataRow row in sheet.Rows)
 foreach (DataColumn column in sheet.Columns)
 {
 if (industryName != default && firstPersonIndex != default)
 return (industryName, firstPersonIndex);
 var cell = row[column].ToString().Trim();
 if (industryName == default
 && TryGetIndustryName(cell, out industryName))
 break;
 if (firstPersonIndex == default
 && TryGetFirstPersonIndex(cell, sheet.Rows.IndexOf(row), sheet.Columns.IndexOf(column), out firstPersonIndex))
 break;
 }
 throw new Exception("Excel file is not normalized!");
}

The second part of the ParseSheet can be achieved with the following Linq:
- Please bear in mind the .Range requires start and count parameters not start and end

static (string industryName, List<string> peopleNames) ParseSheet(DataTable sheet)
{
 var (industryName, firstPersonIndex) = GetIndustryAndFirstPersonIndex(sheet);
 var peopleNames = Enumerable.Range(firstPersonIndex.Row, sheet.Rows.Count - firstPersonIndex.Row)
 .Select(rowIndex => sheet.Rows[rowIndex][firstPersonIndex.Column].ToString()?.Trim())
 .ToList();
 return (industryName, peopleNames);
}

Question 5

Thank you for the comprehensive answer! There are two things tho. 1) Instead of returning List<string> peopleNames, it has to be List<Person>. That's why I thought it would be better to use regex expressions because I need the name and the number as well. 2) The first person index is not always "number's column + 1" because the columns might be ordered in a different way, so the regex expression would be necessary here.

Question 6

Hi @nop, I have reviewed the code which you have provided. If that's not working as intended then we should not review it. So, is it working as expected?

Question 7

True, I'll try to do it myself or post another question. Thanks! :)

Peter Csala Peter Csala 10.7k1 gold badge16 silver badges36 bronze badges · Accepted Answer · 2022-10-03 13:22:50Z

`ExcelParser`

`Import`

Please do not use the comments to echo what you are about to do
Use the comments to capture the why or why not which can't be read from the code

// does file exist?
if (!File.Exists(filePath))
{
 throw new FileNotFoundException();
}

I would suggest a different name for this method like LoadSheet1IntoDataTable or GetSheet1AsDataTable or ...
Please try to avoid magic numbers, please prefer constants

const int SheetOneIndex = 0;
...
return dataSet.Tables[SheetOneIndex]

const int SheetOneIndex = 0;
const string Xls = ".xls", Xlsx = ".xlsx";
public DataTable GetSheet1AsDataTable(string filePath)
{
 if (!File.Exists(filePath))
 throw new FileNotFoundException();
 var extension = new FileInfo(filePath).Extension.ToLowerInvariant();
 if (extension is not (Xls or Xlsx))
 throw new NotSupportedException();
 using var stream = File.Open(filePath, FileMode.Open, FileAccess.Read);
 using var reader = ExcelReaderFactory.CreateReader(stream);
 var dataSet = reader.AsDataSet(new () { ConfigureDataTable = _ => new () { UseHeaderRow = false } });
 return dataSet.Tables[SheetOneIndex];
}

`ParseSheet`

I would suggest to create several smaller functions to make your code more legible
First, I would introduce the following two tester-doer helpers

static Regex industryRegex = new Regex("^Industry: (.*)$", RegexOptions.Compiled);
const int CaptureGroupIndex = 1;
static bool TryGetIndustryName(string cell, out string industryName)
{
 var matches = industryRegex.Match(cell);
 industryName = matches.Success ? matches.Groups[CaptureGroupIndex].ToString() : default;
 return matches.Success;
}

const int SixDigitLong = 6;
static bool TryGetFirstPersonIndex(string cell, int row, int column, out (int Row, int Column) firstPersonIndex)
{
 var isASixDigitLongInteger = cell.Length == SixDigitLong && int.TryParse(cell, out _);
 //+1 is needed to point to "Name" column rather than "Number"
 firstPersonIndex = isASixDigitLongInteger ? (row, column + 1) : default; 
 return isASixDigitLongInteger;
}

With these in our hand the first part of ParseSheet can be achieved like this
- Instead of using two triple tuple I have declared two more simple variable
- I have used two foreach loops inside of two fors
- I also made the cell checks lazy (evaluate only if we haven't found yet)

static (string IndustryName, (int Row, int Column) FirstPersonIndex) GetIndustryAndFirstPersonIndex(DataTable sheet)
{
 string industryName = default;
 (int Row, int Column) firstPersonIndex = default;
 foreach (DataRow row in sheet.Rows)
 foreach (DataColumn column in sheet.Columns)
 {
 if (industryName != default && firstPersonIndex != default)
 return (industryName, firstPersonIndex);
 var cell = row[column].ToString().Trim();
 if (industryName == default
 && TryGetIndustryName(cell, out industryName))
 break;
 if (firstPersonIndex == default
 && TryGetFirstPersonIndex(cell, sheet.Rows.IndexOf(row), sheet.Columns.IndexOf(column), out firstPersonIndex))
 break;
 }
 throw new Exception("Excel file is not normalized!");
}

The second part of the ParseSheet can be achieved with the following Linq:
- Please bear in mind the .Range requires start and count parameters not start and end

static (string industryName, List<string> peopleNames) ParseSheet(DataTable sheet)
{
 var (industryName, firstPersonIndex) = GetIndustryAndFirstPersonIndex(sheet);
 var peopleNames = Enumerable.Range(firstPersonIndex.Row, sheet.Rows.Count - firstPersonIndex.Row)
 .Select(rowIndex => sheet.Rows[rowIndex][firstPersonIndex.Column].ToString()?.Trim())
 .ToList();
 return (industryName, peopleNames);
}

Thank you for the comprehensive answer! There are two things tho. 1) Instead of returning List<string> peopleNames, it has to be List<Person>. That's why I thought it would be better to use regex expressions because I need the name and the number as well. 2) The first person index is not always "number's column + 1" because the columns might be ordered in a different way, so the regex expression would be necessary here.
Hi @nop, I have reviewed the code which you have provided. If that's not working as intended then we should not review it. So, is it working as expected?
True, I'll try to do it myself or post another question. Thanks! :)

Stack Exchange Network

Parsing information from Excel files

1 Answer 1

`ExcelParser`

`Import`

`ParseSheet`

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Parsing information from Excel files

1 Answer 1

ExcelParser

Import

ParseSheet

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

`ExcelParser`

`Import`

`ParseSheet`