I have to parse information from several Excel files (.xls, .xlsx). The structure of the files is nearly identical except for the columns order. In this case it is # Number Name
but in other cases it might be # Name Number
.
Sample .xlsx file can be found here (Google Spreadsheets).
Assumptions:
- Industry name:
^Industry: (.*)$
. I'm fine with the current regex expression. - List of the people:
- First of all, it should be
List<Person>
instead ofList<string>
, becausePerson
consists of "Number" and "Name", not just the name. - my assumptions here are wrong because it is looking for the 6-digit length number and then it relies on the column order because it actually makes the assumption that "Name" is the next column to the "Number". Perhaps this should be replaced with simple regex expressions such as Number =
^[0-9]{6}$
(always 6 digits), Name =^([a-zA-Z]+\s?\b){2,}$
(at least 2 words separated by spaces because there are some people with FirstName LastName and there are others with FirstName MiddleName LastName; however, the names can also be in cyrillic; there are names such as Anna-Maria i.e. including hypens)
- First of all, it should be
I would like to get a review because the code looks so bad, it doesn't meet any principles such as Keep it Simple, Stupid (KISS), Do not repeat yourself (DRY), etc. and most importantly, it is not testable. Once I put the logic for exporting to Excel files, I would be making unit tests for the import and the parsing, which is not possible with the current code base.
var excel = new ExcelParser();
var sheet1 = excel.Import(@"test.xlsx");
var result = ParseSheet(sheet1);
Console.ReadLine();
static (string industryName, List<string> peopleNames) ParseSheet(DataTable sheet1)
{
Console.OutputEncoding = Encoding.UTF8; // needed for the cyrillic names
var industryRegex = new Regex("^Industry: (.*)$");
// 1. Get Indices of industry cell and first Name in people names..
var industryCellIndex = (-1, -1, false);
var peopleFirstCellIndex = (-1, -1, false);
for (var i = 0; i < sheet1.Rows.Count; i++)
{
for (var j = 0; j < sheet1.Columns.Count; j++)
{
var cell = sheet1.Rows[i][j].ToString()?.Trim();
// match Industry
var matches = industryRegex.Match(cell);
if (matches.Success)
{
var industryName2 = matches.Groups[1];
industryCellIndex = (i, j, true);
break;
}
// the name after the first 6-digits number cell will be the first name in people records
if (cell.Length == 6 && int.TryParse(cell, out _))
{
peopleFirstCellIndex = (i, j + 1, true);
break;
}
}
if (industryCellIndex.Item3 && peopleFirstCellIndex.Item3)
break;
}
if (!industryCellIndex.Item3 || !peopleFirstCellIndex.Item3)
{
throw new Exception("Excel file is not normalized!");
}
// 2. retrieve the desired data
var industryName = industryRegex.Match(sheet1.Rows[industryCellIndex.Item1][industryCellIndex.Item2].ToString()?.Trim()).Groups[1].Value;
var peopleNames = new List<string>();
var colIndex = peopleFirstCellIndex.Item2;
for (var rowIndex = peopleFirstCellIndex.Item1;
rowIndex < sheet1.Rows.Count;
rowIndex++)
{
peopleNames.Add(sheet1.Rows[rowIndex][colIndex].ToString()?.Trim());
}
return (industryName, peopleNames);
}
public class Person
{
public string Number { get; init; } = default!;
public string Name { get; init; } = default!;
}
public sealed class ExcelParser
{
public ExcelParser()
{
Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
}
public DataTable Import(string filePath)
{
// does file exist?
if (!File.Exists(filePath))
{
throw new FileNotFoundException();
}
// .xls or .xlsx allowed
var extension = new FileInfo(filePath).Extension.ToLowerInvariant();
if (extension is not (".xls" or ".xlsx"))
{
throw new NotSupportedException();
}
// read .xls or .xlsx
using var stream = File.Open(filePath, FileMode.Open, FileAccess.Read);
using var reader = ExcelReaderFactory.CreateReader(stream);
var dataSet = reader.AsDataSet(new ExcelDataSetConfiguration
{
ConfigureDataTable = _ => new ExcelDataTableConfiguration
{
UseHeaderRow = false
}
});
// Sheet1
return dataSet.Tables[0];
}
}
1 Answer 1
ExcelParser
Import
- Please do not use the comments to echo what you are about to do
- Use the comments to capture the why or why not which can't be read from the code
// does file exist?
if (!File.Exists(filePath))
{
throw new FileNotFoundException();
}
- I would suggest a different name for this method like
LoadSheet1IntoDataTable
orGetSheet1AsDataTable
or ... - Please try to avoid magic numbers, please prefer constants
const int SheetOneIndex = 0;
...
return dataSet.Tables[SheetOneIndex]
const int SheetOneIndex = 0;
const string Xls = ".xls", Xlsx = ".xlsx";
public DataTable GetSheet1AsDataTable(string filePath)
{
if (!File.Exists(filePath))
throw new FileNotFoundException();
var extension = new FileInfo(filePath).Extension.ToLowerInvariant();
if (extension is not (Xls or Xlsx))
throw new NotSupportedException();
using var stream = File.Open(filePath, FileMode.Open, FileAccess.Read);
using var reader = ExcelReaderFactory.CreateReader(stream);
var dataSet = reader.AsDataSet(new () { ConfigureDataTable = _ => new () { UseHeaderRow = false } });
return dataSet.Tables[SheetOneIndex];
}
ParseSheet
- I would suggest to create several smaller functions to make your code more legible
- First, I would introduce the following two tester-doer helpers
static Regex industryRegex = new Regex("^Industry: (.*)$", RegexOptions.Compiled);
const int CaptureGroupIndex = 1;
static bool TryGetIndustryName(string cell, out string industryName)
{
var matches = industryRegex.Match(cell);
industryName = matches.Success ? matches.Groups[CaptureGroupIndex].ToString() : default;
return matches.Success;
}
const int SixDigitLong = 6;
static bool TryGetFirstPersonIndex(string cell, int row, int column, out (int Row, int Column) firstPersonIndex)
{
var isASixDigitLongInteger = cell.Length == SixDigitLong && int.TryParse(cell, out _);
//+1 is needed to point to "Name" column rather than "Number"
firstPersonIndex = isASixDigitLongInteger ? (row, column + 1) : default;
return isASixDigitLongInteger;
}
- With these in our hand the first part of
ParseSheet
can be achieved like this- Instead of using two triple tuple I have declared two more simple variable
- I have used two
foreach
loops inside of twofor
s - I also made the cell checks lazy (evaluate only if we haven't found yet)
static (string IndustryName, (int Row, int Column) FirstPersonIndex) GetIndustryAndFirstPersonIndex(DataTable sheet)
{
string industryName = default;
(int Row, int Column) firstPersonIndex = default;
foreach (DataRow row in sheet.Rows)
foreach (DataColumn column in sheet.Columns)
{
if (industryName != default && firstPersonIndex != default)
return (industryName, firstPersonIndex);
var cell = row[column].ToString().Trim();
if (industryName == default
&& TryGetIndustryName(cell, out industryName))
break;
if (firstPersonIndex == default
&& TryGetFirstPersonIndex(cell, sheet.Rows.IndexOf(row), sheet.Columns.IndexOf(column), out firstPersonIndex))
break;
}
throw new Exception("Excel file is not normalized!");
}
- The second part of the
ParseSheet
can be achieved with the following Linq:- Please bear in mind the
.Range
requiresstart
andcount
parameters notstart
andend
- Please bear in mind the
static (string industryName, List<string> peopleNames) ParseSheet(DataTable sheet)
{
var (industryName, firstPersonIndex) = GetIndustryAndFirstPersonIndex(sheet);
var peopleNames = Enumerable.Range(firstPersonIndex.Row, sheet.Rows.Count - firstPersonIndex.Row)
.Select(rowIndex => sheet.Rows[rowIndex][firstPersonIndex.Column].ToString()?.Trim())
.ToList();
return (industryName, peopleNames);
}
-
\$\begingroup\$ Thank you for the comprehensive answer! There are two things tho. 1) Instead of returning
List<string> peopleNames
, it has to beList<Person>
. That's why I thought it would be better to use regex expressions because I need the name and the number as well. 2) The first person index is not always "number's column + 1" because the columns might be ordered in a different way, so the regex expression would be necessary here. \$\endgroup\$nop– nop2022年10月03日 15:16:19 +00:00Commented Oct 3, 2022 at 15:16 -
1\$\begingroup\$ Hi @nop, I have reviewed the code which you have provided. If that's not working as intended then we should not review it. So, is it working as expected? \$\endgroup\$Peter Csala– Peter Csala2022年10月03日 15:25:30 +00:00Commented Oct 3, 2022 at 15:25
-
1\$\begingroup\$ True, I'll try to do it myself or post another question. Thanks! :) \$\endgroup\$nop– nop2022年10月03日 16:48:21 +00:00Commented Oct 3, 2022 at 16:48
ExcelReaderFactory
?ExcelDataReader
andExcelDataReader.DataSet
? \$\endgroup\$