I have built a sample Parser class on top of the csv module for basic checks that we normally need as the first step of pre-processing. I would love a bit of feedback on the code and any areas that I can improve upon in the code to make it more efficient.
import csv
from exceptions import CompulsoryHeaderCheckFailed
class CsvParser:
def __init__(self, file: str, compulsory_headers: list):
self.file = file
self.compulsory_headers = compulsory_headers
self.duplicate_rows = []
self.good_rows = []
self.bad_rows = []
self.log = []
def diff_headers(self):
reader = csv.reader(open(self.file))
header = next(reader)
return set(self.compulsory_headers) - set(header)
def passes_compulsory_header_check(self):
return True if not self.diff_headers() else False
def is_bad_row(self, line_num: int, curr_row: tuple):
if not all(curr_row):
self.bad_rows.append(({"line_num": line_num, "row": list(curr_row)}))
return True
return False
def _append_log(self, line_num: int):
self.log.append({"total_num_of_rows_processed": line_num,
"good_rows": self.good_rows,
"bad_rows": self.bad_rows,
"duplicate_rows": self.duplicate_rows,
})
@property
def row(self):
rows = csv.reader(open(self.file))
next(rows)
for _row in rows:
yield _row
@property
def col(self):
rows = csv.reader(open(self.file))
for row in rows:
return row
def _parse_rows(self):
_rows = set()
rows = csv.reader(open(self.file))
next(rows)
line_num = 1
_rows = set()
for row in rows:
line_num += 1
row = tuple(row)
if not (self.is_bad_row(line_num, row)):
if row not in _rows:
_rows.add(row)
self.good_rows.append({"line_num": line_num, "row": list(row)})
else:
self.duplicate_rows.append({"line_num": line_num,
"row": list(row)
})
self._append_log(line_num)
def process_csv(self, skip_header_check=False):
if skip_header_check:
self._parse_rows()
else:
if self.passes_compulsory_header_check():
self._parse_rows()
else:
raise CompulsoryHeaderCheckFailed("compulsory headers missing from csv")
csv_parser = CsvParser("hello.csv", compulsory_headers=["id", "student", "marks", ])
print(csv_parser.process_csv())
print(csv_parser.good_rows)
print(csv_parser.bad_rows)
print(csv_parser.duplicate_rows)
print(csv_parser.log)
-
2\$\begingroup\$ Please see What to do when someone answers . I have rolled back Rev 4 → 2. \$\endgroup\$200_success– 200_success2022年04月21日 08:54:55 +00:00Commented Apr 21, 2022 at 8:54
-
1\$\begingroup\$ @200_success you've got answering, editing, and modding on the same Q&A. There should be an achievement/badge for this. :) \$\endgroup\$chicks– chicks2022年04月25日 17:15:31 +00:00Commented Apr 25, 2022 at 17:15
2 Answers 2
Interface
The interface for this class is awkward, and it's not clear how it's meant to be used.
- Is
diff_headers()
supposed to be part of the interface? How aboutpasses_compulsory_header_check()
? Maybe they should be considered private, to be used internally byprocess_csv()
? - The
is_bad_row(line_num, curr_row)
method definitely seems like it's supposed to be a private method for internal use: how else would a caller come up with a line number and a tuple of data to pass to it? The worst aspect of this method is that even though its name suggests that it just performs a test, it actually has a side-effect of maybe appending something toself.bad_rows
. - The
_append_log()
method appends toself.log
, but considering that the "log" should contain a single item that is a summary of the outcome of parsing the entire file, why isself.log
a list at all? For that matter, why should there be a "log" at all? As your sample usage shows, you can get that information fromcsv_parser.good_rows
,csv_parser.bad_rows
, etc. — all that's missing is the line count, and you could modify the class so that the line count could be queried likewise. - It's highly unorthodox that
row()
is a property that is also a generator — especially since the name of the method is singular rather than plural. - I don't know what the
col()
method is for, but it looks entirely wrong, and you never call it. - The
process_csv()
method looks somewhat reasonable. But why is there askip_header_check
option? Couldn't the caller also effectively skip the header check by specifying no compulsory headers? It seems simpler and more Pythonic to offer just one way to accomplish a particular goal.
Design suggestion
If I had to write this class, I'd design it so that it works pretty much as a drop-in replacement for the csv.DictReader
class, such that iterating over the reader streams the "good" rows, but keeps track of the bad and duplicate rows that it encounters.
Implementation
In several places, you write reader = csv.reader(open(self.file))
, often followed by next(rows)
to skip the header row. It would be better to design the class so that it opens the file just once, and makes one linear pass through its contents. Also, a csv.DictReader
would be a better way to handle the header row.
What exactly constitutes a "bad" row? A row that has an empty field? Your criterion is all(curr_row)
, but you should be aware that if the row has fewer fields than expected, that will be considered acceptable.
In _parse_rows()
, you initialized _rows = set()
twice. To write a for
loop that maintains a count while iterating, the Python idiom is to use enumerate()
.
-
\$\begingroup\$ Thank you for the feedback. I will refactor the code with the above suggestions and do an update here with the updated code. \$\endgroup\$SDRJ– SDRJ2022年04月21日 07:17:18 +00:00Commented Apr 21, 2022 at 7:17
Based on the feedback from 200_success, here is revised version of the code
import csv
from exceptions import CompulsoryHeaderCheckFailed
class CsvParser:
def __init__(self, file: str, compulsory_headers: list):
self.file = file
self.compulsory_headers = compulsory_headers
self.duplicate_rows = []
self.good_rows = []
self.bad_rows = []
self.reader = None
self.total_lines_processed = None
self.num_of_columns = None
def _reader(self):
self.reader = csv.DictReader(open(self.file))
def _diff_headers(self):
reader = self.reader
header = reader.fieldnames
self.num_of_columns = len(header)
return set(self.compulsory_headers) - set(header)
def _passes_compulsory_header_check(self):
return True if not self._diff_headers() else False
def _parse_rows(self):
rows = self.reader
_rows = set()
line_num = None
for line_num, row in enumerate(rows, start=1):
row = tuple(row.values())
if all(row) and self.num_of_columns == len(row):
if row not in _rows:
_rows.add(row)
self.good_rows.append({"line_num": line_num, "row": list(row)})
else:
self.duplicate_rows.append({"line_num": line_num,
"row": list(row)
})
else:
self.bad_rows.append(({"line_num": line_num, "row": list(row)}))
self.total_lines_processed = line_num
def process_csv(self):
self._reader()
if not self.compulsory_headers:
self._parse_rows()
else:
if self._passes_compulsory_header_check():
self._parse_rows()
else:
raise CompulsoryHeaderCheckFailed("compulsory headers missing from csv")
csv_parser = CsvParser("hello.csv", compulsory_headers=["id", "student", "marks", ])
csv_parser.process_csv()
print(csv_parser.good_rows)
print(csv_parser.bad_rows)
print(csv_parser.duplicate_rows)
print(csv_parser.total_lines_processed)
-
\$\begingroup\$ @200_success - Thank you for the advise. I made few edits in the code. Can you please advise on the revised version ? \$\endgroup\$SDRJ– SDRJ2022年04月21日 10:37:44 +00:00Commented Apr 21, 2022 at 10:37
-
3\$\begingroup\$ Thanks for posting this code. It's a good idea to summarise which changes you made, and why - a self-answer ought to review the code, just like any other answer. \$\endgroup\$Toby Speight– Toby Speight2022年04月21日 12:25:29 +00:00Commented Apr 21, 2022 at 12:25
-
2\$\begingroup\$ If you have additional or improved code for review, you should post it as a new question instead. \$\endgroup\$Toby Speight– Toby Speight2022年04月21日 12:26:49 +00:00Commented Apr 21, 2022 at 12:26
-
\$\begingroup\$ The
if not self.compulsory_headers: ...
special case is a bad idea. Do you see how that codepath can cause a crash? If I recall correctly,DictReader
already enforces that each row contains the expected number of fields. \$\endgroup\$200_success– 200_success2022年04月21日 14:45:01 +00:00Commented Apr 21, 2022 at 14:45 -
\$\begingroup\$ @200_success - Yeah you are right. I will make the edit \$\endgroup\$SDRJ– SDRJ2022年04月22日 10:07:36 +00:00Commented Apr 22, 2022 at 10:07