CSV file parser with some checks for header fields and duplicate rows

Question 1

I have built a sample Parser class on top of the csv module for basic checks that we normally need as the first step of pre-processing. I would love a bit of feedback on the code and any areas that I can improve upon in the code to make it more efficient.

import csv
from exceptions import CompulsoryHeaderCheckFailed
class CsvParser:
 def __init__(self, file: str, compulsory_headers: list):
 self.file = file
 self.compulsory_headers = compulsory_headers
 self.duplicate_rows = []
 self.good_rows = []
 self.bad_rows = []
 self.log = []
 def diff_headers(self):
 reader = csv.reader(open(self.file))
 header = next(reader)
 return set(self.compulsory_headers) - set(header)
 def passes_compulsory_header_check(self):
 return True if not self.diff_headers() else False
 def is_bad_row(self, line_num: int, curr_row: tuple):
 if not all(curr_row):
 self.bad_rows.append(({"line_num": line_num, "row": list(curr_row)}))
 return True
 return False
 def _append_log(self, line_num: int):
 self.log.append({"total_num_of_rows_processed": line_num,
 "good_rows": self.good_rows,
 "bad_rows": self.bad_rows,
 "duplicate_rows": self.duplicate_rows,
 })
 @property
 def row(self):
 rows = csv.reader(open(self.file))
 next(rows)
 for _row in rows:
 yield _row
 @property
 def col(self):
 rows = csv.reader(open(self.file))
 for row in rows:
 return row
 def _parse_rows(self):
 _rows = set()
 rows = csv.reader(open(self.file))
 next(rows)
 line_num = 1
 _rows = set()
 for row in rows:
 line_num += 1
 row = tuple(row)
 if not (self.is_bad_row(line_num, row)):
 if row not in _rows:
 _rows.add(row)
 self.good_rows.append({"line_num": line_num, "row": list(row)})
 else:
 self.duplicate_rows.append({"line_num": line_num,
 "row": list(row)
 })
 self._append_log(line_num)
 def process_csv(self, skip_header_check=False):
 if skip_header_check:
 self._parse_rows()
 else:
 if self.passes_compulsory_header_check():
 self._parse_rows()
 else:
 raise CompulsoryHeaderCheckFailed("compulsory headers missing from csv")
csv_parser = CsvParser("hello.csv", compulsory_headers=["id", "student", "marks", ])
print(csv_parser.process_csv())
print(csv_parser.good_rows)
print(csv_parser.bad_rows)
print(csv_parser.duplicate_rows)
print(csv_parser.log)

Question 2

Please see What to do when someone answers . I have rolled back Rev 4 → 2.

Question 3

@200_success you've got answering, editing, and modding on the same Q&A. There should be an achievement/badge for this. :)

Question 4

Interface

The interface for this class is awkward, and it's not clear how it's meant to be used.

Is diff_headers() supposed to be part of the interface? How about passes_compulsory_header_check()? Maybe they should be considered private, to be used internally by process_csv()?
The is_bad_row(line_num, curr_row) method definitely seems like it's supposed to be a private method for internal use: how else would a caller come up with a line number and a tuple of data to pass to it? The worst aspect of this method is that even though its name suggests that it just performs a test, it actually has a side-effect of maybe appending something to self.bad_rows.
The _append_log() method appends to self.log, but considering that the "log" should contain a single item that is a summary of the outcome of parsing the entire file, why is self.log a list at all? For that matter, why should there be a "log" at all? As your sample usage shows, you can get that information from csv_parser.good_rows, csv_parser.bad_rows, etc. — all that's missing is the line count, and you could modify the class so that the line count could be queried likewise.
It's highly unorthodox that row() is a property that is also a generator — especially since the name of the method is singular rather than plural.
I don't know what the col() method is for, but it looks entirely wrong, and you never call it.
The process_csv() method looks somewhat reasonable. But why is there a skip_header_check option? Couldn't the caller also effectively skip the header check by specifying no compulsory headers? It seems simpler and more Pythonic to offer just one way to accomplish a particular goal.

Design suggestion

If I had to write this class, I'd design it so that it works pretty much as a drop-in replacement for the csv.DictReader class, such that iterating over the reader streams the "good" rows, but keeps track of the bad and duplicate rows that it encounters.

Implementation

In several places, you write reader = csv.reader(open(self.file)), often followed by next(rows) to skip the header row. It would be better to design the class so that it opens the file just once, and makes one linear pass through its contents. Also, a csv.DictReader would be a better way to handle the header row.

What exactly constitutes a "bad" row? A row that has an empty field? Your criterion is all(curr_row), but you should be aware that if the row has fewer fields than expected, that will be considered acceptable.

In _parse_rows(), you initialized _rows = set() twice. To write a for loop that maintains a count while iterating, the Python idiom is to use enumerate().

Question 5

Thank you for the feedback. I will refactor the code with the above suggestions and do an update here with the updated code.

Question 6

Based on the feedback from 200_success, here is revised version of the code

import csv
from exceptions import CompulsoryHeaderCheckFailed
class CsvParser:
 def __init__(self, file: str, compulsory_headers: list):
 self.file = file
 self.compulsory_headers = compulsory_headers
 self.duplicate_rows = []
 self.good_rows = []
 self.bad_rows = []
 self.reader = None
 self.total_lines_processed = None
 self.num_of_columns = None
 def _reader(self):
 self.reader = csv.DictReader(open(self.file))
 def _diff_headers(self):
 reader = self.reader
 header = reader.fieldnames
 self.num_of_columns = len(header)
 return set(self.compulsory_headers) - set(header)
 def _passes_compulsory_header_check(self):
 return True if not self._diff_headers() else False
 def _parse_rows(self):
 rows = self.reader
 _rows = set()
 line_num = None
 for line_num, row in enumerate(rows, start=1):
 row = tuple(row.values())
 if all(row) and self.num_of_columns == len(row):
 if row not in _rows:
 _rows.add(row)
 self.good_rows.append({"line_num": line_num, "row": list(row)})
 else:
 self.duplicate_rows.append({"line_num": line_num,
 "row": list(row)
 })
 else:
 self.bad_rows.append(({"line_num": line_num, "row": list(row)}))
 self.total_lines_processed = line_num
 def process_csv(self):
 self._reader()
 if not self.compulsory_headers:
 self._parse_rows()
 else:
 if self._passes_compulsory_header_check():
 self._parse_rows()
 else:
 raise CompulsoryHeaderCheckFailed("compulsory headers missing from csv")
csv_parser = CsvParser("hello.csv", compulsory_headers=["id", "student", "marks", ])
csv_parser.process_csv()
print(csv_parser.good_rows)
print(csv_parser.bad_rows)
print(csv_parser.duplicate_rows)
print(csv_parser.total_lines_processed)

Question 7

@200_success - Thank you for the advise. I made few edits in the code. Can you please advise on the revised version ?

Question 8

Thanks for posting this code. It's a good idea to summarise which changes you made, and why - a self-answer ought to review the code, just like any other answer.

Question 9

If you have additional or improved code for review, you should post it as a new question instead.

Question 10

The if not self.compulsory_headers: ... special case is a bad idea. Do you see how that codepath can cause a crash? If I recall correctly, DictReader already enforces that each row contains the expected number of fields.

Question 11

@200_success - Yeah you are right. I will make the edit

score 6 · Accepted Answer · 2022-04-21 07:10:31Z

Interface

The interface for this class is awkward, and it's not clear how it's meant to be used.

Is diff_headers() supposed to be part of the interface? How about passes_compulsory_header_check()? Maybe they should be considered private, to be used internally by process_csv()?
The is_bad_row(line_num, curr_row) method definitely seems like it's supposed to be a private method for internal use: how else would a caller come up with a line number and a tuple of data to pass to it? The worst aspect of this method is that even though its name suggests that it just performs a test, it actually has a side-effect of maybe appending something to self.bad_rows.
The _append_log() method appends to self.log, but considering that the "log" should contain a single item that is a summary of the outcome of parsing the entire file, why is self.log a list at all? For that matter, why should there be a "log" at all? As your sample usage shows, you can get that information from csv_parser.good_rows, csv_parser.bad_rows, etc. — all that's missing is the line count, and you could modify the class so that the line count could be queried likewise.
It's highly unorthodox that row() is a property that is also a generator — especially since the name of the method is singular rather than plural.
I don't know what the col() method is for, but it looks entirely wrong, and you never call it.
The process_csv() method looks somewhat reasonable. But why is there a skip_header_check option? Couldn't the caller also effectively skip the header check by specifying no compulsory headers? It seems simpler and more Pythonic to offer just one way to accomplish a particular goal.

Design suggestion

If I had to write this class, I'd design it so that it works pretty much as a drop-in replacement for the csv.DictReader class, such that iterating over the reader streams the "good" rows, but keeps track of the bad and duplicate rows that it encounters.

Implementation

In several places, you write reader = csv.reader(open(self.file)), often followed by next(rows) to skip the header row. It would be better to design the class so that it opens the file just once, and makes one linear pass through its contents. Also, a csv.DictReader would be a better way to handle the header row.

What exactly constitutes a "bad" row? A row that has an empty field? Your criterion is all(curr_row), but you should be aware that if the row has fewer fields than expected, that will be considered acceptable.

In _parse_rows(), you initialized _rows = set() twice. To write a for loop that maintains a count while iterating, the Python idiom is to use enumerate().

Thank you for the feedback. I will refactor the code with the above suggestions and do an update here with the updated code.

Stack Exchange Network

CSV file parser with some checks for header fields and duplicate rows

2 Answers 2

Interface

Design suggestion

Implementation

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

CSV file parser with some checks for header fields and duplicate rows

2 Answers 2

Interface

Design suggestion

Implementation

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions