Computing sums in some CSV files horizontally and in other CSV files vertically

Question 1

I'd like to eliminate the code duplication in the following two methods by moving the common code into a separate method callable by both. The comments indicate the blocks of code that have a differing implementation in each method.

def compute_totals_h(self, size, bad_codes):
 with open(self._in_file_path, self._FILE_READ_MODE) as f:
 # initialize reader
 reader = csv.reader(f)
 field_names = reader.next()
 for i, code in enumerate(field_names):
 code = code.strip(string.punctuation).upper()
 field_names[i] = code 
 for code in field_names:
 if (len(code) <= size) and (code not in bad_codes):
 self._totals[code] = 0
 for row in reader:
 # get totals
 for i, val in enumerate(row):
 code = field_names[i]
 if (code in self._totals):
 self._totals[code] += string_utils.to_int(val)
 self._write_totals()
def compute_totals_v(self, code_field, est_field):
 with open(self._in_file_path, self._FILE_READ_MODE) as f:
 # initialize reader
 reader = csv.DictReader(f)
 for row in reader:
 # get totals
 code = row[code_field].strip(string.punctuation).upper()
 est = string_utils.to_int(row[est_field])
 if code in self._totals:
 self._totals[code] += est
 else:
 self._totals[code] = est
 self._write_totals()

I'm thinking of a solution that has a common abstract method that can be called from compute_totals_h and compute_totals_v with each method passing functions to handle its implementation. I can't figure out how to do this while passing the arguments for each implementation correctly. It would look something like:

def compute_totals(self, initialize_reader, get_totals):
 with open(self._in_file_path, self._FILE_READ_MODE) as f:
 reader = initialize_reader(f)
 for row in reader:
 get_totals(row)
 self._write_totals()

I'll also appreciate suggestions on a better way of handling this type of code refactoring to eliminate this general class of code duplication problems.

Question 2

string_utils.to_int? Any reason to not use the builtin int?

Question 3

Welcome to Code Review! So that we may provide you with better advice, could you tell us more about the task you are trying to accomplish? What class does this code appear in? Why do you have a length limit on field names? What is code_field and est_field?

Question 4

@MathiasEttinger I'm not using int since I need to handle strings that are not numeric in a predictable way (returning 0). string_utils.to_int is a wrapper function around int which catches ValueError and TypeError exceptions.

Question 5

@200_success your edits actually capture the nature of the task. One set of CSV files has codes spread across columns another has codes appearing in rows within one column. The two methods handle the two formats. I'd like to merge their common parts. Does this help?

Question 6

It does help to know that there are multiple sets of files involved.

Question 7

Where we improve the existing code

First of, let's start by talking a bit about some issues in your code. We'll see if we can factorize some bits latter.

The first thing to note is that you iterate twice over your field_names to construct its final form and initialize self._totals. There is no need for it:

field_names = reader.next()
for i, code in enumerate(field_names):
 code = code.strip(string.punctuation).upper()
 field_names[i] = code
 if (len(code) <= size) and (code not in bad_codes):
 self._totals[code] = 0

Next, we see that you differentiate the behaviours of your sums whether self._totals[code] has been initalized or not. One way to simplify it is to turn it into a collections.defaultdict. To be more precise, you will have to define self._totals = defaultdict(int) instead of self._totals = dict(). This will allow you to write, in compute_totals_v:

for row in reader:
 # get totals
 code = row[code_field].strip(string.punctuation).upper()
 self._totals[code] += string_utils.to_int(row[est_field])

This will also allow you to write, in compute_totals_h:

for row in reader:
 # get totals
 for code, val in zip(field_names, row):
 if (len(code) <= size) and (code not in bad_codes):
 self._totals[code] += string_utils.to_int(val)

eliminating the need to initialize self._totals fields. Also, using zip here is more idiomatic. And since you are using Python 2 (I guess reader.next() would have been next(reader) in Python 3) you can use itertools.izip to avoid building a new list in memory.

We can see here that we're moving the test that initialized self._totals[code] into the "get totals" loop. But it is not efficient since we are re-doing it for each row of data. Time to improve that filtering by getting something closer to the if code in self._totals we had. The easiest way is to turn invalid codes into None and check if code is not None instead:

def compute_totals_h(self, size, bad_codes):
 with open(self._in_file_path, self._FILE_READ_MODE) as f:
 # initialize reader
 reader = csv.reader(f)
 field_names = [c if (len(c) <= size) and (c not in bad_code)
 else None
 for c in (
 code.strip(string.punctuation).upper()
 for code in reader.next())]
 for row in reader:
 # get totals
 for code, val in zip(field_names, row):
 if code is not None:
 self._totals[code] += string_utils.to_int(val)
 self._write_totals()

We now have something closer to compute_totals_v and can start to work on factorizing.

Where we get to do what you ask for

The general layout you extracted for your compute_totals method is quite close to what compute_totals_v and compute_totals_h has become. You should note, however, that compute_totals_h needs to return both reader and field_names for its initialization. So we need to return a dummy value (let's say None) in the initialization function for compute_totals_v as well. And thus, we need to accept a second parameter (the dummy value or field_names) in get_totals for both methods.

We also need to define the helper functions as inner functions of the method so we can capture the parameters of the method once and for all without having to pass them around.

# Somewhere, where you define self._totals:
# self._totals = defaultdict(int)
def compute_totals_h(self, size, bad_codes):
 def _init_reader(file_handler):
 reader = csv.reader(file_handler)
 field_names = [c if (len(c) <= size) and (c not in bad_code)
 else None
 for c in (
 code.strip(string.punctuation).upper()
 for code in reader.next())]
 return reader, field_names
 def _get_totals(row, field_names):
 for code, val in zip(field_names, row):
 if code is not None:
 self._totals[code] += string_utils.to_int(val)
 self.compute_totals(_init_reader, _get_totals)
def compute_totals_v(self, code_field, est_field):
 def _get_totals(row, *args):
 code = row[code_field].strip(string.punctuation).upper()
 self._totals[code] += string_utils.to_int(row[est_field])
 self.compute_totals(lambda f: (csv.DictReader(f), None), _get_totals)
def compute_totals(self, initialize_reader, get_totals):
 with open(self._in_file_path, self._FILE_READ_MODE) as f:
 reader, field_names = initialize_reader(f)
 for row in reader:
 get_totals(row, field_names)
 self._write_totals()

I'm not convinced, however, of the improvement of this change. Readability seems worse to me. And calling the get_totals function for each row has its slight overhead.

Question 8

This answer is brilliant and exactly what I was looking for! Also, thanks for the code improvements. I agree on the decreased readability but it seems a like a good and extensible solution to this general class of factoring problems. It also seems more concise than using abstract base classes.

score 2 · Accepted Answer · 2015-11-13 20:06:43Z

Where we improve the existing code

First of, let's start by talking a bit about some issues in your code. We'll see if we can factorize some bits latter.

The first thing to note is that you iterate twice over your field_names to construct its final form and initialize self._totals. There is no need for it:

field_names = reader.next()
for i, code in enumerate(field_names):
 code = code.strip(string.punctuation).upper()
 field_names[i] = code
 if (len(code) <= size) and (code not in bad_codes):
 self._totals[code] = 0

Next, we see that you differentiate the behaviours of your sums whether self._totals[code] has been initalized or not. One way to simplify it is to turn it into a collections.defaultdict. To be more precise, you will have to define self._totals = defaultdict(int) instead of self._totals = dict(). This will allow you to write, in compute_totals_v:

for row in reader:
 # get totals
 code = row[code_field].strip(string.punctuation).upper()
 self._totals[code] += string_utils.to_int(row[est_field])

This will also allow you to write, in compute_totals_h:

for row in reader:
 # get totals
 for code, val in zip(field_names, row):
 if (len(code) <= size) and (code not in bad_codes):
 self._totals[code] += string_utils.to_int(val)

eliminating the need to initialize self._totals fields. Also, using zip here is more idiomatic. And since you are using Python 2 (I guess reader.next() would have been next(reader) in Python 3) you can use itertools.izip to avoid building a new list in memory.

We can see here that we're moving the test that initialized self._totals[code] into the "get totals" loop. But it is not efficient since we are re-doing it for each row of data. Time to improve that filtering by getting something closer to the if code in self._totals we had. The easiest way is to turn invalid codes into None and check if code is not None instead:

def compute_totals_h(self, size, bad_codes):
 with open(self._in_file_path, self._FILE_READ_MODE) as f:
 # initialize reader
 reader = csv.reader(f)
 field_names = [c if (len(c) <= size) and (c not in bad_code)
 else None
 for c in (
 code.strip(string.punctuation).upper()
 for code in reader.next())]
 for row in reader:
 # get totals
 for code, val in zip(field_names, row):
 if code is not None:
 self._totals[code] += string_utils.to_int(val)
 self._write_totals()

We now have something closer to compute_totals_v and can start to work on factorizing.

Where we get to do what you ask for

The general layout you extracted for your compute_totals method is quite close to what compute_totals_v and compute_totals_h has become. You should note, however, that compute_totals_h needs to return both reader and field_names for its initialization. So we need to return a dummy value (let's say None) in the initialization function for compute_totals_v as well. And thus, we need to accept a second parameter (the dummy value or field_names) in get_totals for both methods.

We also need to define the helper functions as inner functions of the method so we can capture the parameters of the method once and for all without having to pass them around.

# Somewhere, where you define self._totals:
# self._totals = defaultdict(int)
def compute_totals_h(self, size, bad_codes):
 def _init_reader(file_handler):
 reader = csv.reader(file_handler)
 field_names = [c if (len(c) <= size) and (c not in bad_code)
 else None
 for c in (
 code.strip(string.punctuation).upper()
 for code in reader.next())]
 return reader, field_names
 def _get_totals(row, field_names):
 for code, val in zip(field_names, row):
 if code is not None:
 self._totals[code] += string_utils.to_int(val)
 self.compute_totals(_init_reader, _get_totals)
def compute_totals_v(self, code_field, est_field):
 def _get_totals(row, *args):
 code = row[code_field].strip(string.punctuation).upper()
 self._totals[code] += string_utils.to_int(row[est_field])
 self.compute_totals(lambda f: (csv.DictReader(f), None), _get_totals)
def compute_totals(self, initialize_reader, get_totals):
 with open(self._in_file_path, self._FILE_READ_MODE) as f:
 reader, field_names = initialize_reader(f)
 for row in reader:
 get_totals(row, field_names)
 self._write_totals()

I'm not convinced, however, of the improvement of this change. Readability seems worse to me. And calling the get_totals function for each row has its slight overhead.

This answer is brilliant and exactly what I was looking for! Also, thanks for the code improvements. I agree on the decreased readability but it seems a like a good and extensible solution to this general class of factoring problems. It also seems more concise than using abstract base classes.

Stack Exchange Network

Computing sums in some CSV files horizontally and in other CSV files vertically

1 Answer 1

Where we improve the existing code

Where we get to do what you ask for

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Computing sums in some CSV files horizontally and in other CSV files vertically

1 Answer 1

Where we improve the existing code

Where we get to do what you ask for

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions