I'd like to eliminate the code duplication in the following two methods by moving the common code into a separate method callable by both. The comments indicate the blocks of code that have a differing implementation in each method.
def compute_totals_h(self, size, bad_codes):
with open(self._in_file_path, self._FILE_READ_MODE) as f:
# initialize reader
reader = csv.reader(f)
field_names = reader.next()
for i, code in enumerate(field_names):
code = code.strip(string.punctuation).upper()
field_names[i] = code
for code in field_names:
if (len(code) <= size) and (code not in bad_codes):
self._totals[code] = 0
for row in reader:
# get totals
for i, val in enumerate(row):
code = field_names[i]
if (code in self._totals):
self._totals[code] += string_utils.to_int(val)
self._write_totals()
def compute_totals_v(self, code_field, est_field):
with open(self._in_file_path, self._FILE_READ_MODE) as f:
# initialize reader
reader = csv.DictReader(f)
for row in reader:
# get totals
code = row[code_field].strip(string.punctuation).upper()
est = string_utils.to_int(row[est_field])
if code in self._totals:
self._totals[code] += est
else:
self._totals[code] = est
self._write_totals()
I'm thinking of a solution that has a common abstract method that can be called from compute_totals_h
and compute_totals_v
with each method passing functions to handle its implementation. I can't figure out how to do this while passing the arguments for each implementation correctly. It would look something like:
def compute_totals(self, initialize_reader, get_totals):
with open(self._in_file_path, self._FILE_READ_MODE) as f:
reader = initialize_reader(f)
for row in reader:
get_totals(row)
self._write_totals()
I'll also appreciate suggestions on a better way of handling this type of code refactoring to eliminate this general class of code duplication problems.
1 Answer 1
Where we improve the existing code
First of, let's start by talking a bit about some issues in your code. We'll see if we can factorize some bits latter.
The first thing to note is that you iterate twice over your field_names
to construct its final form and initialize self._totals
. There is no need for it:
field_names = reader.next()
for i, code in enumerate(field_names):
code = code.strip(string.punctuation).upper()
field_names[i] = code
if (len(code) <= size) and (code not in bad_codes):
self._totals[code] = 0
Next, we see that you differentiate the behaviours of your sums whether self._totals[code]
has been initalized or not. One way to simplify it is to turn it into a collections.defaultdict
. To be more precise, you will have to define self._totals = defaultdict(int)
instead of self._totals = dict()
. This will allow you to write, in compute_totals_v
:
for row in reader:
# get totals
code = row[code_field].strip(string.punctuation).upper()
self._totals[code] += string_utils.to_int(row[est_field])
This will also allow you to write, in compute_totals_h
:
for row in reader:
# get totals
for code, val in zip(field_names, row):
if (len(code) <= size) and (code not in bad_codes):
self._totals[code] += string_utils.to_int(val)
eliminating the need to initialize self._totals
fields. Also, using zip
here is more idiomatic. And since you are using Python 2 (I guess reader.next()
would have been next(reader)
in Python 3) you can use itertools.izip
to avoid building a new list in memory.
We can see here that we're moving the test that initialized self._totals[code]
into the "get totals" loop. But it is not efficient since we are re-doing it for each row of data. Time to improve that filtering by getting something closer to the if code in self._totals
we had. The easiest way is to turn invalid codes into None
and check if code is not None
instead:
def compute_totals_h(self, size, bad_codes):
with open(self._in_file_path, self._FILE_READ_MODE) as f:
# initialize reader
reader = csv.reader(f)
field_names = [c if (len(c) <= size) and (c not in bad_code)
else None
for c in (
code.strip(string.punctuation).upper()
for code in reader.next())]
for row in reader:
# get totals
for code, val in zip(field_names, row):
if code is not None:
self._totals[code] += string_utils.to_int(val)
self._write_totals()
We now have something closer to compute_totals_v
and can start to work on factorizing.
Where we get to do what you ask for
The general layout you extracted for your compute_totals
method is quite close to what compute_totals_v
and compute_totals_h
has become. You should note, however, that compute_totals_h
needs to return both reader
and field_names
for its initialization. So we need to return a dummy value (let's say None
) in the initialization function for compute_totals_v
as well. And thus, we need to accept a second parameter (the dummy value or field_names
) in get_totals
for both methods.
We also need to define the helper functions as inner functions of the method so we can capture the parameters of the method once and for all without having to pass them around.
# Somewhere, where you define self._totals:
# self._totals = defaultdict(int)
def compute_totals_h(self, size, bad_codes):
def _init_reader(file_handler):
reader = csv.reader(file_handler)
field_names = [c if (len(c) <= size) and (c not in bad_code)
else None
for c in (
code.strip(string.punctuation).upper()
for code in reader.next())]
return reader, field_names
def _get_totals(row, field_names):
for code, val in zip(field_names, row):
if code is not None:
self._totals[code] += string_utils.to_int(val)
self.compute_totals(_init_reader, _get_totals)
def compute_totals_v(self, code_field, est_field):
def _get_totals(row, *args):
code = row[code_field].strip(string.punctuation).upper()
self._totals[code] += string_utils.to_int(row[est_field])
self.compute_totals(lambda f: (csv.DictReader(f), None), _get_totals)
def compute_totals(self, initialize_reader, get_totals):
with open(self._in_file_path, self._FILE_READ_MODE) as f:
reader, field_names = initialize_reader(f)
for row in reader:
get_totals(row, field_names)
self._write_totals()
I'm not convinced, however, of the improvement of this change. Readability seems worse to me. And calling the get_totals
function for each row has its slight overhead.
-
\$\begingroup\$ This answer is brilliant and exactly what I was looking for! Also, thanks for the code improvements. I agree on the decreased readability but it seems a like a good and extensible solution to this general class of factoring problems. It also seems more concise than using abstract base classes. \$\endgroup\$Stephen K. Karanja– Stephen K. Karanja2015年11月14日 08:36:21 +00:00Commented Nov 14, 2015 at 8:36
string_utils.to_int
? Any reason to not use the builtinint
? \$\endgroup\$code_field
andest_field
? \$\endgroup\$int
since I need to handle strings that are not numeric in a predictable way (returning0
).string_utils.to_int
is a wrapper function aroundint
which catchesValueError
andTypeError
exceptions. \$\endgroup\$