The issue I have with this class is that most of the methods are almost the same. I would like for this code to be more pythonic.
Note: I plan on replacing all the for
loop constructs with list comprehensions.
class Sorter(BaseCrawler):
def __init__(self, base_dir):
self.base_dir = base_dir
def by_size(self, *options):
'''
Sort files by size
'''
file_list = []
for fn in self.files(*options):
file_list.append((fn, float(os.path.getsize(fn))))
file_list.sort(key=lambda fn: fn[1])
return file_list
def by_mtime(self, *options):
'''
Sort files by modified time.
'''
file_list = []
for fn in self.files(*options):
file_list.append((fn, os.path.getmtime(fn)))
file_list.sort(key=lambda fn: fn[1])
return file_list
def by_ctime(self, *options):
'''
Sort files by creation time.
'''
file_list = []
for fn in self.files(*options):
file_list.append((fn, os.path.getctime(fn)))
file_list.sort(key=lambda fn: fn[1])
return file_list
def by_atime(self, *options):
'''
Sort files by access time.
'''
file_list = []
for fn in self.files(*options):
file_list.append((fn, os.path.getatime(fn)))
file_list.sort(key=lambda fn: fn[1])
return file_list
def last_modified(self):
'''
Return the last modified file.
'''
return self.by_mtime()[0]
def last_accessed(self):
'''
Return the last accessed file.
'''
return self.by_atime()[0]
def last_created(self):
'''
Return the last created file.
'''
return self.by_ctime()[0]
def average_fsize(self):
'''
Return the average file size.
'''
sizes = []
for name, size in c.by_size():
sizes.append(size)
return sum(sizes) / len(sizes)
def above_avg_size(self, times):
'''
Yield files that have an above average file size.
'''
avg = self.average_fsize()
for f, s in c.by_size():
if s > avg * times:
yield f, s
3 Answers 3
Removing duplication is a very worthy goal. Just "factor out" the feature-extraction functions ("f-e-f"s):
import operator
class Sorter(BaseCrawler):
def __init__(self, base_dir):
self.base_dir = base_dir
def _sort(self, fef, *options):
file_list = [(fn, fef(fn)) for fn in self.files(*options)]
return sorted(file_list, key=operator.itemgetter(1))
def by_size(self, *options):
def fef(fn): return float(os.path.getsize(fn))
return self._sort(fef, *options)
def by_mtime(self, *options):
return self._sort(os.path.getmtime, *options)
etc, etc.
Not sure why by_size
needs that float
rather than the natural int
returned by os.path.getsize
(I'd rather do the float
cast, if needed, downstream), but I've kept that dubious choice to show how best to deal with a f-e-f that's not already a plain existing function (hint: not with a lambda
:-).
Other improvements are feasible which don't deal with duplication but with efficiency. For example, consider:
def last_modified(self):
'''
Return the last modified file.
'''
return self.by_mtime()[0]
this does a lot of work (list building and sorting) that it doesn't need to do (and actually does not meet its name and docstring: it returns the file modified earliest, because the sort in by_mtime
is in ascending order). How much simpler to do...:
def last_modified(self):
'''
Return the last modified file.
'''
return max(self.files(), key=os.path.getmtime)
(this one fixes the bug; to maintain the bug, use min
in lieu of max
).
Since functions are objects too, you can create a mapping to use and replace the by_
functions all with one function.
sort_choices = { 'size': lambda fn: float(os.path.getsize(fn)),
'mtime': lambda fn: os.path.getmtime(fn),
# ...
}
def sorted_by(self, by, *options):
file_list = [(fn, sort_choices[by](fn)) for fn in self.files(*options)]
file_list.sort(key=lambda fn: fn[1])
return file_list
Whenever you find yourself writing almost the same code over and over, it's a strong clue that what you're doing might just be a special case of something more general. In this case you've got a lot of methods like that. After looking all of them over and noting how they vary, you can often reduce each one to a single call to a new generic method you'll write which can be supplied with arguments to tell which variation of the processing it needs to perform.
I've done that your sample code and decided to add a new generic function named _apply()
to perform the repetitious code. It has a leading underscore character to indicate it's a private method not meant to be called from outside the class. It's arguments are a function to call to get the path attribute, and lastly a list of options to pass on to the self.files()
method not shown.
I did the same sort of analysis on the last_modified()
, last_accessed()
, and last_created()
methods which resulted in the addition of another new generic method named _last_time()
being added. Afterwards it enabled the rewriting of the latter them methods.
That kind approach wasn't applicable to the remaining average_fsize()
and above_avg_size()
methods, but it is possible to optimize them a bit as shown.
import operator
class Sorter(BaseCrawler):
def __init__(self, base_dir):
self.base_dir = base_dir
# added
def _apply(path_attribute, *options):
'''
Get and sort path attribute using supplied function
'''
file_list = [(fn, path_attribute(fn)) for fn in self.files(*options)]
file_list.sort(key=operator.itemgetter(1)) # sort by the attributes
return file_list
# added
def _last_time(self, path_time_attribute):
'''
Get lastest path time attribute using supplied function
'''
return max(self.files(), key=path_time_attribute)
def by_size(self, *options):
'''
Sort files by size
'''
filesizes = self._apply(os.path.getsize, *options))
# convert the file sizes to float before returning
return [(fn, float(sz)) for fn, sz in filesizes)]
def by_mtime(self, *options):
'''
Sort files by modified time.
'''
return self._apply(os.path.getmtime, *options)
def by_ctime(self, *options):
'''
Sort files by creation time.
'''
return self._apply(os.path.getctime, *options)
def by_atime(self, *options):
'''
Sort files by access time.
'''
return self._apply(os.path.getatime, *options)
def last_modified(self):
'''
Return the last modified file.
'''
return self._last_time(os.path.getmtime)
def last_accessed(self):
'''
Return the last accessed file.
'''
return self._last_time(os.path.getatime)
def last_created(self):
'''
Return the last created file.
'''
return self._last_time(os.path.getctime)
def average_fsize(self):
'''
Return the average file size.
'''
filesizes = c.by_size() # list of filename, size pairs
return sum(fs[1] for fs in filesizes) / len(filesizes)
def above_avg_size(self, times):
'''
Yield files that have an above average file size.
'''
# doesn't call self.average_fsize() to avoid getting a file sizes twice
filesizes = c.by_size() # list of filename, size pairs
avg = sum(fs[1] for fs in filesizes) / len(filesizes)
for fn, sz in filesizes:
if sz > avg * times:
yield fn, sz
-
\$\begingroup\$ You can, alas, generalize too much: if you have any
options
toby_atime
, the first one will be used as argumentresult_type
and thus called as a function, likely resulting in a crash. \$\endgroup\$Alex Martelli– Alex Martelli2015年01月17日 15:18:33 +00:00Commented Jan 17, 2015 at 15:18 -
\$\begingroup\$ @Alex: Good point -- but fortunately fairly easy to fix in this instance by simply requiring callers -- which should all be defined within the class -- to always explicitly supply a value for the
result_type
argument. Despite the possibly of mistakes of this kind, wouldn't you agree that this is a valid, useful, (not to mention common approach)? It's what Author Martin Fowler, et al., call "refactoring" in their Refactoring: Improving the Design of Existing Code book. \$\endgroup\$martineau– martineau2015年01月17日 19:53:25 +00:00Commented Jan 17, 2015 at 19:53 -
\$\begingroup\$ There are many forms of refactoring, and one can go overboard. All those
None
s in almost all cases weigh down the code's legibility, and thus are an inferior alternative to the much sharper choice of just requiring callers to provide a feature extraction function as in my answer -- look at how much simpler my A's_sort
is wrt yours'_apply
! My approach ends up with one caller needing a localdef
(and it's dubious that even that is needed, as I still don't see whyint
is not just as good asfloat
there!-). Root cause: "feature extraction function" has real semantics itself... \$\endgroup\$Alex Martelli– Alex Martelli2015年01月17日 20:34:13 +00:00Commented Jan 17, 2015 at 20:34 -
\$\begingroup\$ @Alex: Yes, I see what you're saying and agree about maybe having gone too far on the abstraction by including the type conversion in it, too -- especially since it would be only be used in one somewhat nebulous situation. Removing it allows
_apply
to be simplified to point where it's pretty "legible" in my opinion -- so the superiority of what you did is less obvious IMHO. Encouraged by this, I went ahead and applied the approach to several more methods just for good measure. Thanks for your input -- all points duly noted. \$\endgroup\$martineau– martineau2015年01月18日 03:03:36 +00:00Commented Jan 18, 2015 at 3:03 -
\$\begingroup\$ Getting closer and closer to my A, I see. So for example where I have
return self._sort(os.path.getmtime, *options)
you now havereturn self._apply(os.path.getmtime, *options)
-- almost identical except that the name of the auxiliary method I used yesterday hints at the fact that it's chiefly responsible for sorting, while yours as of 15 minutes ago is extremely generic. In one other minor issue, your_apply
is three lines including a bedraggledlambda
-- mine is 33% smaller and uses faster, well-designedoperator.itemgetter
. Sorry, I still fail to see how your A is better... \$\endgroup\$Alex Martelli– Alex Martelli2015年01月18日 03:15:15 +00:00Commented Jan 18, 2015 at 3:15