Displaying sorted results of a web crawl

Question 1

The issue I have with this class is that most of the methods are almost the same. I would like for this code to be more pythonic.

Note: I plan on replacing all the for loop constructs with list comprehensions.

class Sorter(BaseCrawler):
 def __init__(self, base_dir):
 self.base_dir = base_dir
 def by_size(self, *options):
 '''
 Sort files by size
 '''
 file_list = []
 for fn in self.files(*options):
 file_list.append((fn, float(os.path.getsize(fn))))
 file_list.sort(key=lambda fn: fn[1])
 return file_list 
 def by_mtime(self, *options):
 '''
 Sort files by modified time.
 '''
 file_list = []
 for fn in self.files(*options):
 file_list.append((fn, os.path.getmtime(fn)))
 file_list.sort(key=lambda fn: fn[1])
 return file_list 
 def by_ctime(self, *options):
 '''
 Sort files by creation time.
 '''
 file_list = []
 for fn in self.files(*options):
 file_list.append((fn, os.path.getctime(fn)))
 file_list.sort(key=lambda fn: fn[1])
 return file_list 
 def by_atime(self, *options):
 '''
 Sort files by access time.
 '''
 file_list = []
 for fn in self.files(*options):
 file_list.append((fn, os.path.getatime(fn)))
 file_list.sort(key=lambda fn: fn[1])
 return file_list 
 def last_modified(self):
 '''
 Return the last modified file.
 '''
 return self.by_mtime()[0] 
 def last_accessed(self):
 '''
 Return the last accessed file.
 '''
 return self.by_atime()[0] 
 def last_created(self):
 '''
 Return the last created file.
 '''
 return self.by_ctime()[0]
 def average_fsize(self):
 '''
 Return the average file size.
 '''
 sizes = []
 for name, size in c.by_size():
 sizes.append(size)
 return sum(sizes) / len(sizes)
 def above_avg_size(self, times):
 '''
 Yield files that have an above average file size.
 '''
 avg = self.average_fsize()
 for f, s in c.by_size():
 if s > avg * times:
 yield f, s

Question 2

Removing duplication is a very worthy goal. Just "factor out" the feature-extraction functions ("f-e-f"s):

import operator
class Sorter(BaseCrawler):
 def __init__(self, base_dir):
 self.base_dir = base_dir
 def _sort(self, fef, *options):
 file_list = [(fn, fef(fn)) for fn in self.files(*options)]
 return sorted(file_list, key=operator.itemgetter(1))
 def by_size(self, *options):
 def fef(fn): return float(os.path.getsize(fn))
 return self._sort(fef, *options)
 def by_mtime(self, *options):
 return self._sort(os.path.getmtime, *options)

etc, etc.

Not sure why by_size needs that float rather than the natural int returned by os.path.getsize (I'd rather do the float cast, if needed, downstream), but I've kept that dubious choice to show how best to deal with a f-e-f that's not already a plain existing function (hint: not with a lambda:-).

Other improvements are feasible which don't deal with duplication but with efficiency. For example, consider:

def last_modified(self):
 '''
 Return the last modified file.
 '''
 return self.by_mtime()[0]

this does a lot of work (list building and sorting) that it doesn't need to do (and actually does not meet its name and docstring: it returns the file modified earliest, because the sort in by_mtime is in ascending order). How much simpler to do...:

def last_modified(self):
 '''
 Return the last modified file.
 '''
 return max(self.files(), key=os.path.getmtime)

(this one fixes the bug; to maintain the bug, use min in lieu of max).

Question 3

Since functions are objects too, you can create a mapping to use and replace the by_ functions all with one function.

sort_choices = { 'size': lambda fn: float(os.path.getsize(fn)),
 'mtime': lambda fn: os.path.getmtime(fn),
 # ...
 }
def sorted_by(self, by, *options):
 file_list = [(fn, sort_choices[by](fn)) for fn in self.files(*options)]
 file_list.sort(key=lambda fn: fn[1])
 return file_list

Question 4

Whenever you find yourself writing almost the same code over and over, it's a strong clue that what you're doing might just be a special case of something more general. In this case you've got a lot of methods like that. After looking all of them over and noting how they vary, you can often reduce each one to a single call to a new generic method you'll write which can be supplied with arguments to tell which variation of the processing it needs to perform.

I've done that your sample code and decided to add a new generic function named _apply() to perform the repetitious code. It has a leading underscore character to indicate it's a private method not meant to be called from outside the class. It's arguments are a function to call to get the path attribute, and lastly a list of options to pass on to the self.files() method not shown.

I did the same sort of analysis on the last_modified(), last_accessed(), and last_created() methods which resulted in the addition of another new generic method named _last_time() being added. Afterwards it enabled the rewriting of the latter them methods.

That kind approach wasn't applicable to the remaining average_fsize() and above_avg_size() methods, but it is possible to optimize them a bit as shown.

import operator
class Sorter(BaseCrawler):
 def __init__(self, base_dir):
 self.base_dir = base_dir
 # added
 def _apply(path_attribute, *options):
 '''
 Get and sort path attribute using supplied function
 '''
 file_list = [(fn, path_attribute(fn)) for fn in self.files(*options)]
 file_list.sort(key=operator.itemgetter(1)) # sort by the attributes
 return file_list
 # added
 def _last_time(self, path_time_attribute):
 '''
 Get lastest path time attribute using supplied function
 '''
 return max(self.files(), key=path_time_attribute)
 def by_size(self, *options):
 '''
 Sort files by size
 '''
 filesizes = self._apply(os.path.getsize, *options))
 # convert the file sizes to float before returning
 return [(fn, float(sz)) for fn, sz in filesizes)]
 def by_mtime(self, *options):
 '''
 Sort files by modified time.
 '''
 return self._apply(os.path.getmtime, *options)
 def by_ctime(self, *options):
 '''
 Sort files by creation time.
 '''
 return self._apply(os.path.getctime, *options)
 def by_atime(self, *options):
 '''
 Sort files by access time.
 '''
 return self._apply(os.path.getatime, *options)
 def last_modified(self):
 '''
 Return the last modified file.
 '''
 return self._last_time(os.path.getmtime)
 def last_accessed(self):
 '''
 Return the last accessed file.
 '''
 return self._last_time(os.path.getatime)
 def last_created(self):
 '''
 Return the last created file.
 '''
 return self._last_time(os.path.getctime)
 def average_fsize(self):
 '''
 Return the average file size.
 '''
 filesizes = c.by_size() # list of filename, size pairs
 return sum(fs[1] for fs in filesizes) / len(filesizes)
 def above_avg_size(self, times):
 '''
 Yield files that have an above average file size.
 '''
 # doesn't call self.average_fsize() to avoid getting a file sizes twice
 filesizes = c.by_size() # list of filename, size pairs
 avg = sum(fs[1] for fs in filesizes) / len(filesizes)
 for fn, sz in filesizes:
 if sz > avg * times:
 yield fn, sz

Question 5

You can, alas, generalize too much: if you have any options to by_atime, the first one will be used as argument result_type and thus called as a function, likely resulting in a crash.

Question 6

@Alex: Good point -- but fortunately fairly easy to fix in this instance by simply requiring callers -- which should all be defined within the class -- to always explicitly supply a value for the result_type argument. Despite the possibly of mistakes of this kind, wouldn't you agree that this is a valid, useful, (not to mention common approach)? It's what Author Martin Fowler, et al., call "refactoring" in their Refactoring: Improving the Design of Existing Code book.

Question 7

There are many forms of refactoring, and one can go overboard. All those Nones in almost all cases weigh down the code's legibility, and thus are an inferior alternative to the much sharper choice of just requiring callers to provide a feature extraction function as in my answer -- look at how much simpler my A's _sort is wrt yours' _apply! My approach ends up with one caller needing a local def (and it's dubious that even that is needed, as I still don't see why int is not just as good as float there!-). Root cause: "feature extraction function" has real semantics itself...

Question 8

@Alex: Yes, I see what you're saying and agree about maybe having gone too far on the abstraction by including the type conversion in it, too -- especially since it would be only be used in one somewhat nebulous situation. Removing it allows _apply to be simplified to point where it's pretty "legible" in my opinion -- so the superiority of what you did is less obvious IMHO. Encouraged by this, I went ahead and applied the approach to several more methods just for good measure. Thanks for your input -- all points duly noted.

Question 9

Getting closer and closer to my A, I see. So for example where I have return self._sort(os.path.getmtime, *options) you now have return self._apply(os.path.getmtime, *options) -- almost identical except that the name of the auxiliary method I used yesterday hints at the fact that it's chiefly responsible for sorting, while yours as of 15 minutes ago is extremely generic. In one other minor issue, your _apply is three lines including a bedraggled lambda -- mine is 33% smaller and uses faster, well-designed operator.itemgetter. Sorry, I still fail to see how your A is better...

Alex Martelli Alex Martelli 1504 bronze badges · Answer 1 · 2015-01-16 17:24:30Z

Removing duplication is a very worthy goal. Just "factor out" the feature-extraction functions ("f-e-f"s):

import operator
class Sorter(BaseCrawler):
 def __init__(self, base_dir):
 self.base_dir = base_dir
 def _sort(self, fef, *options):
 file_list = [(fn, fef(fn)) for fn in self.files(*options)]
 return sorted(file_list, key=operator.itemgetter(1))
 def by_size(self, *options):
 def fef(fn): return float(os.path.getsize(fn))
 return self._sort(fef, *options)
 def by_mtime(self, *options):
 return self._sort(os.path.getmtime, *options)

etc, etc.

Not sure why by_size needs that float rather than the natural int returned by os.path.getsize (I'd rather do the float cast, if needed, downstream), but I've kept that dubious choice to show how best to deal with a f-e-f that's not already a plain existing function (hint: not with a lambda:-).

Other improvements are feasible which don't deal with duplication but with efficiency. For example, consider:

def last_modified(self):
 '''
 Return the last modified file.
 '''
 return self.by_mtime()[0]

this does a lot of work (list building and sorting) that it doesn't need to do (and actually does not meet its name and docstring: it returns the file modified earliest, because the sort in by_mtime is in ascending order). How much simpler to do...:

def last_modified(self):
 '''
 Return the last modified file.
 '''
 return max(self.files(), key=os.path.getmtime)

(this one fixes the bug; to maintain the bug, use min in lieu of max).

Mark Ransom Mark Ransom 1,8762 gold badges12 silver badges15 bronze badges · Answer 2 · 2015-01-16 17:28:39Z

Since functions are objects too, you can create a mapping to use and replace the by_ functions all with one function.

sort_choices = { 'size': lambda fn: float(os.path.getsize(fn)),
 'mtime': lambda fn: os.path.getmtime(fn),
 # ...
 }
def sorted_by(self, by, *options):
 file_list = [(fn, sort_choices[by](fn)) for fn in self.files(*options)]
 file_list.sort(key=lambda fn: fn[1])
 return file_list

martineau martineau 2751 silver badge10 bronze badges · Answer 3 · 2015-01-16 19:23:01Z

Whenever you find yourself writing almost the same code over and over, it's a strong clue that what you're doing might just be a special case of something more general. In this case you've got a lot of methods like that. After looking all of them over and noting how they vary, you can often reduce each one to a single call to a new generic method you'll write which can be supplied with arguments to tell which variation of the processing it needs to perform.

I've done that your sample code and decided to add a new generic function named _apply() to perform the repetitious code. It has a leading underscore character to indicate it's a private method not meant to be called from outside the class. It's arguments are a function to call to get the path attribute, and lastly a list of options to pass on to the self.files() method not shown.

I did the same sort of analysis on the last_modified(), last_accessed(), and last_created() methods which resulted in the addition of another new generic method named _last_time() being added. Afterwards it enabled the rewriting of the latter them methods.

That kind approach wasn't applicable to the remaining average_fsize() and above_avg_size() methods, but it is possible to optimize them a bit as shown.

import operator
class Sorter(BaseCrawler):
 def __init__(self, base_dir):
 self.base_dir = base_dir
 # added
 def _apply(path_attribute, *options):
 '''
 Get and sort path attribute using supplied function
 '''
 file_list = [(fn, path_attribute(fn)) for fn in self.files(*options)]
 file_list.sort(key=operator.itemgetter(1)) # sort by the attributes
 return file_list
 # added
 def _last_time(self, path_time_attribute):
 '''
 Get lastest path time attribute using supplied function
 '''
 return max(self.files(), key=path_time_attribute)
 def by_size(self, *options):
 '''
 Sort files by size
 '''
 filesizes = self._apply(os.path.getsize, *options))
 # convert the file sizes to float before returning
 return [(fn, float(sz)) for fn, sz in filesizes)]
 def by_mtime(self, *options):
 '''
 Sort files by modified time.
 '''
 return self._apply(os.path.getmtime, *options)
 def by_ctime(self, *options):
 '''
 Sort files by creation time.
 '''
 return self._apply(os.path.getctime, *options)
 def by_atime(self, *options):
 '''
 Sort files by access time.
 '''
 return self._apply(os.path.getatime, *options)
 def last_modified(self):
 '''
 Return the last modified file.
 '''
 return self._last_time(os.path.getmtime)
 def last_accessed(self):
 '''
 Return the last accessed file.
 '''
 return self._last_time(os.path.getatime)
 def last_created(self):
 '''
 Return the last created file.
 '''
 return self._last_time(os.path.getctime)
 def average_fsize(self):
 '''
 Return the average file size.
 '''
 filesizes = c.by_size() # list of filename, size pairs
 return sum(fs[1] for fs in filesizes) / len(filesizes)
 def above_avg_size(self, times):
 '''
 Yield files that have an above average file size.
 '''
 # doesn't call self.average_fsize() to avoid getting a file sizes twice
 filesizes = c.by_size() # list of filename, size pairs
 avg = sum(fs[1] for fs in filesizes) / len(filesizes)
 for fn, sz in filesizes:
 if sz > avg * times:
 yield fn, sz

You can, alas, generalize too much: if you have any options to by_atime, the first one will be used as argument result_type and thus called as a function, likely resulting in a crash.
@Alex: Good point -- but fortunately fairly easy to fix in this instance by simply requiring callers -- which should all be defined within the class -- to always explicitly supply a value for the result_type argument. Despite the possibly of mistakes of this kind, wouldn't you agree that this is a valid, useful, (not to mention common approach)? It's what Author Martin Fowler, et al., call "refactoring" in their Refactoring: Improving the Design of Existing Code book.
There are many forms of refactoring, and one can go overboard. All those Nones in almost all cases weigh down the code's legibility, and thus are an inferior alternative to the much sharper choice of just requiring callers to provide a feature extraction function as in my answer -- look at how much simpler my A's _sort is wrt yours' _apply! My approach ends up with one caller needing a local def (and it's dubious that even that is needed, as I still don't see why int is not just as good as float there!-). Root cause: "feature extraction function" has real semantics itself...
@Alex: Yes, I see what you're saying and agree about maybe having gone too far on the abstraction by including the type conversion in it, too -- especially since it would be only be used in one somewhat nebulous situation. Removing it allows _apply to be simplified to point where it's pretty "legible" in my opinion -- so the superiority of what you did is less obvious IMHO. Encouraged by this, I went ahead and applied the approach to several more methods just for good measure. Thanks for your input -- all points duly noted.
Getting closer and closer to my A, I see. So for example where I have return self._sort(os.path.getmtime, *options) you now have return self._apply(os.path.getmtime, *options) -- almost identical except that the name of the auxiliary method I used yesterday hints at the fact that it's chiefly responsible for sorting, while yours as of 15 minutes ago is extremely generic. In one other minor issue, your _apply is three lines including a bedraggled lambda -- mine is 33% smaller and uses faster, well-designed operator.itemgetter. Sorry, I still fail to see how your A is better...

Stack Exchange Network

Displaying sorted results of a web crawl

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Displaying sorted results of a web crawl

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions