Is the code pythonic and easy to read and understand? Is there a better way to exclude directories? Did I miss any bugs?
'''Find large files on a linux system.

:Example:
excludes = ['/proc', '/sys', '/dev', '/swapfile']
find_larg_files('/', n_results=20, excludes=excludes)
'''
import os
import operator
from itertools import islice
import fnmatch
import re
import argparse


def take(number, iterable):
    """Return first n items of the iterable as a list.

    :param number: `int` How many do you want to take.
    :param iterable: `iterable`
    :rtype: `list`
    """
    return list(islice(iterable, number))


def walk(fpath, **kwargs):
    ''' Traverse thru a directory tree.

    :param fpath: `int` The root file path
    :param excludes: `list` optional directories to exclude
    :rtype: `generator`
    '''
    kwargs.setdefault('excludes', [])
    excludes = kwargs.get('excludes')
    # transform glob patterns to regular expressions
    excludes = r'|'.join([fnmatch.translate(x) for x in excludes]) or r'$.'
    for root, dirs, files in os.walk(fpath, topdown=1):
        # exclude dirs
        if excludes:
            dirs[:] = [os.path.join(root, d) for d in dirs]
            dirs[:] = [d for d in dirs if not re.match(excludes, d)]
        for name in files:
            yield os.path.join(root, name)


def getsize(fpath):
    ''' Return the size of a file.
    Will return 0 if an OSError is raised.

    :param fpath: `str`
    '''
    try:
        return os.path.getsize(fpath)
    except OSError:
        return 0


def find_larg_files(fpath, n_results=10, **kwargs):
    ''' Recursively find the largest files in a directory.
    return n largest files in a directory tree.

    :param fpath: `str` where to start.
    :param n_results: `int` how many results to retrun.
    :param kwargs: This will be passed to walk.
    :rtype: `None` it prints the paths and sizes to the screen.
    '''
    results = {}
    for name in walk(fpath, **kwargs):
        results[name] = getsize(name)
    results = reversed(sorted(results.items(), key=operator.itemgetter(1)))
    for name, size in take(n_results, results):
        print name, size


def main():
    parser = argparse.ArgumentParser(description='Find large files')
    parser.add_argument('path', metavar='-p', type=str, help='The path to start crawling')
    parser.add_argument(
        '--results', metavar='-n', type=int,
        help='How many results to return the default is 10.',
        default=10
    )
    parser.add_argument(
        '--exclude', metavar='-e', type=list,
        help='Directoris to exclude in the search.',
        nargs='+', default=[]
    )
    args = parser.parse_args()
    find_larg_files(args.path, n_results=args.results, excludes=args.exclude)


if __name__ == '__main__':
    main()
2 Answers
Some observations:

- Use a `main` function and ~~the `argparse` module~~ pull out an argument parsing method to make this scriptable.
- "larg" should be "large"; `getsize` should be `get_size`, `fpath` should be `file_path`. Saving characters costs a lot of money in maintainability.
- As @Edward wrote, don't keep the entire list of files in memory. As a counter-point I would have liked for the results limit to not exist - the default behaviour would then be simpler to implement, and would return every file in order of size.
- In addition to excluding directories you should make sure that you don't follow symbolic links, or you could end up with an infinite loop.
- The `OSError` in this case should be fatal - it indicates that your program is faulty since it doesn't properly filter files which you can't get the size of. I believe it could also happen if a file is not accessible, which means any result you get from this script might be inaccurate.
- Replace `**kwargs` with `excludes=None`, followed by a simple `excludes = [] if excludes is None else excludes`. That way the parameter list is unambiguous. Even easier, make `excludes` mandatory (you are always using it anyway). YAGNI & KISS apply :)
- It looks like your handling of exclusions doesn't anchor the regex at the start of the filename, so you'll end up excluding for example `/home/user/dev`.
- `excludes` in `walk` is always truthy. You should either remove the `or r'$.'` bit (best option IMO) or get rid of `if excludes:`.
- You reuse `excludes` in `walk` to create a completely different list. This is a bad practice because it makes the use and content of a variable really hard to follow. Use separate variables for these. (A sketch applying several of these points follows this list.)
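To make several of those points concrete, here is a rough sketch of how the walker might look; this is my own illustration, not code from the question or the answer, and the name `iter_files` is hypothetical. It takes a mandatory `excludes` list, keeps the compiled pattern in its own variable, matches it against the whole directory path, and never follows directory symlinks:

```python
import fnmatch
import os
import re


def iter_files(root_path, excludes):
    """Yield every file path under root_path, pruning excluded directories.

    excludes is a mandatory list of glob patterns; pass [] to exclude nothing.
    """
    # Build one regex from the glob patterns; keep it in a separate variable
    # instead of reusing `excludes` for something else.
    exclude_re = re.compile('|'.join(fnmatch.translate(p) for p in excludes)) if excludes else None

    # followlinks=False is the default, but spelling it out documents that
    # symlinked directories are never entered (no infinite loops).
    for root, dirs, files in os.walk(root_path, topdown=True, followlinks=False):
        if exclude_re is not None:
            # fullmatch() anchors the pattern to the entire directory path.
            dirs[:] = [d for d in dirs
                       if not exclude_re.fullmatch(os.path.join(root, d))]
        for name in files:
            yield os.path.join(root, name)
```

With `excludes` required there is no `**kwargs` indirection, and whatever does the ranking can consume the generator lazily.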
- I posted the wrong version, I already added argparse. – Ricky Wilson, Jul 27, 2017 at 12:35
- The `OSError` could be non-fatal, as the filename may disappear between the time of reading the directory and processing the names found therein. – Toby Speight, Jul 27, 2017 at 13:22
- @TobySpeight That's a good caveat, and should probably be a special case. I wouldn't include it in version 0.1, but it should definitely be handled by 1.0. – l0b0, Jul 27, 2017 at 13:24
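One possible way to honour that caveat (my own sketch, Python 3, not code from the thread) is to treat only a vanished file as ignorable and let every other `OSError` propagate:

```python
import os


def get_size(file_path):
    """Return the size of file_path, or None if it vanished after listing."""
    try:
        return os.path.getsize(file_path)
    except FileNotFoundError:
        # The file was removed between os.walk() listing it and this stat call.
        return None
    # Any other OSError (e.g. a permission problem) propagates to the caller,
    # so the script never silently reports inaccurate results.
```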
Some remarks:

- Separate the functions: the function doing the walking also handles the arguments. So, as @l0b0 suggests, I would have `find_larg_files` yield the results and let `main` take care of the printing. That way you can reuse it.
- I prefer `pathlib.Path` over `os.path`. That provides a lot of what you want to do out-of-the-box: `.rglob('*')` for the walking, `.is_file()` to check whether it exists and is a regular file, `.stat().st_size` for the size. (A sketch follows after this list.)
- `sorted` has a `reverse` keyword.
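Roughly how those pieces could fit together (my own sketch with the hypothetical names `iter_file_sizes` and `main`, not the answer's code):

```python
from pathlib import Path


def iter_file_sizes(root, excludes=()):
    """Yield (size, path) for every regular file under root that is not an
    excluded path and not inside an excluded directory."""
    excluded = {Path(e) for e in excludes}
    for path in Path(root).rglob('*'):
        if path in excluded or excluded.intersection(path.parents):
            continue
        try:
            if path.is_file() and not path.is_symlink():
                yield path.stat().st_size, path
        except OSError:
            # Unreadable or vanished entries are simply left out of the ranking.
            continue


def main():
    excludes = ['/proc', '/sys', '/dev', '/swapfile']
    for size, path in sorted(iter_file_sizes('/', excludes), reverse=True)[:10]:
        print(size, path)


if __name__ == '__main__':
    main()
```

Note that `rglob` cannot prune excluded directories the way `os.walk` can: it still descends into them, and the filter only drops their results afterwards.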
Just out of curiosity and because I might need something like this later, I made a `SortedLimitedMapping`:
class SortedLimitedMapping:
    def __init__(self, items, maxlen, descending=True):
        """
        Some kind of sorted mapping with a maximum length.

        Sort order is standard sort-order of the (key, value) tuples
        Duplicate keys are permitted, as long as the corresponding values are sortable
        items can be a dict or a collection of (key, value) tuples
        """
        self._maxlen = maxlen
        self.descending = descending
        self._data = []
        self.add_items(items)

    def __setitem__(self, key, value):
        # print(f'adding {key, value}')
        data_to_add = (key, value)
        if len(self._data) < self.maxlen:
            self._data.append(data_to_add)
        else:
            old_min = min(self._data) if self.descending else max(self._data)
            if (old_min < data_to_add) == self.descending:
                new_data = self._data.copy()
                new_data.remove(old_min)
                new_data.append(data_to_add)
                try:
                    sorted(new_data)
                    self._data = new_data
                except TypeError as e:
                    raise ValueError(f'can not sort {data_to_add} in {self}') from e

    def add_items(self, items):
        if isinstance(items, dict):
            items = items.items()
        for key, value in items:
            self[key] = value

    @property
    def maxlen(self):
        return self._maxlen

    @maxlen.setter
    def maxlen(self, maxlen):
        self._data = self.data[:maxlen]
        self._maxlen = maxlen

    def __iter__(self):
        for key, value in self.data:
            yield key, value

    def __repr__(self):
        return f'SortedLimitedMapping({self.data}, maxlen={self.maxlen}, descending={self.descending})'

    @property
    def data(self):
        return sorted(self._data, reverse=self.descending)

    @property
    def keys(self):
        return (i[0] for i in self.data)

    @property
    def values(self):
        return (i[1] for i in self.data)

    def copy(self):
        return SortedLimitedMapping(items=self.data, maxlen=self.maxlen, descending=self.descending)

    def __copy__(self):
        return self.copy()

    def __eq__(self, other):
        if not isinstance(other, SortedLimitedMapping):
            raise TypeError(f'SortedLimitedMapping only compares to SortedLimitedMapping, not to {type(other)}')
        if self.maxlen != other.maxlen:
            return False
        if self.descending != other.descending:
            return False
        return sorted(self._data) == sorted(other._data)

    def __len__(self):
        return len(self._data)
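A possible usage sketch (my own, assuming the question's `walk` and `getsize` are in scope): key the mapping on size so the `(key, value)` tuples sort by size, and only the current top entries are ever kept in memory.

```python
# Keep only the ten largest files seen so far while scanning the tree.
top_files = SortedLimitedMapping({}, maxlen=10)
for path in walk('/', excludes=['/proc', '/sys', '/dev', '/swapfile']):
    top_files[getsize(path)] = path   # key=size, value=path

for size, path in top_files:          # iterates largest-first (descending=True)
    print(size, path)
```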
- Rather than sorting everything and then taking the largest `n` from the complete list, an alternative approach would be to only keep the largest `n` as you go. For really huge collections of files this could make a difference.
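That incremental approach is essentially what `heapq.nlargest` provides out of the box; a minimal sketch (my own, again reusing the question's `walk` and `getsize`):

```python
import heapq

# nlargest keeps a small internal heap of at most 10 entries instead of
# sorting every (size, path) pair it has seen.
largest = heapq.nlargest(
    10,
    ((getsize(path), path) for path in walk('/', excludes=['/proc', '/sys', '/dev'])),
)
for size, path in largest:
    print(size, path)
```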