Is the code pythonic and easy to read and understand? Is there a better way to exclude directories? Did I miss any bugs?
'''Find large files on a linux system.

:Example:
excludes = ['/proc', '/sys', '/dev', '/swapfile']
find_larg_files('/', n_results=20, excludes=excludes)
'''
import os
import operator
from itertools import islice
import fnmatch
import re
import argparse


def take(number, iterable):
    """Return first n items of the iterable as a list.

    :param number: `int` How many do you want to take.
    :param iterable: `iterable`
    :rtype: `list`
    """
    return list(islice(iterable, number))


def walk(fpath, **kwargs):
    ''' Traverse thru a directory tree.

    :param fpath: `int` The root file path
    :param excludes: `list` optional directories to exclude
    :rtype: `generator`
    '''
    kwargs.setdefault('excludes', [])
    excludes = kwargs.get('excludes')
    # transform glob patterns to regular expressions
    excludes = r'|'.join([fnmatch.translate(x) for x in excludes]) or r'$.'
    for root, dirs, files in os.walk(fpath, topdown=1):
        # exclude dirs
        if excludes:
            dirs[:] = [os.path.join(root, d) for d in dirs]
            dirs[:] = [d for d in dirs if not re.match(excludes, d)]
        for name in files:
            yield os.path.join(root, name)


def getsize(fpath):
    ''' Return the size of a file.
    Will return 0 if an OSError is raised.

    :param fpath: `str`
    '''
    try:
        return os.path.getsize(fpath)
    except OSError:
        return 0


def find_larg_files(fpath, n_results=10, **kwargs):
    ''' Recursively find the largest files in a directory.
    return n largest files in a directory tree.

    :param fpath: `str` where to start.
    :param n_results: `int` how many results to retrun.
    :param kwargs: This will be passed to walk.
    :rtype: `None` it prints the paths and sizes to the screen.
    '''
    results = {}
    for name in walk(fpath, **kwargs):
        results[name] = getsize(name)
    results = reversed(sorted(results.items(), key=operator.itemgetter(1)))
    for name, size in take(n_results, results):
        print name, size


def main():
    parser = argparse.ArgumentParser(description='Find large files')
    parser.add_argument('path', metavar='-p', type=str, help='The path to start crawling')
    parser.add_argument(
        '--results', metavar='-n', type=int,
        help='How many results to return the default is 10.',
        default=10
    )
    parser.add_argument(
        '--exclude', metavar='-e', type=list,
        help='Directoris to exclude in the search.',
        nargs='+', default=[]
    )
    args = parser.parse_args()
    find_larg_files(args.path, n_results=args.results, excludes=args.exclude)


if __name__ == '__main__':
    main()
2 Answers
Some observations:

- Use a `main` function and ~~the `argparse` module~~ pull out an argument parsing method to make this scriptable.
- "larg" should be "large"; `getsize` should be `get_size`, `fpath` should be `file_path`. Saving characters costs a lot of money in maintainability.
- As @Edward wrote, don't keep the entire list of files in memory. As a counter-point I would have liked for the results limit to not exist - the default behaviour would then be simpler to implement, and would return every file in order of size.
- In addition to excluding directories you should make sure that you don't follow symbolic links, or you could end up with an infinite loop.
- The `OSError` in this case should be fatal - it indicates that your program is faulty since it doesn't properly filter files which you can't get the size of. I believe it could also happen if a file is not accessible, which means any result you get from this script might be inaccurate.
- Replace `**kwargs` with `excludes=None`, followed by a simple `excludes = [] if excludes is None else excludes`. That way the parameter list is unambiguous. Even easier, make `excludes` mandatory (you are always using it anyway). YAGNI & KISS apply :)
- It looks like your handling of exclusions doesn't anchor the regex at the start of the filename, so you'll end up excluding for example `/home/user/dev`.
- `excludes` in `walk` is always truthy. You should either remove the `or r'$.'` bit (best option IMO) or get rid of `if excludes:`.
- You reuse `excludes` in `walk` to create a completely different list. This is a bad practice because it makes the use and content of a variable really hard to follow. Use separate variables for these. (A sketch applying several of these points follows this list.)
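To make several of those points concrete, here is a rough sketch of how the walker might look; this is my own illustration, not code from the question or the answer, and the name `iter_files` is hypothetical. It takes a mandatory `excludes` list, keeps the compiled pattern in its own variable, matches it against the whole directory path, and never follows directory symlinks:

```python
import fnmatch
import os
import re


def iter_files(root_path, excludes):
    """Yield every file path under root_path, pruning excluded directories.

    excludes is a mandatory list of glob patterns; pass [] to exclude nothing.
    """
    # Build one regex from the glob patterns; keep it in a separate variable
    # instead of reusing `excludes` for something else.
    exclude_re = re.compile('|'.join(fnmatch.translate(p) for p in excludes)) if excludes else None

    # followlinks=False is the default, but spelling it out documents that
    # symlinked directories are never entered (no infinite loops).
    for root, dirs, files in os.walk(root_path, topdown=True, followlinks=False):
        if exclude_re is not None:
            # fullmatch() anchors the pattern to the entire directory path.
            dirs[:] = [d for d in dirs
                       if not exclude_re.fullmatch(os.path.join(root, d))]
        for name in files:
            yield os.path.join(root, name)
```

With `excludes` required there is no `**kwargs` indirection, and whatever does the ranking can consume the generator lazily.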
- I posted the wrong version, I already added argparse. – Ricky Wilson, Jul 27, 2017 at 12:35
- The `OSError` could be non-fatal, as the filename may disappear between the time of reading the directory and processing the names found therein. – Toby Speight, Jul 27, 2017 at 13:22
- @TobySpeight That's a good caveat, and should probably be a special case. I wouldn't include it in version 0.1, but it should definitely be handled by 1.0. – l0b0, Jul 27, 2017 at 13:24
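One possible way to honour that caveat (my own sketch, Python 3, not code from the thread) is to treat only a vanished file as ignorable and let every other `OSError` propagate:

```python
import os


def get_size(file_path):
    """Return the size of file_path, or None if it vanished after listing."""
    try:
        return os.path.getsize(file_path)
    except FileNotFoundError:
        # The file was removed between os.walk() listing it and this stat call.
        return None
    # Any other OSError (e.g. a permission problem) propagates to the caller,
    # so the script never silently reports inaccurate results.
```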
Some remarks:

- Separate the functions: the function doing the walking also handles the arguments. So, as @l0b0 suggests, I would have `find_larg_files` yield the results and let `main` take care of the printing. That way you can reuse it.
- I prefer `pathlib.Path` over `os.path`. That provides a lot of what you want to do out-of-the-box: `.rglob('*')` for the walking, `.is_file()` to check whether it exists and is a regular file, `.stat().st_size` for the size. (A sketch follows after this list.)
- `sorted` has a `reverse` keyword.
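Roughly how those pieces could fit together (my own sketch with the hypothetical names `iter_file_sizes` and `main`, not the answer's code):

```python
from pathlib import Path


def iter_file_sizes(root, excludes=()):
    """Yield (size, path) for every regular file under root that is not an
    excluded path and not inside an excluded directory."""
    excluded = {Path(e) for e in excludes}
    for path in Path(root).rglob('*'):
        if path in excluded or excluded.intersection(path.parents):
            continue
        try:
            if path.is_file() and not path.is_symlink():
                yield path.stat().st_size, path
        except OSError:
            # Unreadable or vanished entries are simply left out of the ranking.
            continue


def main():
    excludes = ['/proc', '/sys', '/dev', '/swapfile']
    for size, path in sorted(iter_file_sizes('/', excludes), reverse=True)[:10]:
        print(size, path)


if __name__ == '__main__':
    main()
```

Note that `rglob` cannot prune excluded directories the way `os.walk` can: it still descends into them, and the filter only drops their results afterwards.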
Just out of curiosity and because I might need something like this later, I made a `SortedLimitedMapping`:
class SortedLimitedMapping:
    def __init__(self, items, maxlen, descending=True):
        """
        Some kind of sorted mapping with a maximum length.

        Sort order is standard sort-order of the (key, value) tuples
        Duplicate keys are permitted, as long as the corresponding values are sortable
        items can be a dict or a collection of (key, value) tuples
        """
        self._maxlen = maxlen
        self.descending = descending
        self._data = []
        self.add_items(items)

    def __setitem__(self, key, value):
        # print(f'adding {key, value}')
        data_to_add = (key, value)
        if len(self._data) < self.maxlen:
            self._data.append(data_to_add)
        else:
            old_min = min(self._data) if self.descending else max(self._data)
            if (old_min < data_to_add) == self.descending:
                new_data = self._data.copy()
                new_data.remove(old_min)
                new_data.append(data_to_add)
                try:
                    sorted(new_data)
                    self._data = new_data
                except TypeError as e:
                    raise ValueError(f'can not sort {data_to_add} in {self}') from e

    def add_items(self, items):
        if isinstance(items, dict):
            items = items.items()
        for key, value in items:
            self[key] = value

    @property
    def maxlen(self):
        return self._maxlen

    @maxlen.setter
    def maxlen(self, maxlen):
        self._data = self.data[:maxlen]
        self._maxlen = maxlen

    def __iter__(self):
        for key, value in self.data:
            yield key, value

    def __repr__(self):
        return f'SortedLimitedMapping({self.data}, maxlen={self.maxlen}, descending={self.descending})'

    @property
    def data(self):
        return sorted(self._data, reverse=self.descending)

    @property
    def keys(self):
        return (i[0] for i in self.data)

    @property
    def values(self):
        return (i[1] for i in self.data)

    def copy(self):
        return SortedLimitedMapping(items=self.data, maxlen=self.maxlen, descending=self.descending)

    def __copy__(self):
        return self.copy()

    def __eq__(self, other):
        if not isinstance(other, SortedLimitedMapping):
            raise TypeError(f'SortedLimitedMapping only compares to SortedLimitedMapping, not to {type(other)}')
        if self.maxlen != other.maxlen:
            return False
        if self.descending != other.descending:
            return False
        return sorted(self._data) == sorted(other._data)

    def __len__(self):
        return len(self._data)
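A possible usage sketch (my own, assuming the question's `walk` and `getsize` are in scope): key the mapping on size so the `(key, value)` tuples sort by size, and only the current top entries are ever kept in memory.

```python
# Keep only the ten largest files seen so far while scanning the tree.
top_files = SortedLimitedMapping({}, maxlen=10)
for path in walk('/', excludes=['/proc', '/sys', '/dev', '/swapfile']):
    top_files[getsize(path)] = path   # key=size, value=path

for size, path in top_files:          # iterates largest-first (descending=True)
    print(size, path)
```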
- Rather than sorting everything and then taking the largest `n` from the complete list, an alternative approach would be to only keep the largest `n` as you go. For really huge collections of files this could make a difference.
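That incremental approach is essentially what `heapq.nlargest` provides out of the box; a minimal sketch (my own, again reusing the question's `walk` and `getsize`):

```python
import heapq

# nlargest keeps a small internal heap of at most 10 entries instead of
# sorting every (size, path) pair it has seen.
largest = heapq.nlargest(
    10,
    ((getsize(path), path) for path in walk('/', excludes=['/proc', '/sys', '/dev'])),
)
for size, path in largest:
    print(size, path)
```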