Note: this code will be run using portable WinPython 3.7.6.0 (Python 3.7.6, package list here) on Win10x64. Installing new packages or upgrading the Python version are not an option for this project.
Disclosure: this code is heavily inspired by Janos' excellent answer here.
I've written a piece of crap that's going to grow a lot over time, so I thought I'd get the base reviewed instead of posting everything at once. After all, at this stage I can still easily scrap half of it without problems. Current version has already been a great help.
Situation
At work, we keep a lot of datasheets for the parts we use. The library of parts keeps growing and the datasheets are updated by the manufacturers without notice. For some of our parts we're obliged to keep the documentation in order, and to prevent accidentally missing relevant docs we keep the datasheets of everything even remotely electrical (including push-buttons, LEDs, etc.). Oh, and in 3 languages per part. Which means I'm talking about a couple thousand documents for some of our suppliers.
Directory structure looks like this:
|> current lib
    |> supplier1
        part1_de.pdf
        part1_en.pdf
        part1_nl.pdf
        part2_de.pdf
        part2_en.pdf
        part2_nl.pdf
    |> supplier2
Etc.
So if we have a 6ES7132-6BF01-0BA0 from Siemens, we have:
|> current lib
    |> Siemens
        6ES7132-6BF01-0BA0_de.pdf
        6ES7132-6BF01-0BA0_en.pdf
        6ES7132-6BF01-0BA0_nl.pdf
There are a couple of things that can go wrong with this:
- Missing files (only _de available instead of all 3).
- Files not signed (6ES71326BF010BA0_de.pdf instead of 6ES7132-6BF01-0BA0_de.pdf).
- Invalid language-separator signs (older files have the language separated by - instead of _).
- Old language (older Dutch files are named _ne instead of _nl).
- Incorrect language (we're not interested in _it).
- Incorrect/missing extension (all files should be .pdf).
- Outdated documents (the older a file gets, the more likely the data has been adjusted).
This has to be pretty exact, since the files are externally referenced from an unfortunately hard-coded list. Back-ups are taken care of, so the directory only has to contain recent files without worrying about the archive.
I'm still working on the first 2 points of that list (I can check against a CSV containing all the items), but that's for the next version. Those are out of scope for this question, but perhaps something to keep in mind. The rest is covered in this one by explicitly white-listing the last couple of characters.
Approach
Set a couple of globals, create a white-list, iterate over the files and classify each file as either 'Correct', 'Invalid' or 'OutOfDate' (the latter based on modification time (mtime); the creation date is unreliable). Return both the results and the total count of checked files. Print in a readable format.
Currently both result and count are used for manual reference; later the result will be passed to the next function. One supplier will be checked at a time. Current code considers files older than 90 days out-of-date. The actual list of suppliers is 20+ entries. I'm currently not automatically iterating over all suppliers because different standards apply to some of them. Those are exempt (come to think of it, perhaps I should put those on a blacklist and iterate over the list, but that still seems risky). Current code works as expected for this revision.
Code
import os
import pprint
import time
MAX_AGE = 90
# 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day
MINUTES = 60
HOURS = 60*MINUTES
DAYS = 24*HOURS
PATH_TO_FILES = "path/to/local/copy/of/lib/"
EXT = ".pdf"
WHITELIST = ['_nl' + EXT, '_en' + EXT, '_de' + EXT]
SUPPLIERS = ['Siemens', 'IFM']
def classify_files(basedir, limit):
"""
Classify files in *basedir* not modified within *limit*
:param basedir: directory to search
    :param limit: time limit for threshold
Return result (dict) and count (int).
"""
report = {
'Correct' : [],
'Invalid' : [],
'OutOfDate': []
}
mod_time_limit = time.time() - limit
count = 0
for filename in os.listdir(basedir):
path = os.path.join(basedir, filename)
count += 1
if not filename[-7:] in WHITELIST:
report['Invalid'].append(filename)
elif os.path.getmtime(path) < mod_time_limit:
report['OutOfDate'].append(filename)
else:
report['Correct'].append(filename)
return report, count
def main():
for supplier in SUPPLIERS:
print("\n{}:".format(supplier))
result, count = classify_files(
PATH_TO_FILES + supplier,
MAX_AGE * DAYS
)
pprint.pprint(result)
print("{} files checked".format(count))
if __name__ == '__main__':
main()
Issues
One of the most glaring issues is the hardcoded 7 here:
if not filename[-7:] in WHITELIST:
I'm not a fan of the entire if/else structure; it feels off. And I should probably use a different method of returning the result. How I handle times feels hacky, but it's a lot more straightforward than converting and messing around with datetime. I should probably be using Correct/Incorrect or Valid/Invalid instead of the current Correct/Invalid combination; naming is hard. The count looks ugly (perhaps summing the length of the resulting lists would be better), and I'm not sure what validations and exception-handlers would make sense here.
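For reference, the "sum the lengths" alternative mentioned above would be a one-liner over the existing report dict (just a sketch, not necessarily the better option):

count = sum(len(files) for files in report.values())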
I know the keys of report are probably not according to PEP8, and the extra white-space definitely isn't, but it looks better this way.
Speed is currently not a concern. Naming, maintainability, readability and extendability are the priority. Configuration by globals is fine for now; once it's clear what the final thing is going to look like, they will be replaced by either config files or arguments.
Complicated one-liners are sub-optimal. My Python might be good enough to understand it, but it would be preferable if colleagues less proficient can follow it as well.
2 Answers
Your code is small, but there's so much going on that I find it hard to understand. Given that this is just the beginning, the complexity of the code will only increase as time goes on.
I found the large number of globals hinders readability, as each time I came across another one I had to scroll to the top of the code to find out what it is.
Whilst I prefer to not have globals, and instead hide them away in classes, I think there are two options when using a class:
- Make a Classifier class that just has those globals defined on the class as class variables (a minimal sketch of this option follows the list).
- Make a Classifier class that is provided the values at instantiation, and utilize typing to convey what the values contain.
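For illustration, a minimal sketch of the first option; the names are hypothetical and simply mirror the question's globals, and the second option is essentially what the full code further down does:

class Classifier:
    # Configuration moves from module-level globals to class variables.
    EXT = '.pdf'
    LANGUAGES = {'_nl', '_en', '_de'}
    MAX_AGE_DAYS = 90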
Whilst I find a class helps reduce complexity, it should also increase general maintainability and readability. Currently your function is a bug/feature hotspot, and when you add more classifiers to the function the complexity will only increase.
Splitting the different validation checks into their own methods, or functions, helps break the function down into small, palatable chunks, and so increases readability.
It also increases maintainability, as when you need to add additional functionality you instead have a small self-contained function, which is simpler to re-learn when coming back to it after some time. In the changes I made, the validation checks each became a method containing a single if and return.
I'm currently not automatically iterating over all suppliers because different standards apply to some of them.
This also increases customization, as you can easily define multiple classes with different validation checks, utilizing either inheritance or mixins to simplify the code.
Since the Classifier class will have lots of functions that will be called in the same way, I would opt to make a BaseClassifier class that magically handles everything.
class that magically handles everything.- Since I would pass values in at instantiation I would define the
__init__
and simply assign the value to the instance. - Since we'll have lots of functions, to reduce the amount of WET I would automagically find the functions to run using
dir
andgetattr
. - Since the functions will all be called in the same way, defining a simple
validate
function can reduce the amount of code to write on each classifier.
- Since I would pass values in at instantiation I would define the
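For illustration only, here is the dir/getattr discovery trick in isolation, with hypothetical method names; the real implementation is _get_validators in the code further down:

class Demo:
    def _valid_a(self): ...
    def _valid_b(self): ...
    def other(self): ...

demo = Demo()
# dir() lists every attribute name; keep only the ones following the naming convention.
validators = [getattr(demo, name) for name in dir(demo) if name.startswith('_valid_')]
print([v.__name__ for v in validators])  # ['_valid_a', '_valid_b']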
I personally am not a fan of magic strings for your categories.
Rather than magic strings you can use enum.Enum or enum.IntFlag to better categorize your paths. enum.Enum can be used exactly as you use the strings now, which means you can only have one error per path. enum.IntFlag allows you to report multiple errors at the same time.
I have a personal distaste for programs that hide errors from me: it means you have to run the checker multiple times, fixing the previous errors only to then get more, rather than seeing all the problems and fixing everything in one swoop.
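To illustrate the difference, a short IntFlag sketch; the member names mirror the FileState enum defined in the full code below:

import enum

class FileState(enum.IntFlag):
    CORRECT = 0
    EXTENSION = enum.auto()
    LANGUAGE = enum.auto()
    OUT_OF_DATE = enum.auto()

# Flags can be combined, so a single path can carry several problems at once.
state = FileState.EXTENSION | FileState.OUT_OF_DATE
print(bool(state & FileState.EXTENSION))  # True
print(bool(state & FileState.LANGUAGE))   # False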
os.path is largely superseded by pathlib. Changing your code to use pathlib increases the readability of the validators, as you can easily check whether the extension is correct using if path.suffix == EXT. It also allows you to easily join paths using path / str.
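A quick pathlib sketch, using a made-up path that follows the question's naming scheme:

from pathlib import Path

path = Path('./lib') / 'Siemens' / '6ES7132-6BF01-0BA0_en.pdf'  # join with /
print(path.suffix == '.pdf')  # True
print(path.stem[-3:])         # _en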
I would suggest using logging rather than print. This sets up your application to have correct and easily togglable debugging output; it also means the output can be turned off in production when you're not debugging. (A short logging sketch follows the next point.)
I would prefer to have WHITELIST be a set; I see no need for it to be a list.
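For completeness, a minimal logging setup sketch; flipping the level is all it takes to silence debug output in production:

import logging

logging.basicConfig(level=logging.DEBUG)  # switch to logging.INFO or WARNING in production
logging.debug('Siemens')                  # replaces the per-supplier print
logging.info('%s files checked', 42)      # lazy %-formatting, only evaluated if the record is emitted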
Now, my solution is hands down more complex. It adds enum, a Classifier class and a BaseClassifier class, and is a significant increase in lines, so it is likely to increase maintenance costs. However, in my opinion the benefits I've described above outweigh the additional costs.
Currently both result and count are used for manual reference, later the result will be passed to the next function.
Given that you've not shown how the results are being used, I have provided two classify_files functions: one that outputs the same structure as you're currently expecting, and one that outputs all the paths grouped by all their errors.
import enum
import logging
import os
import pathlib
import pprint
import time
from typing import Any, Callable, ClassVar, Dict, Generic, Iterator, List, Optional, Set, Tuple, TypeVar
TItem = TypeVar('TItem')
TFlag = TypeVar('TFlag')
MAX_AGE = 90
# 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day
MINUTES = 60
HOURS = 60*MINUTES
DAYS = 24*HOURS
PATH_TO_FILES = pathlib.Path("./lib")
EXT = ".pdf"
LANGUAGES = {'_nl', '_en', '_de'}
SUPPLIERS = ['Siemens', 'IFM']
class BaseClassifier(Generic[TItem, TFlag]):
"""
Classifier metaclass to ease classifying values.
This is a fairly magical class that allows setting the required state.
It also allows validating against all validators starting with `_valid_`,
without having to manually call each of them.
"""
DEFAULT: ClassVar[TFlag]
def __init__(self, **kwargs: Any):
"""
Initialize classifier state.
Values must be passed as keywords so they can be set correctly
on the instance. For each keyword provided an attribute with the
same name is created and has the value set to the keywords value.
For example we can set the attribute `foo` by passing it as a keyword.
>>> bc = BaseClassifier(foo='bar')
>>> bc.foo
'bar'
"""
for name, value in kwargs.items():
setattr(self, name, value)
def _get_validators(self) -> List[Callable[[TItem], Optional[TFlag]]]:
"""
Get all validators on the instance.
All validators must start `_valid_` and so we loop through all
the names of the attributes on the instance, via `dir`, and
filter to ones that start with `_valid_`. For each name that
starts with `_valid_` we return the value of the attribute.
For example `Classifier` has the following validators.
Note that the test is for the function names, but it actually
returns the full fat function. Just getting the test to work
with the full fat functions is a PITA.
>>> validators = Classifier()._get_validators()
>>> [validator.__name__ for validator in validators]
['_valid_age', '_valid_extension', '_valid_language']
"""
return [
getattr(self, name)
for name in dir(self)
if name.startswith('_valid_')
]
def validate(self, items: Iterator[TItem]) -> Iterator[Tuple[TFlag, TItem]]:
"""
Validate input against the instance's validators.
For each provided item we run all validators to build a complete
picture of any problems with the item. Once all validators have
ran we output the item as is, but with any statuses that the
validators have assigned to it.
For example the test below shows that 123_it.txt has both an
invalid language and file type as it's flag is FileState.INVALID.
>>> classifier = Classifier(\
ext=EXT,\
languages=LANGUAGES,\
time_limit=time.time() - (MAX_AGE * DAYS),\
)
>>> list(classifier.validate([pathlib.Path('./lib/IFM/123_it.txt')]))
[(<FileState.INVALID: 3>, WindowsPath('lib/IFM/123_it.txt'))]
"""
functions = self._get_validators()
for item in items:
ret = self.DEFAULT
for function in functions:
output = function(item)
if output is not None:
ret |= output
yield ret, item
class FileState(enum.IntFlag):
"""States that files can be in."""
CORRECT = 0
EXTENSION = enum.auto()
LANGUAGE = enum.auto()
OUT_OF_DATE = enum.auto()
# A convenience for your old categories.
# only used in the backward-compatible `classify_paths`
INVALID = EXTENSION | LANGUAGE
class Classifier(BaseClassifier[pathlib.Path, FileState]):
"""Classifiers for files."""
__slots__ = ('ext', 'languages', 'time_limit')
DEFAULT: ClassVar[FileState] = FileState.CORRECT
ext: str
languages: Set[str]
time_limit: int
def _valid_extension(self, path: pathlib.Path) -> Optional[FileState]:
"""Validate if path has correct extension."""
if path.suffix != self.ext:
return FileState.EXTENSION
def _valid_language(self, path: pathlib.Path) -> Optional[FileState]:
"""Validate if path has correct language."""
if path.stem[-3:] not in self.languages:
return FileState.LANGUAGE
def _valid_age(self, path: pathlib.Path) -> Optional[FileState]:
"""Validate if file is out of date."""
if path.stat().st_mtime < self.time_limit:
return FileState.OUT_OF_DATE
# Simpler method of grouping categories
# This utilizes the `enum` we defined earlier.
def classify_paths_simple(
classifier: BaseClassifier,
paths: Iterator[pathlib.Path],
) -> Tuple[Dict[FileState, List[pathlib.Path]], int]:
"""Group classifications."""
    report: Dict[FileState, List[pathlib.Path]] = {}
    count = 0  # stays 0 if the supplier directory is empty
    # Start enumerate at 1 so count reflects the number of paths checked.
    for count, (state, path) in enumerate(classifier.validate(paths), 1):
        report.setdefault(state, []).append(path)
return report, count
# Complex as I make the output exactly the same as the question.
def classify_paths(
classifier: BaseClassifier,
paths: Iterator[pathlib.Path],
) -> Tuple[Dict[str, List[pathlib.Path]], int]:
"""Group classifications into generic custom types."""
report: Dict[str, List[pathlib.Path]] = {
'Correct': [],
'Invalid': [],
'OutOfDate': [],
}
    count = 0  # stays 0 if the supplier directory is empty
    # Start enumerate at 1 so count reflects the number of paths checked.
    for count, (state, path) in enumerate(classifier.validate(paths), 1):
if state == FileState.CORRECT:
key = 'Correct'
elif state & FileState.INVALID:
key = 'Invalid'
else:
key = 'OutOfDate'
report[key].append(path)
return report, count
def main():
classifier = Classifier(
ext=EXT,
languages=LANGUAGES,
time_limit=time.time() - (MAX_AGE * DAYS),
)
for supplier in SUPPLIERS:
logging.debug(supplier)
# Also use classify_paths_simple to contrast the outputs
result, count = classify_paths(
classifier,
(PATH_TO_FILES / supplier).iterdir()
)
pprint.pprint(result)
logging.info("%s files checked", count)
if __name__ == '__main__':
main()
import doctest
doctest.testmod()
Towards better functionality and data structures
prefer f-string formatting over concatenating multiple strings with +:
f'_nl{EXT}', f'_en{EXT}', f'_de{EXT}'
define the constants WHITELIST and SUPPLIERS as immutable data structures to avoid potential/accidental modification by subsequent callers. Furthermore, making WHITELIST a tuple helps solve the mentioned issue with "filename[-7:] in WHITELIST" (see the next points below):
WHITELIST = (f'_nl{EXT}', f'_en{EXT}', f'_de{EXT}')
SUPPLIERS = ('Siemens', 'IFM')
consistent naming: Correct/Incorrect or Valid/Invalid both seem sound and clear; I would go with one of them.
classify_files function:
instead of the hard-coded slicing in if not filename[-7:] in WHITELIST, use the convenient str.endswith approach, since we've already prepared the appropriate WHITELIST tuple:
...
if not filename.endswith(WHITELIST):
    report['Incorrect'].append(filename)
replace the redundant count = 0 and the subsequent increment count += 1 with the enumerate feature, starting from 1:
for i, filename in enumerate(os.listdir(basedir), 1):
    ...
After the loop has completed, the variable (counter) i will contain the number of iterated/checked files and can easily be used/referenced in the return statement return report, i.
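Putting these points together, classify_files could look roughly like this. This is only a sketch that reuses the question's imports and globals (os, time, WHITELIST) and picks the Correct/Incorrect naming from the point above:

def classify_files(basedir, limit):
    """Classify files in *basedir*; *limit* is the age threshold in seconds."""
    report = {'Correct': [], 'Incorrect': [], 'OutOfDate': []}
    mod_time_limit = time.time() - limit
    i = 0  # stays 0 if the directory is empty
    for i, filename in enumerate(os.listdir(basedir), 1):
        path = os.path.join(basedir, filename)
        if not filename.endswith(WHITELIST):
            report['Incorrect'].append(filename)
        elif os.path.getmtime(path) < mod_time_limit:
            report['OutOfDate'].append(filename)
        else:
            report['Correct'].append(filename)
    return report, i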
Comment (Jan 21, 2020 at 12:36): Ah, much better.