Note: this code will be run using portable WinPython 3.7.6.0 (Python 3.7.6, package list here) on Win10x64. Installing new packages or upgrading the Python version are not an option for this project.
Disclosure: this code is heavily inspired by Janos' excellent answer here.
I've written a piece of crap that's going to grow a lot over time, so I thought I'd get the base reviewed instead of posting everything at once. After all, at this stage I can still easily scrap half of it without problems. Current version has already been a great help.
Situation
At work, we keep a lot of datasheets for the parts we use. The library of parts keeps growing and the datasheets are updated by the manufacturers without notice. For some of our parts we're obliged to keep the documentation in order, and to prevent accidentally missing relevant docs we keep the datasheets of everything even remotely electrical (including push-buttons, LEDs, etc.). Oh, and in 3 languages per part. Which means I'm talking about a couple thousand documents for some of our suppliers.
Directory structure looks like this:
|> current lib
    |> supplier1
        part1_de.pdf
        part1_en.pdf
        part1_nl.pdf
        part2_de.pdf
        part2_en.pdf
        part2_nl.pdf
    |> supplier2
Etc.
So if we have a 6ES7132-6BF01-0BA0 from Siemens, we have:
|> current lib
    |> Siemens
        6ES7132-6BF01-0BA0_de.pdf
        6ES7132-6BF01-0BA0_en.pdf
        6ES7132-6BF01-0BA0_nl.pdf
There are a couple of things that can go wrong with this:
- Missing files (only _de available instead of all 3).
- Files not signed (6ES71326BF010BA0_de.pdf instead of 6ES7132-6BF01-0BA0_de.pdf).
- Invalid language-separator signs (older files have the language separated by - instead of _).
- Old language (older Dutch files are named _ne instead of _nl).
- Incorrect language (we're not interested in _it).
- Incorrect/missing extension (all files should be .pdf).
- Outdated documents (the older a file gets, the more likely the data has been adjusted).
This has to be pretty exact, since the files are externally referenced from an unfortunately hard-coded list. Back-ups are taken care of, so the directory only has to contain recent files without worrying about the archive.
I'm still working on the first 2 points of that list (I can check against a CSV containing all the items), but that's for the next version. Those are out of scope for this question, but perhaps something to keep in mind. The rest is covered in this one by explicitly white-listing the last couple of characters.
Approach
Set a couple of globals, create a white-list, iterate over the files and classify each file as either 'Correct', 'Invalid' or 'OutOfDate' (the latter based on modification time (mtime); the creation date is unreliable). Return both the results and the total count of checked files. Print in a readable format.
Currently both result and count are used for manual reference; later the result will be passed to the next function. One supplier will be checked at a time. Current code considers files older than 90 days out-of-date. The actual list of suppliers is 20+ entries. I'm currently not automatically iterating over all suppliers because different standards apply to some of them. Those are exempt (come to think of it, perhaps I should put those on a blacklist and iterate over the list, but that still seems risky). Current code works as expected for this revision.
Code
import os
import pprint
import time
MAX_AGE = 90
# 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day
MINUTES = 60
HOURS = 60*MINUTES
DAYS = 24*HOURS
PATH_TO_FILES = "path/to/local/copy/of/lib/"
EXT = ".pdf"
WHITELIST = ['_nl' + EXT, '_en' + EXT, '_de' + EXT]
SUPPLIERS = ['Siemens', 'IFM']
def classify_files(basedir, limit):
"""
Classify files in *basedir* not modified within *limit*
:param basedir: directory to search
    :param limit: time limit for threshold
Return result (dict) and count (int).
"""
report = {
'Correct' : [],
'Invalid' : [],
'OutOfDate': []
}
mod_time_limit = time.time() - limit
count = 0
for filename in os.listdir(basedir):
path = os.path.join(basedir, filename)
count += 1
if not filename[-7:] in WHITELIST:
report['Invalid'].append(filename)
elif os.path.getmtime(path) < mod_time_limit:
report['OutOfDate'].append(filename)
else:
report['Correct'].append(filename)
return report, count
def main():
for supplier in SUPPLIERS:
print("\n{}:".format(supplier))
result, count = classify_files(
PATH_TO_FILES + supplier,
MAX_AGE * DAYS
)
pprint.pprint(result)
print("{} files checked".format(count))
if __name__ == '__main__':
main()
Issues
One of the most glaring issues is the hardcoded 7 here:
if not filename[-7:] in WHITELIST:
I'm not a fan of the entire if/else structure; it feels off. And I should probably use a different method of returning the result. How I handle times feels hacky, but it's a lot more straightforward than converting and messing around with datetime. I should probably be using Correct/Incorrect or Valid/Invalid instead of the current Correct/Invalid combination; naming is hard. The count looks ugly (perhaps summing the length of the resulting lists would be better), and I'm not sure what validations and exception-handlers would make sense here.
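For reference, the "sum the lengths" alternative mentioned above would be a one-liner over the existing report dict (just a sketch, not necessarily the better option):

count = sum(len(files) for files in report.values())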
I know the keys of report are probably not according to PEP8, and the extra white-space definitely isn't, but it looks better this way.
Speed is currently not a concern. Naming, maintainability, readability and extendability are the priority. Configuration by globals is fine for now; once it's clear what the final thing is going to look like, they will be replaced by either config files or arguments.
Complicated one-liners are sub-optimal. My Python might be good enough to understand it, but it would be preferable if colleagues less proficient can follow it as well.
2 Answers
Your code is small, but there's so much going on that I find it hard to understand. Given that this is just the beginning, the complexity of the code will only increase as time goes on.
I found the large number of globals hinders readability, as each time I came across another one I had to scroll to the top of the code to find out what it is.
Whilst I prefer to not have globals, and instead hide them away in classes, I think there are two options when using a class:
- Make a Classifier class that just has those globals defined on the class as class variables (a minimal sketch of this option follows the list).
- Make a Classifier class that is provided the values at instantiation, and utilize typing to convey what the values contain.
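For illustration, a minimal sketch of the first option; the names are hypothetical and simply mirror the question's globals, and the second option is essentially what the full code further down does:

class Classifier:
    # Configuration moves from module-level globals to class variables.
    EXT = '.pdf'
    LANGUAGES = {'_nl', '_en', '_de'}
    MAX_AGE_DAYS = 90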
Whilst I find a class helps reduce complexity, it should also increase general maintainability and readability. Currently your function is a bug/feature hotspot, and when you add more classifiers to the function the complexity will only increase.
Splitting the different validation checks into their own methods, or functions, helps break the function down into small, palatable chunks, and so increases readability.
It also increases maintainability, as when you need to add additional functionality you instead have a small self-contained function, which is simpler to re-learn when coming back to it after some time. In the changes I made, the validation checks each became a method containing a single if and return.
I'm currently not automatically iterating over all suppliers because different standards apply to some of them.
This also increases customization, as you can easily define multiple classes with different validation checks, utilizing either inheritance or mixins to simplify the code.
Since the Classifier class will have lots of functions that will be called in the same way, I would opt to make a BaseClassifier class that magically handles everything.
class that magically handles everything.- Since I would pass values in at instantiation I would define the
__init__
and simply assign the value to the instance. - Since we'll have lots of functions, to reduce the amount of WET I would automagically find the functions to run using
dir
andgetattr
. - Since the functions will all be called in the same way, defining a simple
validate
function can reduce the amount of code to write on each classifier.
- Since I would pass values in at instantiation I would define the
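For illustration only, here is the dir/getattr discovery trick in isolation, with hypothetical method names; the real implementation is _get_validators in the code further down:

class Demo:
    def _valid_a(self): ...
    def _valid_b(self): ...
    def other(self): ...

demo = Demo()
# dir() lists every attribute name; keep only the ones following the naming convention.
validators = [getattr(demo, name) for name in dir(demo) if name.startswith('_valid_')]
print([v.__name__ for v in validators])  # ['_valid_a', '_valid_b']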
I personally am not a fan of magic strings for your categories.
Rather than magic strings you can use enum.Enum or enum.IntFlag to better categorize your paths. enum.Enum can be used exactly as you use the strings now, which means you can only have one error per path. enum.IntFlag allows you to report multiple errors at the same time.
I have a personal distaste for programs that hide errors from me: it means you have to run the checker multiple times, fixing the previous errors only to then get more, rather than seeing all the problems and fixing everything in one swoop.
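To illustrate the difference, a short IntFlag sketch; the member names mirror the FileState enum defined in the full code below:

import enum

class FileState(enum.IntFlag):
    CORRECT = 0
    EXTENSION = enum.auto()
    LANGUAGE = enum.auto()
    OUT_OF_DATE = enum.auto()

# Flags can be combined, so a single path can carry several problems at once.
state = FileState.EXTENSION | FileState.OUT_OF_DATE
print(bool(state & FileState.EXTENSION))  # True
print(bool(state & FileState.LANGUAGE))   # False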
os.path is largely superseded by pathlib. Changing your code to use pathlib increases the readability of the validators, as you can easily check whether the extension is correct using if path.suffix == EXT. It also allows you to easily join paths using path / str.
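A quick pathlib sketch, using a made-up path that follows the question's naming scheme:

from pathlib import Path

path = Path('./lib') / 'Siemens' / '6ES7132-6BF01-0BA0_en.pdf'  # join with /
print(path.suffix == '.pdf')  # True
print(path.stem[-3:])         # _en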
I would suggest using logging rather than print. This sets up your application to have correct and easily togglable debugging output; it also means the output can be turned off in production when you're not debugging. (A short logging sketch follows the next point.)
I would prefer to have WHITELIST be a set; I see no need for it to be a list.
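For completeness, a minimal logging setup sketch; flipping the level is all it takes to silence debug output in production:

import logging

logging.basicConfig(level=logging.DEBUG)  # switch to logging.INFO or WARNING in production
logging.debug('Siemens')                  # replaces the per-supplier print
logging.info('%s files checked', 42)      # lazy %-formatting, only evaluated if the record is emitted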
Now, my solution is hands down more complex. It adds enum, a Classifier class and a BaseClassifier class, and is a significant increase in lines, so it is likely to increase maintenance costs. However, in my opinion the benefits I've described above outweigh the additional costs.
Currently both result and count are used for manual reference, later the result will be passed to the next function.
Given that you've not shown how the results are being used, I have provided two classify_files functions: one that outputs the same structure as you're currently expecting, and one that outputs all the paths grouped by all their errors.
import enum
import logging
import os
import pathlib
import pprint
import time
from typing import Any, Callable, ClassVar, Dict, Generic, Iterator, List, Optional, Set, Tuple, TypeVar
TItem = TypeVar('TItem')
TFlag = TypeVar('TFlag')
MAX_AGE = 90
# 60 seconds in a minute, 60 minutes in an hour, 24 hours in a day
MINUTES = 60
HOURS = 60*MINUTES
DAYS = 24*HOURS
PATH_TO_FILES = pathlib.Path("./lib")
EXT = ".pdf"
LANGUAGES = {'_nl', '_en', '_de'}
SUPPLIERS = ['Siemens', 'IFM']
class BaseClassifier(Generic[TItem, TFlag]):
"""
Classifier metaclass to ease classifying values.
This is a fairly magical class that allows setting the required state.
It also allows validating against all validators starting with `_valid_`,
without having to manually call each of them.
"""
DEFAULT: ClassVar[TFlag]
def __init__(self, **kwargs: Any):
"""
Initialize classifier state.
Values must be passed as keywords so they can be set correctly
on the instance. For each keyword provided an attribute with the
same name is created and has the value set to the keywords value.
For example we can set the attribute `foo` by passing it as a keyword.
>>> bc = BaseClassifier(foo='bar')
>>> bc.foo
'bar'
"""
for name, value in kwargs.items():
setattr(self, name, value)
def _get_validators(self) -> List[Callable[[TItem], Optional[TFlag]]]:
"""
Get all validators on the instance.
All validators must start `_valid_` and so we loop through all
the names of the attributes on the instance, via `dir`, and
filter to ones that start with `_valid_`. For each name that
starts with `_valid_` we return the value of the attribute.
For example `Classifier` has the following validators.
Note that the test is for the function names, but it actually
returns the full fat function. Just getting the test to work
with the full fat functions is a PITA.
>>> validators = Classifier()._get_validators()
>>> [validator.__name__ for validator in validators]
['_valid_age', '_valid_extension', '_valid_language']
"""
return [
getattr(self, name)
for name in dir(self)
if name.startswith('_valid_')
]
def validate(self, items: Iterator[TItem]) -> Iterator[Tuple[TFlag, TItem]]:
"""
Validate input against the instance's validators.
For each provided item we run all validators to build a complete
picture of any problems with the item. Once all validators have
ran we output the item as is, but with any statuses that the
validators have assigned to it.
For example the test below shows that 123_it.txt has both an
invalid language and file type as it's flag is FileState.INVALID.
>>> classifier = Classifier(\
ext=EXT,\
languages=LANGUAGES,\
time_limit=time.time() - (MAX_AGE * DAYS),\
)
>>> list(classifier.validate([pathlib.Path('./lib/IFM/123_it.txt')]))
[(<FileState.INVALID: 3>, WindowsPath('lib/IFM/123_it.txt'))]
"""
functions = self._get_validators()
for item in items:
ret = self.DEFAULT
for function in functions:
output = function(item)
if output is not None:
ret |= output
yield ret, item
class FileState(enum.IntFlag):
"""States that files can be in."""
CORRECT = 0
EXTENSION = enum.auto()
LANGUAGE = enum.auto()
OUT_OF_DATE = enum.auto()
# A convenience for your old categories.
# only used in the backward-compatible `classify_paths`
INVALID = EXTENSION | LANGUAGE
class Classifier(BaseClassifier[pathlib.Path, FileState]):
"""Classifiers for files."""
__slots__ = ('ext', 'languages', 'time_limit')
DEFAULT: ClassVar[FileState] = FileState.CORRECT
ext: str
languages: Set[str]
time_limit: int
def _valid_extension(self, path: pathlib.Path) -> Optional[FileState]:
"""Validate if path has correct extension."""
if path.suffix != self.ext:
return FileState.EXTENSION
def _valid_language(self, path: pathlib.Path) -> Optional[FileState]:
"""Validate if path has correct language."""
if path.stem[-3:] not in self.languages:
return FileState.LANGUAGE
def _valid_age(self, path: pathlib.Path) -> Optional[FileState]:
"""Validate if file is out of date."""
if path.stat().st_mtime < self.time_limit:
return FileState.OUT_OF_DATE
# Simpler method of grouping categories
# This utilizes the `enum` we defined earlier.
def classify_paths_simple(
classifier: BaseClassifier,
paths: Iterator[pathlib.Path],
) -> Tuple[Dict[FileState, List[pathlib.Path]], int]:
"""Group classifications."""
    report: Dict[FileState, List[pathlib.Path]] = {}
    count = 0  # stays 0 if the supplier directory is empty
    # Start enumerate at 1 so count reflects the number of paths checked.
    for count, (state, path) in enumerate(classifier.validate(paths), 1):
        report.setdefault(state, []).append(path)
return report, count
# Complex as I make the output exactly the same as the question.
def classify_paths(
classifier: BaseClassifier,
paths: Iterator[pathlib.Path],
) -> Tuple[Dict[str, List[pathlib.Path]], int]:
"""Group classifications into generic custom types."""
report: Dict[str, List[pathlib.Path]] = {
'Correct': [],
'Invalid': [],
'OutOfDate': [],
}
    count = 0  # stays 0 if the supplier directory is empty
    # Start enumerate at 1 so count reflects the number of paths checked.
    for count, (state, path) in enumerate(classifier.validate(paths), 1):
if state == FileState.CORRECT:
key = 'Correct'
elif state & FileState.INVALID:
key = 'Invalid'
else:
key = 'OutOfDate'
report[key].append(path)
return report, count
def main():
classifier = Classifier(
ext=EXT,
languages=LANGUAGES,
time_limit=time.time() - (MAX_AGE * DAYS),
)
for supplier in SUPPLIERS:
logging.debug(supplier)
# Also use classify_paths_simple to contrast the outputs
result, count = classify_paths(
classifier,
(PATH_TO_FILES / supplier).iterdir()
)
pprint.pprint(result)
logging.info("%s files checked", count)
if __name__ == '__main__':
main()
import doctest
doctest.testmod()
Towards better functionality and data structures
prefer f-string formatting over concatenating multiple strings with +:
f'_nl{EXT}', f'_en{EXT}', f'_de{EXT}'
define the constants WHITELIST and SUPPLIERS as immutable data structures to avoid potential/accidental modification by subsequent callers. Furthermore, making WHITELIST a tuple helps solve the mentioned issue with "filename[-7:] in WHITELIST" (see the next points below):
WHITELIST = (f'_nl{EXT}', f'_en{EXT}', f'_de{EXT}')
SUPPLIERS = ('Siemens', 'IFM')
consistent naming: Correct/Incorrect or Valid/Invalid both seem sound and clear; I would go with one of them.
classify_files function:
instead of the hard-coded slicing in if not filename[-7:] in WHITELIST, use the convenient str.endswith approach, since we've already prepared the appropriate WHITELIST tuple:
...
if not filename.endswith(WHITELIST):
    report['Incorrect'].append(filename)
replace the redundant count = 0 and the subsequent increment count += 1 with the enumerate feature, starting from 1:
for i, filename in enumerate(os.listdir(basedir), 1):
    ...
After the loop has completed, the variable (counter) i will contain the number of iterated/checked files and can easily be used/referenced in the return statement return report, i.
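Putting these points together, classify_files could look roughly like this. This is only a sketch that reuses the question's imports and globals (os, time, WHITELIST) and picks the Correct/Incorrect naming from the point above:

def classify_files(basedir, limit):
    """Classify files in *basedir*; *limit* is the age threshold in seconds."""
    report = {'Correct': [], 'Incorrect': [], 'OutOfDate': []}
    mod_time_limit = time.time() - limit
    i = 0  # stays 0 if the directory is empty
    for i, filename in enumerate(os.listdir(basedir), 1):
        path = os.path.join(basedir, filename)
        if not filename.endswith(WHITELIST):
            report['Incorrect'].append(filename)
        elif os.path.getmtime(path) < mod_time_limit:
            report['OutOfDate'].append(filename)
        else:
            report['Correct'].append(filename)
    return report, i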
Comment (Jan 21, 2020 at 12:36): Ah, much better.