My goal was to create a simple-to-use CLI program for drawing directory trees (what tree
does on most platforms, basically). I'm submitting my 'backend' program for review because I think it makes more sense to review the actual algorithm, rather than reviewing an interface on top of it.
import os
import re
def tree(path, indentation=4, ignore_hidden=True, regex=None):
"""Build a directory tree starting at `path`. See also
https://en.wikipedia.org/wiki/File:Files11_directory_hierarchy.svg
Arguments
* path: A relative or absolute path to use as starting point
for the tree. [type: str]
* indentation: The amount of spaces to indent each level with.
Defaults to 4. [type: int]
* ignore_hidden: Ignore hidden files (starting with a period).
Defaults to True. [type: boolean]
* regex: If set, only matching files will be shown.
Set to None to show all files. Defaults to None.
[type: str / None]
Returns
* A directory tree [type: str]
Example
Assuming we're in the 'home' directory:
>>> import tree
>>> t = tree.tree(path="./")
>>> t
./images
foo.jpg
bar.jpg
./images/personal
spam.png
./documents
eggs.docx
>>> t = tree.tree(path="./", ignore_hidden=False)
>>> t
.hidden.txt
./images
foo.jpg
bar.jpg
.hidden.jpg
./images/personal
spam.png
./documents
eggs.docx
>>> t = tree.tree("./", 8, regex="(.*\.png)")
>>> t
./images
./images/personal
spam.png
./documents
"""
structure = "\n"
tab = " "*indentation
if regex is None:
for root, directories, files in os.walk(path):
try:
depth = root.count(os.sep)
offset = tab * depth
structure += offset + root + "\n"
except OSError:
continue
for f in files:
if ignore_hidden and f.startswith("."):
continue
try:
depth = os.path.join(root, f).count(os.sep)
offset = tab * depth
structure += offset + f + "\n"
except OSError:
continue
else:
restriction = re.compile(regex)
for root, directories, files in os.walk(path):
try:
depth = root.count(os.sep)
offset = tab * depth
structure += offset + root + "\n"
except OSError:
continue
for f in files:
if ignore_hidden and f.startswith("."):
continue
if not re.match(restriction, f):
continue
try:
depth = os.path.join(root, f).count(os.sep)
offset = tab * depth
structure += offset + f + "\n"
except OSError:
continue
structure = structure.split("\n")
dedented = list(map(lambda e: e.replace(tab, "", 1), structure))[1:]
return "\n".join(dedented)
I believe my code runs in \$O(n)\$ complexity and running it on my Android phone on /
takes about 9 seconds:
time python -c "import tree; tree.tree('/')"
real 0m9.365s
user 0m7.390s
sys 0m1.630s
I'd appreciate any feedback regarding performance, usability, coding style, documentation, or anything else!
1 Answer 1
According to whether regex is None, you run code segment A or B. This feels like a copy-n-paste "bolt-on" to add the regex feature.
I suggest that you may be able to refactor A & B into a common piece of code that pays attention to regex
and Does The Right Thing.
You didn't publish any automated unit tests that exercise this code. Putting together some unit tests would help you to refactor while verifying that functionality is unchanged.
You have some KPIs, and you mentioned 9s wall clock time, but did not mention how many directory entries were parsed in those 9 seconds. It would be interesting to know what sort of inputs this code expects, and where a profiler says the code spends most of its time.
-
\$\begingroup\$ Thank you for your response! I split the function into two blocks because I figured checking for an
re.match
in afor
-loop's a real waste of time if no regex was passed in the first place. Can you recommend another way of approaching that? \$\endgroup\$Daniel– Daniel2017年08月09日 06:47:39 +00:00Commented Aug 9, 2017 at 6:47 -
\$\begingroup\$ Profile it first, then optimize for
regex is None
or whatever. I'd expect sys calls for os.walk() to dominate the usermode CPU time that checks a regex. Anyway, consider refactoring the common tree walking code into a generator function that repeatedlyyield
s file entries. \$\endgroup\$J_H– J_H2017年08月09日 20:20:50 +00:00Commented Aug 9, 2017 at 20:20 -
\$\begingroup\$ In
dedented
you strip TABs. If you're looking to save a few CPU cycles, perhaps you'd prefer not to insert TABs in the first place. Also,replace()
is more aggressive thanre.compile(r'^ +')
orlstrip()
-- for example a filename might be "READ[4 spaces]ME.txt". \$\endgroup\$J_H– J_H2017年08月09日 21:04:55 +00:00Commented Aug 9, 2017 at 21:04