I need to iterate over a large (in the thousands) list of files in a network folder and "concatenate" any image sequences into just one entry with the range of images (first image, last image in sequence). I am calling this "sequence pruning" and I created the following code which does work, but it seems incredibly un-pythonic to me and runs slowly. I'm certain there is a MUCH better way to do this, so I am looking for review to help clean/speed this up.
To elaborate a bit further on the issue lets say I have this as a list of files for the input:
img.001.png
img.002.png
img.003.png
img_other.001.png
random_file.txt
yet another seq.0000.png
yet another seq.0001.png
yet another seq.0002.png
yet another seq.0021.png
yet another seq.0030.png
In the end I want to return something like this:
img.001.png, [1-3]
img_other.001.png
random_file.txt
yet another seq.0000.png, [0-30]
FYI I can safely assume that the image sequence number is always going to be a series of digits at the very end of the filename (before the extension of course). However I cannot assume that they will be perfectly sequential, as there are sometimes "gaps" between numbers.
Here's my current code, python 2.7:
import os
def split_padding(path):
file, ext = os.path.splitext(path)
pad_int = 0
while file[pad_int * -1 - 1].isdigit():
pad_int += 1
if pad_int == 0:
return file, '0', ext
clean_file = file[0:pad_int * -1]
padding = file[pad_int * -1:]
return clean_file, padding, ext
def strip_padding(path):
file, ext = os.path.splitext(path)
while file[-1].isdigit():
file = file[:-1]
return file
def prune_files(paths):
'''
sequences get put into arrays like so:
[x_folder, z_folder, [test_a.000.png, 0, 2], [test_b.000.tif, 0, 3], test_C.000.png]
:return: [file1, file2, [first_file, seq_start, seq_end]]
'''
paths.sort(key=lambda s: s.lower()) # list has to be sorted for this to work
# this odd bit of code turns all sequences into arrays of images.
pruned_list = []
switch = True
for c, path in enumerate(paths):
if c == 0:
pruned_list.append(path)
continue
if not os.path.splitext(path)[1] in ['.png', '.tif', '.tiff', '.exr', '.jpg', '.jpeg']:
pruned_list.append(path)
continue
test = paths[c-1]
if strip_padding(path) == strip_padding(test):
if switch:
pruned_list[-1] = [pruned_list[-1]]
switch = False
pruned_list[-1].append(path)
else:
pruned_list.append(path)
switch = True
# so now lets convert that to the format we want to return
for c, item in enumerate(pruned_list):
if type(item) == list:
pruned_list[c] = [item[0], int(split_padding(item[0])[1]), int(split_padding(item[-1])[1])]
return pruned_list
if __name__ == "__main__":
test_dir = "some directory"
print prune_files([path for path in os.listdir(test_dir) if os.path.isfile(os.path.join(test_dir, path))])
1 Answer 1
Many, many problems:
- Low level dealing with strings. Python is a high-level programming language - don't reinvent the wheel. Regex and string methods now are replacing your low-level code in my rework of it. Good rule of thumb is to avoid any code that has a lot of array indexing in Python.
- Sorting doesn't need a function if you use
sorted
- Long comprehensions and generators are usually unreadable and unmaintainable so avoid those at all costs. Especially avoid adding needless logic in them too.
- Extension list is a constant so you might as well extract that to the top.
- Simpler or more standard output could have made the code much simpler but it wasn't clear if that was the requirement.
- Hardcoding args shouldn't be done when
sys.argv
andargparse
are so easy to use.
tl;dr fixed code. Code could be a lot simpler if the output format was a bit more standard but you can probably modify the code with minimal effort:
#!/usr/bin/env python3
import os
import re
import sys
FILE_EXTS = [
'exr',
'jpeg',
'jpg',
'png',
'tif',
'tiff',
]
SEQUENCE_PATTERN = r'(.*)\.([0-9]+).(.{3,4})$'
def sequences_strigifier(sequences):
output_string = ''
for key, seq_info in sequences.items():
if not seq_info:
output_string += '{}\n'.format(key)
continue
if seq_info['start_index'] == seq_info['end_index']:
output_string += '{}.{}.{}\n'.format(key,
seq_info['start_index_str'],
seq_info['ext'])
continue
output_string += '{}.{}.{}, [{}-{}]\n'.format(key,
seq_info['start_index_str'],
seq_info['ext'],
seq_info['start_index'],
seq_info['end_index'])
return output_string.strip()
def find_image_sequences(directory):
'''
sequences get put into arrays like so:
[x_folder, z_folder, [test_a.000.png, 0, 2], [test_b.000.tif, 0, 3], test_C.000.png]
:return: [file1, file2, [first_file, seq_start, seq_end]]
'''
sequences = {}
sorted_candidate_list = sorted(os.listdir(directory))
for candidate_path in sorted_candidate_list:
full_candidate_path = os.path.join(directory, candidate_path)
if not os.path.isfile(full_candidate_path):
sequences[candidate_path] = None
continue
matches = re.match(SEQUENCE_PATTERN, candidate_path)
if not matches:
sequences[candidate_path] = None
continue
filename = matches.group(1)
sequence_index = matches.group(2)
extension = matches.group(3)
if not extension in FILE_EXTS:
sequences[candidate_path] = None
continue
if not filename in sequences:
sequences[filename] = {
'ext': extension,
'start_index_str': sequence_index,
'start_index': int(sequence_index),
'end_index': int(sequence_index),
}
continue
sequences[filename]['end_index'] = int(sequence_index)
return sequences_strigifier(sequences)
if __name__ == '__main__':
if len(sys.argv) < 2:
print('Usage: {} <dirname>'.format(sys.argv[0]))
exit(1)
print(find_image_sequences(sys.argv[1]))
-
\$\begingroup\$ Just a note that for a large set, the stringifier will be pretty slow so I would either print directly out of there or use join()/StringIO. \$\endgroup\$Srdjan Grubor– Srdjan Grubor2018年05月07日 18:33:38 +00:00Commented May 7, 2018 at 18:33
-
\$\begingroup\$ @Coal_ re usage format: Indeed. fixed! Good point on the exit as well. \$\endgroup\$Srdjan Grubor– Srdjan Grubor2018年05月08日 13:33:22 +00:00Commented May 8, 2018 at 13:33
-
\$\begingroup\$ Thanks Srdjan, lots of improvement! Sorry if I was unclear about the usage of this code, I do not want a string as the output so the stringifier part is unnecessary (dictionary works for me). And sys.arv is not needed because this being used as a module within my larger program, and will be called directly. I took your code with some tweaks and inserted it into my original Q, let me know what you think. The use of a dictionary for 'sequences', and regex, definitely cleans things up! However this runs only a tad faster than my original code, any ideas to speed it up? \$\endgroup\$Spencer– Spencer2018年05月08日 19:30:35 +00:00Commented May 8, 2018 at 19:30
-
1\$\begingroup\$ Please see What to do when someone answers . I have rolled back Rev 5 → 3. \$\endgroup\$200_success– 200_success2018年05月09日 03:49:30 +00:00Commented May 9, 2018 at 3:49
-
1\$\begingroup\$ For the record I posted this regex problem on SO and got some good answers here: stackoverflow.com/questions/50311032/… \$\endgroup\$Spencer– Spencer2018年05月12日 22:40:45 +00:00Commented May 12, 2018 at 22:40
prune_files()
currently does not take into account file extensions, only name and sequence numbers. Is that the expected behavior? \$\endgroup\$