4
\$\begingroup\$

I need to iterate over a large (in the thousands) list of files in a network folder and "concatenate" any image sequences into just one entry with the range of images (first image, last image in sequence). I am calling this "sequence pruning" and I created the following code which does work, but it seems incredibly un-pythonic to me and runs slowly. I'm certain there is a MUCH better way to do this, so I am looking for review to help clean/speed this up.

To elaborate a bit further on the issue lets say I have this as a list of files for the input:

img.001.png
img.002.png
img.003.png
img_other.001.png
random_file.txt
yet another seq.0000.png
yet another seq.0001.png
yet another seq.0002.png
yet another seq.0021.png
yet another seq.0030.png

In the end I want to return something like this:

img.001.png, [1-3]
img_other.001.png
random_file.txt
yet another seq.0000.png, [0-30]

FYI I can safely assume that the image sequence number is always going to be a series of digits at the very end of the filename (before the extension of course). However I cannot assume that they will be perfectly sequential, as there are sometimes "gaps" between numbers.

Here's my current code, python 2.7:

import os
def split_padding(path):
 file, ext = os.path.splitext(path)
 pad_int = 0
 while file[pad_int * -1 - 1].isdigit():
 pad_int += 1
 if pad_int == 0:
 return file, '0', ext
 clean_file = file[0:pad_int * -1]
 padding = file[pad_int * -1:]
 return clean_file, padding, ext
def strip_padding(path):
 file, ext = os.path.splitext(path)
 while file[-1].isdigit():
 file = file[:-1]
 return file
def prune_files(paths):
 '''
 sequences get put into arrays like so:
 [x_folder, z_folder, [test_a.000.png, 0, 2], [test_b.000.tif, 0, 3], test_C.000.png]
 :return: [file1, file2, [first_file, seq_start, seq_end]]
 '''
 paths.sort(key=lambda s: s.lower()) # list has to be sorted for this to work
 # this odd bit of code turns all sequences into arrays of images.
 pruned_list = []
 switch = True
 for c, path in enumerate(paths):
 if c == 0:
 pruned_list.append(path)
 continue
 if not os.path.splitext(path)[1] in ['.png', '.tif', '.tiff', '.exr', '.jpg', '.jpeg']:
 pruned_list.append(path)
 continue
 test = paths[c-1]
 if strip_padding(path) == strip_padding(test):
 if switch:
 pruned_list[-1] = [pruned_list[-1]]
 switch = False
 pruned_list[-1].append(path)
 else:
 pruned_list.append(path)
 switch = True
 # so now lets convert that to the format we want to return
 for c, item in enumerate(pruned_list):
 if type(item) == list:
 pruned_list[c] = [item[0], int(split_padding(item[0])[1]), int(split_padding(item[-1])[1])]
 return pruned_list
if __name__ == "__main__":
 test_dir = "some directory"
 print prune_files([path for path in os.listdir(test_dir) if os.path.isfile(os.path.join(test_dir, path))])
200_success
145k22 gold badges190 silver badges478 bronze badges
asked May 5, 2018 at 18:02
\$\endgroup\$
5
  • \$\begingroup\$ Can you assume the numbers will be sorted at all? Is it possible that 'img.003.png' comes after 'random_file.txt', or are the file names sorted alphabetically too? \$\endgroup\$ Commented May 5, 2018 at 18:59
  • \$\begingroup\$ The input list comes from os.listdir(), so I get whatever that returns. From my experience it isn't always sorted so I opted to do my own sort in the beginning of my prune_files def. I realize that was perhaps confusing in my example, so I just fixed it! \$\endgroup\$ Commented May 5, 2018 at 19:03
  • \$\begingroup\$ I'm just playing around a bit to see how performance could be improved, and noticed prune_files() currently does not take into account file extensions, only name and sequence numbers. Is that the expected behavior? \$\endgroup\$ Commented May 6, 2018 at 13:18
  • \$\begingroup\$ @Coal_ Ha that would explain an error I got yesterday. No, not really intended so I'll have to fix that. As a note I found that looping through the list and stripping out the sequence numbers, then converting to a set to remove duplicates is a very rapid way of getting rid of everything but one entry per sequence. However at that point you still don't have first and last file... \$\endgroup\$ Commented May 6, 2018 at 23:31
  • \$\begingroup\$ It would now be best to ask a new question with the changes incorporated, since you have accepted an answer. \$\endgroup\$ Commented May 8, 2018 at 21:56

1 Answer 1

3
\$\begingroup\$

Many, many problems:

  • Low level dealing with strings. Python is a high-level programming language - don't reinvent the wheel. Regex and string methods now are replacing your low-level code in my rework of it. Good rule of thumb is to avoid any code that has a lot of array indexing in Python.
  • Sorting doesn't need a function if you use sorted
  • Long comprehensions and generators are usually unreadable and unmaintainable so avoid those at all costs. Especially avoid adding needless logic in them too.
  • Extension list is a constant so you might as well extract that to the top.
  • Simpler or more standard output could have made the code much simpler but it wasn't clear if that was the requirement.
  • Hardcoding args shouldn't be done when sys.argv and argparse are so easy to use.

tl;dr fixed code. Code could be a lot simpler if the output format was a bit more standard but you can probably modify the code with minimal effort:

#!/usr/bin/env python3
import os
import re
import sys
FILE_EXTS = [
 'exr',
 'jpeg',
 'jpg',
 'png',
 'tif',
 'tiff',
]
SEQUENCE_PATTERN = r'(.*)\.([0-9]+).(.{3,4})$'
def sequences_strigifier(sequences):
 output_string = ''
 for key, seq_info in sequences.items():
 if not seq_info:
 output_string += '{}\n'.format(key)
 continue
 if seq_info['start_index'] == seq_info['end_index']:
 output_string += '{}.{}.{}\n'.format(key,
 seq_info['start_index_str'],
 seq_info['ext'])
 continue
 output_string += '{}.{}.{}, [{}-{}]\n'.format(key,
 seq_info['start_index_str'],
 seq_info['ext'],
 seq_info['start_index'],
 seq_info['end_index'])
 return output_string.strip()
def find_image_sequences(directory):
 '''
 sequences get put into arrays like so:
 [x_folder, z_folder, [test_a.000.png, 0, 2], [test_b.000.tif, 0, 3], test_C.000.png]
 :return: [file1, file2, [first_file, seq_start, seq_end]]
 '''
 sequences = {}
 sorted_candidate_list = sorted(os.listdir(directory))
 for candidate_path in sorted_candidate_list:
 full_candidate_path = os.path.join(directory, candidate_path)
 if not os.path.isfile(full_candidate_path):
 sequences[candidate_path] = None
 continue
 matches = re.match(SEQUENCE_PATTERN, candidate_path)
 if not matches:
 sequences[candidate_path] = None
 continue
 filename = matches.group(1)
 sequence_index = matches.group(2)
 extension = matches.group(3)
 if not extension in FILE_EXTS:
 sequences[candidate_path] = None
 continue
 if not filename in sequences:
 sequences[filename] = {
 'ext': extension,
 'start_index_str': sequence_index,
 'start_index': int(sequence_index),
 'end_index': int(sequence_index),
 }
 continue
 sequences[filename]['end_index'] = int(sequence_index)
 return sequences_strigifier(sequences)
if __name__ == '__main__':
 if len(sys.argv) < 2:
 print('Usage: {} <dirname>'.format(sys.argv[0]))
 exit(1)
 print(find_image_sequences(sys.argv[1]))
answered May 7, 2018 at 18:19
\$\endgroup\$
10
  • \$\begingroup\$ Just a note that for a large set, the stringifier will be pretty slow so I would either print directly out of there or use join()/StringIO. \$\endgroup\$ Commented May 7, 2018 at 18:33
  • \$\begingroup\$ @Coal_ re usage format: Indeed. fixed! Good point on the exit as well. \$\endgroup\$ Commented May 8, 2018 at 13:33
  • \$\begingroup\$ Thanks Srdjan, lots of improvement! Sorry if I was unclear about the usage of this code, I do not want a string as the output so the stringifier part is unnecessary (dictionary works for me). And sys.arv is not needed because this being used as a module within my larger program, and will be called directly. I took your code with some tweaks and inserted it into my original Q, let me know what you think. The use of a dictionary for 'sequences', and regex, definitely cleans things up! However this runs only a tad faster than my original code, any ideas to speed it up? \$\endgroup\$ Commented May 8, 2018 at 19:30
  • 1
    \$\begingroup\$ Please see What to do when someone answers . I have rolled back Rev 5 → 3. \$\endgroup\$ Commented May 9, 2018 at 3:49
  • 1
    \$\begingroup\$ For the record I posted this regex problem on SO and got some good answers here: stackoverflow.com/questions/50311032/… \$\endgroup\$ Commented May 12, 2018 at 22:40

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.