I'm using the following lump of code to manage a 9.4gb dataset. I had to divide the dataset into multiple github repositories to be able to do this. I've explained what each block of code does.
git_repo_tags = ['AB', 'C', 'DEF', 'G', 'HILMNO', 'PR', 'STW', 'X']
counter = 1
# Cloning the github repositories
print('Beginning cloning...')
for repo in git_repo_tags:
git.Git('.').clone('git://github.com/utility-repos/' + repo)
print('-\nCloning ' + repo)
#Removing the .git folder from each repo
shutil.rmtree(repo + '/.git')
print('--Removing the .git folder ' + str(counter) + '/8')
counter += 1
# Creating the Food-101/images directory and subdirectory if it doesn't already exist
if not os.path.exists('Food-101/images'):
os.makedirs('Food-101/images')
print('Created the Food-101/images')
# Going through the repo X and moving everything a branch up
for i in os.listdir('X'):
shutil.move(os.path.join('X', i), 'Food-101')
print('Moved important files to an upper branch')
# Going through the other repos and moving everything to Food-101/images
for directory in git_repo_tags:
for subdirectory in os.listdir(directory):
shutil.move(os.path.join(directory, subdirectory), 'Food-101/images')
print('Moving ' + subdirectory + ' to Food-101/images')
#After the above code is complete, moves all test images to the Food-101/test folder and renames them
print('\n-Beginning to separate the test dataset...')
if not os.path.exists('Food-101/test'):
os.makedirs('Food-101/test')
with open('Food-101/meta/test.txt') as test_file:
for line in test_file:
name_of_folder = line.split('/')[0]
name_of_file = line.split('/')[1].rstrip()
Path('Food-101/images/' + name_of_folder + '/' + name_of_file + '.jpg').rename('Food-101/test/' + name_of_folder + '_' + name_of_file + '.jpg')
print('--Moved Food-101/images/' + name_of_folder + '/' + name_of_file + '.jpg to Food-101/test/')
# Moves all training images to the Food-101/images directory and renames them
print('\n-Beginning to separate the training dataset...')
with open('Food-101/meta/train.txt') as train_file:
for line in train_file:
name_of_folder = line.split('/')[0]
name_of_file = line.split('/')[1].rstrip()
Path('Food-101/images/' + name_of_folder + '/' + name_of_file + '.jpg').rename('Food-101/images/' + name_of_folder + '_' + name_of_file + '.jpg')
print('--Moved Food-101/images/' + name_of_folder + '/' + name_of_file + '.jpg to Food-101/train/')
# Removes empty directories inside Food-101/images
with open('Food-101/meta/train.txt') as train_file:
for folder in train_file:
name_of_folder = folder.split('/')[0]
if os.path.exists('Food-101/images/' + name_of_folder):
shutil.rmtree('Food-101/images/' + name_of_folder)
# Removes empty directories
for dirs in git_repo_tags:
shutil.rmtree(dirs)
This code works but its a mess and I have too many repeats. What is a good way to clean this up?
1 Answer 1
Enumerate
counter = 1
for repo in git_repo_tags:
# ...
print('--Removing the .git folder ' + str(counter) + '/8')
counter += 1
should be using enumerate
:
for counter, repo in enumerate(git_repo_tags, start=1):
String interpolation
print('--Removing the .git folder ' + str(counter) + '/8')
can be
print(f'--Removing the .git folder {counter}/{len(git_repo_tags)}')
The 8 should not be hard-coded.
Pathlib
For basically every one of your directory and file names, and many of your file operations (rmtree
being an exception), you should consider using pathlib.Path
. For instance, this:
if not os.path.exists('Food-101/images'):
os.makedirs('Food-101/images')
print('Created the Food-101/images')
can be
image_path = Path('Food-101/images')
if not image_path.exists():
image_path.mkdir(parents=True)
print(f'Created {image_path}')
Path parsing
Rather than this:
name_of_folder = line.split('/')[0]
name_of_file = line.split('/')[1].rstrip()
consider at least unpacking it, i.e.
folder_name, file_name = line.rsplit('/', 1)
But it's better to again use pathlib
:
line_path = Path(line)
folder_name = line_path.parent
file_name = line_path.name
Functions
Move logically-related chunks of code to subroutines for better legibility, maintainability, modularity, testability, etc.
Indentation
Use four spaces, which is more standard. You do this in some places but not others.