\$\begingroup\$

I'm using the following code to manage a 9.4 GB dataset. To host it, I had to split the dataset across multiple GitHub repositories. The comments explain what each block of code does.

import os
import shutil
from pathlib import Path

import git

git_repo_tags = ['AB', 'C', 'DEF', 'G', 'HILMNO', 'PR', 'STW', 'X']
counter = 1
# Cloning the github repositories
print('Beginning cloning...')
for repo in git_repo_tags:
    git.Git('.').clone('git://github.com/utility-repos/' + repo)
    print('-\nCloning ' + repo)
    #Removing the .git folder from each repo
    shutil.rmtree(repo + '/.git')
    print('--Removing the .git folder ' + str(counter) + '/8')
    counter += 1
# Creating the Food-101/images directory and subdirectory if it doesn't already exist
if not os.path.exists('Food-101/images'):
    os.makedirs('Food-101/images')
    print('Created the Food-101/images')
    # Going through the repo X and moving everything a branch up
    for i in os.listdir('X'):
        shutil.move(os.path.join('X', i), 'Food-101')
    print('Moved important files to an upper branch')
    # Going through the other repos and moving everything to Food-101/images
    for directory in git_repo_tags:
        for subdirectory in os.listdir(directory):
            shutil.move(os.path.join(directory, subdirectory), 'Food-101/images')
            print('Moving ' + subdirectory + ' to Food-101/images')
#After the above code is complete, moves all test images to the Food-101/test folder and renames them
print('\n-Beginning to separate the test dataset...')
if not os.path.exists('Food-101/test'):
    os.makedirs('Food-101/test')
with open('Food-101/meta/test.txt') as test_file:
    for line in test_file:
        name_of_folder = line.split('/')[0]
        name_of_file = line.split('/')[1].rstrip()
        Path('Food-101/images/' + name_of_folder + '/' + name_of_file + '.jpg').rename('Food-101/test/' + name_of_folder + '_' + name_of_file + '.jpg')
        print('--Moved Food-101/images/' + name_of_folder + '/' + name_of_file + '.jpg to Food-101/test/')
# Moves all training images to the Food-101/images directory and renames them
print('\n-Beginning to separate the training dataset...')
with open('Food-101/meta/train.txt') as train_file:
    for line in train_file:
        name_of_folder = line.split('/')[0]
        name_of_file = line.split('/')[1].rstrip()
        Path('Food-101/images/' + name_of_folder + '/' + name_of_file + '.jpg').rename('Food-101/images/' + name_of_folder + '_' + name_of_file + '.jpg')
        print('--Moved Food-101/images/' + name_of_folder + '/' + name_of_file + '.jpg to Food-101/train/')
# Removes empty directories inside Food-101/images
with open('Food-101/meta/train.txt') as train_file:
    for folder in train_file:
        name_of_folder = folder.split('/')[0]
        if os.path.exists('Food-101/images/' + name_of_folder):
            shutil.rmtree('Food-101/images/' + name_of_folder)
# Removes empty directories
for dirs in git_repo_tags:
    shutil.rmtree(dirs)

This code works, but it's a mess and there is too much repetition. What is a good way to clean this up?

asked Apr 25, 2020 at 5:07
\$\endgroup\$

1 Answer

\$\begingroup\$

Enumerate

counter = 1
for repo in git_repo_tags:
    # ...
    print('--Removing the .git folder ' + str(counter) + '/8')
    counter += 1

should be using enumerate:

for counter, repo in enumerate(git_repo_tags, start=1):
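
As a small, self-contained illustration (the loop body here only prints, and the `numbered` name is introduced just for this demo), `enumerate` with `start=1` hands back the counter and the item together:

```python
git_repo_tags = ['AB', 'C', 'DEF', 'G', 'HILMNO', 'PR', 'STW', 'X']

# enumerate(..., start=1) yields (1, 'AB'), (2, 'C'), ... so no manual counter is needed.
numbered = list(enumerate(git_repo_tags, start=1))

for counter, repo in numbered:
    print(counter, repo)
```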

String interpolation

print('--Removing the .git folder ' + str(counter) + '/8')

can be

print(f'--Removing the .git folder {counter}/{len(git_repo_tags)}')

The 8 should not be hard-coded.
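
To make that concrete, here is a minimal sketch in which the total is derived from the list itself. The helper name `progress_message` is hypothetical and not part of the original script:

```python
def progress_message(counter, items):
    """Build the progress line; the total comes from len(items), not a literal."""
    return f'--Removing the .git folder {counter}/{len(items)}'


git_repo_tags = ['AB', 'C', 'DEF', 'G', 'HILMNO', 'PR', 'STW', 'X']
for counter, repo in enumerate(git_repo_tags, start=1):
    print(progress_message(counter, git_repo_tags))
```

If a repository is later added to or removed from the list, the message stays correct without further edits.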

Pathlib

For basically every one of your directory and file names, and many of your file operations (rmtree being an exception), you should consider using pathlib.Path. For instance, this:

if not os.path.exists('Food-101/images'):
    os.makedirs('Food-101/images')
    print('Created the Food-101/images')

can be

image_path = Path('Food-101/images')
if not image_path.exists():
    image_path.mkdir(parents=True)
    print(f'Created {image_path}')
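
Two further `pathlib` conveniences may be worth a look: the `/` operator for joining path components, and `mkdir(exist_ok=True)`, which removes the need for an explicit existence check when nothing else depends on whether the directory was newly created. The sketch below runs in a temporary directory, which is an assumption made only so the example is safe to execute anywhere:

```python
import tempfile
from pathlib import Path

# A temporary directory stands in for the real working directory (demo assumption).
with tempfile.TemporaryDirectory() as tmp:
    image_path = Path(tmp) / 'Food-101' / 'images'

    # parents=True also creates Food-101; exist_ok=True makes a repeated call a no-op.
    image_path.mkdir(parents=True, exist_ok=True)
    image_path.mkdir(parents=True, exist_ok=True)

    created = image_path.is_dir()
    print(f'Created {image_path}: {created}')
```

Note that in the original script the existence check also guards the file moves, so swapping in `exist_ok=True` there would change behaviour unless the moves are guarded some other way.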

Path parsing

Rather than this:

name_of_folder = line.split('/')[0]
name_of_file = line.split('/')[1].rstrip()

consider at least unpacking it, i.e.

folder_name, file_name = line.rstrip().rsplit('/', 1)

But it's better to again use pathlib:

line_path = Path(line.rstrip())
folder_name = line_path.parent.name
file_name = line_path.name
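
Wrapped up as a helper, the parsing might look like the sketch below. The function name `parse_meta_line` and the sample line are illustrative assumptions; `PurePosixPath` is used because the entries in the meta files are separated by `/` regardless of the operating system running the script:

```python
from pathlib import PurePosixPath


def parse_meta_line(line):
    """Split a line such as 'apple_pie/1005649\\n' into (folder_name, file_name)."""
    # strip() drops the trailing newline that file iteration leaves on each line.
    line_path = PurePosixPath(line.strip())
    return line_path.parent.name, line_path.name


folder_name, file_name = parse_meta_line('apple_pie/1005649\n')
print(folder_name, file_name)
```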

Functions

Move logically-related chunks of code to subroutines for better legibility, maintainability, modularity, testability, etc.
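
As one possible sketch of this, the near-identical test and training loops could share a single function. Everything named here (`move_listed_images`, its parameters, and the throwaway directory tree used for the demonstration) is hypothetical rather than taken from the original script:

```python
import tempfile
from pathlib import Path, PurePosixPath


def move_listed_images(meta_file, images_dir, dest_dir):
    """Move each image listed in meta_file from images_dir/<folder>/<file>.jpg
    to dest_dir/<folder>_<file>.jpg and return the number of files moved."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    moved = 0
    with open(meta_file) as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            entry = PurePosixPath(line.strip())
            folder_name, file_name = entry.parent.name, entry.name
            source = images_dir / folder_name / f'{file_name}.jpg'
            source.rename(dest_dir / f'{folder_name}_{file_name}.jpg')
            moved += 1
    return moved


# Tiny demonstration on a throwaway directory tree (all file names are made up).
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / 'Food-101'
    (root / 'images' / 'apple_pie').mkdir(parents=True)
    (root / 'images' / 'apple_pie' / '1.jpg').touch()
    (root / 'meta').mkdir()
    (root / 'meta' / 'test.txt').write_text('apple_pie/1\n')

    moved_count = move_listed_images(
        root / 'meta' / 'test.txt', root / 'images', root / 'test')
    test_file_exists = (root / 'test' / 'apple_pie_1.jpg').exists()
    print(moved_count, test_file_exists)
```

The same function could then be called once for `test.txt` and once for `train.txt`, with only the destination directory changing.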

Indentation

Use four spaces per indentation level, which is the standard that PEP 8 recommends. You do this in some places but not others.

answered Apr 27, 2020 at 17:14
\$\endgroup\$
