I have a .csv file that contains student GitHub repository assignment submissions. I wrote a script that goes to each repository and extracts the YouTube video link that they must have provided in their README file.
The structure of the CSV file is as follows:
Timestamp,Name,Student Number,Git Repo link
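For illustration, a row in that layout might look like this (the name, student number, and link are invented):

```
Timestamp,Name,Student Number,Git Repo link
2017/02/08 10:15,Jane Doe,12345678,https://github.com/janedoe/assignment1
```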
```python
#!/usr/bin/python3
import csv
import github3
import time
import re
import argparse
from secrets import username, password

# API rate limit for authenticated requests is way higher than anonymous, so login.
gh = github3.login(username, password=password)
# gh = github3.GitHub()  # Anonymous


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("filepath", type=str, metavar="filepath", help="Filepath to the input csv file.")
    args = parser.parse_args()
    args = vars(args)  # Turn into dict-like view.
    return args


def get_row_count(filename):
    with open(filename, 'r') as file:
        return sum(1 for row in csv.reader(file))


def get_repositories(link):
    if gh.rate_limit()['resources']['search']['remaining'] == 0:
        print("API rate exceeded, sleeping for {0} seconds.".format(gh.rate_limit()['resources']['search']['reset'] - int(time.time() + 1)))
        time.sleep(gh.rate_limit()['resources']['search']['reset'] - int(time.time() + 1))
    return gh.search_repositories(link.replace("https://github.com/", "", 1), "", 1)


def main():
    filepath = parse_args()['filepath']
    if not filepath.endswith('.csv'):
        print("Input file must be a .csv file.")
        exit()

    p = re.compile(r"http(?:s?):\/\/(?:www\.)?youtu(?:be\.com\/watch\?v=|\.be\/)([\w\-\_]*)(&(amp;)?[\w\?=]*)?")  # From http://stackoverflow.com/a/3726073/6549676

    row_counter = 0
    row_count = get_row_count(filepath)
    with open(filepath, 'r') as infile, open(filepath[:-4] + "_ytlinks.csv", "w") as outfile:
        reader = csv.reader(infile)
        next(reader, None)  # Skip header
        writer = csv.writer(outfile)
        writer.writerow(["Youtube Link", "Name", "GitHub Link"])  # Write header

        for row in reader:
            for repo in get_repositories(row[3]):
                readme = repo.repository.readme().decoded
                if not readme:
                    readme = "No Youtube link found."
                if type(readme) is bytes:
                    readme = readme.decode('utf-8')

                ids = p.findall(readme)
                if len(ids) != 0:
                    ids = ids[0]
                ids = [x for x in ids if x]

                for _id in ids:
                    writer.writerow(['https://www.youtube.com/watch?v={0}'.format(_id), row[1], row[3]])
                if len(ids) == 0:
                    writer.writerow(['No Youtube Link Found', row[1], row[3]])

            print('Processed row {0} out of {1}'.format(row_counter, row_count))
            row_counter += 1
    print("Finished.")


if __name__ == "__main__":
    main()
```
2 Answers
Here are some concerns/suggestions:
- you are reading the file twice: once to get the row count and again when reading the links. And, you don't need to initialize a `csv.reader` to get the row count; simply use `sum()` over the lines in the file. You would probably need to call `infile.seek(0)` after getting the count and before initializing the csv reader
- use `_` for the throw-away variables (when counting the number of lines)
- `if len(ids) == 0:` can be simplified as `if not ids:`
- it looks like you don't need `.findall()` and should use the `.search()` method, since you are after a single match
- if there is a single repository link per line, you probably should have a `get_repository()` method instead of `get_repositories()` and avoid the `for repo in get_repositories(row[3]):` loop. Remember, "Flat is better than nested"
- instead of handling the enumeration with `row_counter` manually, use `enumerate()`
- instead of accessing the current row fields by index, e.g. `row[1]` or `row[3]`, you can unpack the row in the for loop, something like (an example, I don't know your actual CSV input format): `for index, username, _, github_link in reader:`. Or, you can use a `csv.DictReader`; accessing the fields by column names instead of indexes would improve readability, e.g. `row["github_link"]` instead of `row[3]`
- you don't have to convert the `args` to a dictionary; return `args` and then access the arguments using dot notation, e.g. `args.filepath`
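Putting several of these together, a sketch of the reading loop could look like the following. Since I don't have your data or GitHub credentials, an in-memory sample stands in for the real file, and a hard-coded string stands in for the README contents:

```python
import csv
import io
import re

# In-memory stand-in for the real .csv file (invented sample data).
sample = io.StringIO(
    "Timestamp,Name,Student Number,Git Repo link\n"
    "2017/02/08,Alice,12345678,https://github.com/alice/assignment1\n"
)

# re.search returns the first match (or None), so no findall/indexing dance.
p = re.compile(r"https?://(?:www\.)?youtu(?:be\.com/watch\?v=|\.be/)([\w-]+)")

results = []
reader = csv.reader(sample)
next(reader, None)  # skip header
# enumerate replaces the manual row_counter; unpacking replaces row[1]/row[3].
for row_number, (timestamp, name, student_number, github_link) in enumerate(reader, start=1):
    readme = "Demo video: https://youtu.be/dQw4w9WgXcQ"  # stand-in for the repo README
    match = p.search(readme)
    if match:
        results.append(('https://www.youtube.com/watch?v={0}'.format(match.group(1)), name, github_link))
    else:
        results.append(('No Youtube Link Found', name, github_link))

print(results)
```

The same shape drops straight into your `main()`: swap the stand-ins for the real `infile` and `repo.repository.readme().decoded`, and write each tuple with `writer.writerow()`.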
`str` is already the default type of any argument parsed with `argparse`. Also, the default `metavar` is just the name itself.

The cool thing about `argparse` is that it allows you to use custom variable parsing functions as `type`. So you could put the file type check (or actually only the file-ending check) there:
```python
import os

def csv_file(filepath):
    if not filepath.endswith('.csv'):
        raise argparse.ArgumentTypeError("Input file must be a .csv file.")
    if not os.path.isfile(filepath):
        raise argparse.ArgumentTypeError("Could not find file {}".format(filepath))
    return filepath

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("filepath", type=csv_file, help="Filepath to the input csv file.")
    return vars(parser.parse_args())
```
If an `argparse.ArgumentTypeError` (or a plain `TypeError`/`ValueError`) is raised in the type function, argparse will catch it, display an error message, and exit the program.
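You can see that behaviour without touching the command line by passing an argument list to `parse_args()` directly (the `notes.txt` filename is just a made-up example; the `os.path.isfile` check is omitted here to keep the snippet self-contained). argparse prints the error and exits with status 2:

```python
import argparse

def csv_file(filepath):
    # Validate the extension here so argparse can report the problem itself.
    if not filepath.endswith('.csv'):
        raise argparse.ArgumentTypeError("Input file must be a .csv file.")
    return filepath

parser = argparse.ArgumentParser()
parser.add_argument("filepath", type=csv_file)

try:
    parser.parse_args(["notes.txt"])  # wrong extension on purpose
except SystemExit as e:
    exit_code = e.code

print(exit_code)  # 2
```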
> "Awesome use of `argparse` custom types! Fits the problem nicely." – alecxe, Feb 10, 2017 at 14:59