I have a .csv file that contains student GitHub repository assignment submissions. I wrote a script that goes to each repository and extracts the YouTube video link that they must have provided in their README file.
The structure of the CSV file is as follows:
Timestamp,Name,Student Number,Git Repo link
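For illustration, a row in that layout might look like this (the name, student number, and link are invented):

```
Timestamp,Name,Student Number,Git Repo link
2017/02/08 10:15,Jane Doe,12345678,https://github.com/janedoe/assignment1
```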
```python
#!/usr/bin/python3
import csv
import github3
import time
import re
import argparse
from secrets import username, password

# API rate limit for authenticated requests is way higher than anonymous, so login.
gh = github3.login(username, password=password)
# gh = github3.GitHub()  # Anonymous


def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("filepath", type=str, metavar="filepath", help="Filepath to the input csv file.")
    args = parser.parse_args()
    args = vars(args)  # Turn into dict-like view.
    return args


def get_row_count(filename):
    with open(filename, 'r') as file:
        return sum(1 for row in csv.reader(file))


def get_repositories(link):
    if gh.rate_limit()['resources']['search']['remaining'] == 0:
        print("API rate exceeded, sleeping for {0} seconds.".format(gh.rate_limit()['resources']['search']['reset'] - int(time.time() + 1)))
        time.sleep(gh.rate_limit()['resources']['search']['reset'] - int(time.time() + 1))
    return gh.search_repositories(link.replace("https://github.com/", "", 1), "", 1)


def main():
    filepath = parse_args()['filepath']
    if not filepath.endswith('.csv'):
        print("Input file must be a .csv file.")
        exit()

    p = re.compile(r"http(?:s?):\/\/(?:www\.)?youtu(?:be\.com\/watch\?v=|\.be\/)([\w\-\_]*)(&(amp;)?[\w\?=]*)?")  # From http://stackoverflow.com/a/3726073/6549676

    row_counter = 0
    row_count = get_row_count(filepath)
    with open(filepath, 'r') as infile, open(filepath[:-4] + "_ytlinks.csv", "w") as outfile:
        reader = csv.reader(infile)
        next(reader, None)  # Skip header
        writer = csv.writer(outfile)
        writer.writerow(["Youtube Link", "Name", "GitHub Link"])  # Write header

        for row in reader:
            for repo in get_repositories(row[3]):
                readme = repo.repository.readme().decoded
                if not readme:
                    readme = "No Youtube link found."
                if type(readme) is bytes:
                    readme = readme.decode('utf-8')

                ids = p.findall(readme)
                if len(ids) != 0:
                    ids = ids[0]
                ids = [x for x in ids if x]

                for _id in ids:
                    writer.writerow(['https://www.youtube.com/watch?v={0}'.format(_id), row[1], row[3]])
                if len(ids) == 0:
                    writer.writerow(['No Youtube Link Found', row[1], row[3]])

            print('Processed row {0} out of {1}'.format(row_counter, row_count))
            row_counter += 1
    print("Finished.")


if __name__ == "__main__":
    main()
```
2 Answers
Here are some concerns/suggestions:
- you are reading the file twice: once to get the row count and again when reading the links. And, you don't need to initialize a `csv.reader` to get the row count; simply use `sum()` over the lines in the file. You would probably need to call `infile.seek(0)` after getting the count and before initializing the csv reader
- use `_` for the throw-away variables (when counting the number of lines)
- `if len(ids) == 0:` can be simplified as `if not ids:`
- it looks like you don't need `.findall()` and should use the `.search()` method, since you are after a single match
- if there is a single repository link per line, you probably should have a `get_repository()` method instead of `get_repositories()` and avoid the `for repo in get_repositories(row[3]):` loop. Remember, "Flat is better than nested"
- instead of handling the enumeration with `row_counter` manually, use `enumerate()`
- instead of accessing the current row fields by index, e.g. `row[1]` or `row[3]`, you can unpack the row in the for loop, something like (an example, I don't know your actual CSV input format): `for index, username, _, github_link in reader:`. Or, you can use a `csv.DictReader`; accessing the fields by column names instead of indexes would improve readability, e.g. `row["github_link"]` instead of `row[3]`
- you don't have to convert the `args` to a dictionary; return `args` and then access the arguments using dot notation, e.g. `args.filepath`
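Putting several of these together, a sketch of the reading loop could look like the following. Since I don't have your data or GitHub credentials, an in-memory sample stands in for the real file, and a hard-coded string stands in for the README contents:

```python
import csv
import io
import re

# In-memory stand-in for the real .csv file (invented sample data).
sample = io.StringIO(
    "Timestamp,Name,Student Number,Git Repo link\n"
    "2017/02/08,Alice,12345678,https://github.com/alice/assignment1\n"
)

# re.search returns the first match (or None), so no findall/indexing dance.
p = re.compile(r"https?://(?:www\.)?youtu(?:be\.com/watch\?v=|\.be/)([\w-]+)")

results = []
reader = csv.reader(sample)
next(reader, None)  # skip header
# enumerate replaces the manual row_counter; unpacking replaces row[1]/row[3].
for row_number, (timestamp, name, student_number, github_link) in enumerate(reader, start=1):
    readme = "Demo video: https://youtu.be/dQw4w9WgXcQ"  # stand-in for the repo README
    match = p.search(readme)
    if match:
        results.append(('https://www.youtube.com/watch?v={0}'.format(match.group(1)), name, github_link))
    else:
        results.append(('No Youtube Link Found', name, github_link))

print(results)
```

The same shape drops straight into your `main()`: swap the stand-ins for the real `infile` and `repo.repository.readme().decoded`, and write each tuple with `writer.writerow()`.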
`str` is already the default type of any argument parsed with `argparse`. Also, the default `metavar` is just the name itself.

The cool thing about `argparse` is that it allows you to use custom variable parsing functions as `type`. So you could put the file type check (or actually only the file-ending check) there:
```python
import os

def csv_file(filepath):
    if not filepath.endswith('.csv'):
        raise argparse.ArgumentTypeError("Input file must be a .csv file.")
    if not os.path.isfile(filepath):
        raise argparse.ArgumentTypeError("Could not find file {}".format(filepath))
    return filepath

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("filepath", type=csv_file, help="Filepath to the input csv file.")
    return vars(parser.parse_args())
```
If an `argparse.ArgumentTypeError` (or a plain `TypeError`/`ValueError`) is raised in the type function, argparse will catch it, display an error message, and exit the program.
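You can see that behaviour without touching the command line by passing an argument list to `parse_args()` directly (the `notes.txt` filename is just a made-up example; the `os.path.isfile` check is omitted here to keep the snippet self-contained). argparse prints the error and exits with status 2:

```python
import argparse

def csv_file(filepath):
    # Validate the extension here so argparse can report the problem itself.
    if not filepath.endswith('.csv'):
        raise argparse.ArgumentTypeError("Input file must be a .csv file.")
    return filepath

parser = argparse.ArgumentParser()
parser.add_argument("filepath", type=csv_file)

try:
    parser.parse_args(["notes.txt"])  # wrong extension on purpose
except SystemExit as e:
    exit_code = e.code

print(exit_code)  # 2
```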
> "Awesome use of `argparse` custom types! Fits the problem nicely." – alecxe, Feb 10, 2017 at 14:59