\$\begingroup\$

I had to write a program that crawls through a bunch of PyPI packages and counts how many of them implement custom comparison operators (i.e. grep for def __le__ etc.). After downloading an HTML file with links to all Python 3.4 packages on PyPI (i.e. the directory page), I wrote this simple crawler to go through all the links, download and unzip each package, and grep it for custom comparison definitions. It's rudimentary, but still, what are your comments? This is my first "shell-script" style Python program, i.e. a program where you're not computing stuff but just moving files around and doing networking.

Code:

import sys
assert(sys.version_info >= (3,5))
import re
from requests import get
import subprocess
def run(s):
    return subprocess.run(s,
                          shell=True,
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL)
directory = open('list.html').read() #download list from Browse Packages ->
                                     # Python 3.4 -> Show All
custom = no_custom = failiures = 0
for (package_url, package_name) in \
        re.findall('(https://pypi\.python\.org/pypi/([^/]+)/)', directory):
    print(custom+no_custom+failiures,
          custom,
          no_custom,
          failiures)
    try:
        package_page = get(package_url).text
        (download_url,file_type) = re.search('<a href="(.+)">.+(\.tar\.gz|\.zip)</a>',
                                             package_page).groups()
        print(package_name)
        archive = open('archive', 'wb')
        archive.write(get(download_url).content)
        archive.close()
        run('rm -r package_code')
        run('mkdir package_code')
        if file_type == '.tar.gz':
            run('tar -xzf archive -C package_code')
        if file_type == '.zip':
            run('unzip archive -d package_code')
        return_code = run('grep -Er "def __(le|lt|ge|gt)__" ./package_code').returncode
        if return_code == 0:
            custom += 1
        elif return_code == 1:
            no_custom += 1
        else:
            failiures += 1
    except (KeyboardInterrupt, SystemExit):
        raise
    except Exception as exception: #for when there is no .tar.gz or .zip on PyPI
                                   #or (rarely) when the connection is dropped
        print("FAILIURE:", type(exception).__name__)
        failiures += 1
print("""
Packages that define custom compare operators: %i
Packages that don't define custom operators: %i
Packages that didn't have source on PyPI: %i
""" % (custom, no_custom, failiures))
asked Feb 20, 2017 at 0:52
\$\endgroup\$

1 Answer

\$\begingroup\$

Here are some comments/notes about the code and potential improvements:

If you want to scale this up, you would need to switch from a synchronous, blocking approach to an asynchronous one. Look into the Scrapy web-scraping framework, which is built on the Twisted networking library.
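To give a feel for what "asynchronous" buys you here, below is a minimal sketch. Note that it deliberately uses the standard library's asyncio rather than Scrapy or Twisted, so it runs without extra dependencies; it illustrates the idea, not how a Scrapy spider is written. The names check_package, crawl, fake_fetch and FAKE_SOURCES are hypothetical, and fake_fetch is only a stand-in for the real download-and-unpack step in the question.

```python
import asyncio
import re

# Same operator pattern the question greps for.
COMPARE_DEF = re.compile(r"def __(le|lt|ge|gt)__")


async def check_package(name, fetch, limit):
    """Return (name, found); found is None when the fetch failed."""
    async with limit:  # cap the number of simultaneous fetches
        try:
            source = await fetch(name)
        except Exception:
            return name, None
    return name, bool(COMPARE_DEF.search(source))


async def crawl(names, fetch, max_concurrent=10):
    """Check all packages concurrently and return the three counters."""
    limit = asyncio.Semaphore(max_concurrent)
    results = await asyncio.gather(
        *(check_package(name, fetch, limit) for name in names)
    )
    custom = sum(1 for _, found in results if found is True)
    no_custom = sum(1 for _, found in results if found is False)
    failures = sum(1 for _, found in results if found is None)
    return custom, no_custom, failures


# Hypothetical stand-in data: package name -> concatenated source text.
FAKE_SOURCES = {
    "alpha": "class A:\n    def __lt__(self, other): ...\n",
    "beta": "def helper(): pass\n",
}


async def fake_fetch(name):
    """Assumed placeholder for downloading and unpacking one package."""
    await asyncio.sleep(0.01)  # simulates network latency
    return FAKE_SOURCES[name]  # a KeyError stands in for a missing archive


if __name__ == "__main__":
    print(asyncio.run(crawl(["alpha", "beta", "gamma"], fake_fetch)))
```

While one package is waiting on the network, the others make progress, and the semaphore keeps the number of in-flight requests bounded so the server is not flooded. A framework like Scrapy handles this kind of scheduling (plus retries and throttling) for you.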

answered Feb 20, 2017 at 14:35
\$\endgroup\$
