I had to make a program that crawls through a bunch of PyPI packages and counts how many of them implement custom compare operators (i.e. grep for `def __le__` etc.). After downloading an HTML file with links to all Python 3.4 packages on PyPI (i.e. the directory page), I wrote this simple crawler to go through all the links, download and unzip each package, and grep it for custom compare definitions. It's rudimentary, but still, what are your comments? This is my first "shell-script" style Python program, i.e. a program where you're not computing anything, just moving files around and doing networking.
Code:
import sys
assert sys.version_info >= (3, 5)  # subprocess.run() requires Python 3.5+
import re
from requests import get
import subprocess

def run(s):
    return subprocess.run(s,
                          shell=True,
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL)

# download list from Browse Packages -> Python 3.4 -> Show All
directory = open('list.html').read()

custom = no_custom = failures = 0
for (package_url, package_name) in \
        re.findall(r'(https://pypi\.python\.org/pypi/([^/]+)/)', directory):
    print(custom + no_custom + failures,
          custom,
          no_custom,
          failures)
    try:
        package_page = get(package_url).text
        (download_url, file_type) = re.search(r'<a href="(.+)">.+(\.tar\.gz|\.zip)</a>',
                                              package_page).groups()
        print(package_name)
        archive = open('archive', 'wb')
        archive.write(get(download_url).content)
        archive.close()
        run('rm -r package_code')
        run('mkdir package_code')
        if file_type == '.tar.gz':
            run('tar -xzf archive -C package_code')
        if file_type == '.zip':
            run('unzip archive -d package_code')
        # grep exit status: 0 = match found, 1 = no match, >1 = error
        return_code = run('grep -Er "def __(le|lt|ge|gt)__" ./package_code').returncode
        if return_code == 0:
            custom += 1
        elif return_code == 1:
            no_custom += 1
        else:
            failures += 1
    except (KeyboardInterrupt, SystemExit):
        raise
    except Exception as exception:  # for when there is no .tar.gz or .zip on PyPI
                                    # or (rarely) when the connection is dropped
        print("FAILURE:", type(exception).__name__)
        failures += 1

print("""
Packages that define custom compare operators: %i
Packages that don't define custom operators: %i
Packages that didn't have source on PyPI: %i
""" % (custom, no_custom, failures))
Answer:
Here are some comments/notes about the code and potential improvements:
- Use the `with` context manager when opening files.
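  For example, the archive download from the question could be written like this (a minimal sketch reusing the question's `download_url` variable and `get` import):

      from requests import get

      # the file is closed automatically, even if get() raises an exception
      with open('archive', 'wb') as archive:
          archive.write(get(download_url).content)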
- Parsing HTML with regular expressions has always been a very controversial thing to do. I would switch to an HTML parser like `BeautifulSoup` or `lxml.html`. For example, getting all the PyPI links with `BeautifulSoup` can be as straightforward as:

      from bs4 import BeautifulSoup

      with open('list.html') as directory:
          soup = BeautifulSoup(directory, "html.parser")

      for link in soup.select("a[href*=pypi]"):
          print(link.get_text())
  where `a[href*=pypi]` is a CSS selector that matches all `a` elements that have the `pypi` substring inside the `href` attribute.
- Instead of using `requests.get()` directly, initialize a "session" to reuse the underlying TCP connection. As the requests documentation notes:

  > ..if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase..
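  Applied to this crawler, that could look like the following sketch (reusing the question's variable names):

      import requests

      session = requests.Session()

      # both requests go to pypi.python.org, so the pooled TCP
      # connection is reused instead of being re-established
      package_page = session.get(package_url).text
      archive_bytes = session.get(download_url).content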
- If you want to scale this up, you would need to switch from a synchronous, blocking approach to an asynchronous one - look into the Scrapy web-scraping framework, which is based on the twisted networking library.
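  To give an idea of the shape such a rewrite would take, here is a minimal sketch of a Scrapy spider (the spider name, start URL, and link filter are illustrative assumptions, not a drop-in replacement for the script above):

      import scrapy

      class PyPILinksSpider(scrapy.Spider):
          # hypothetical name and start page, for illustration only
          name = "pypi_links"
          start_urls = ["https://pypi.python.org/pypi"]

          def parse(self, response):
              # Scrapy schedules these requests asynchronously instead of
              # blocking on each download the way requests.get() does
              for href in response.css('a::attr(href)').extract():
                  if '/pypi/' in href:
                      yield scrapy.Request(response.urljoin(href),
                                           callback=self.parse_package)

          def parse_package(self, response):
              # a real spider would download and inspect the package here
              yield {'package_url': response.url}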