I had to make a program that crawls through a bunch of PyPI packages and counts how many of them implement custom compare operators (i.e. grep for `def __le__` etc.). After downloading an HTML file with links to all Python 3.4 packages on PyPI (i.e. the directory page), I wrote this simple crawler to go through all the links, download and unzip each package, and grep it for custom compare definitions. It's rudimentary, but still, what are your comments? This is my first "shell-script" style Python program, i.e. a program where you're not computing anything, just moving files around and doing networking.
Code:
import sys
assert sys.version_info >= (3, 5)  # subprocess.run() requires Python 3.5+
import re
from requests import get
import subprocess

def run(s):
    return subprocess.run(s,
                          shell=True,
                          stdout=subprocess.DEVNULL,
                          stderr=subprocess.DEVNULL)

# download list from Browse Packages -> Python 3.4 -> Show All
directory = open('list.html').read()

custom = no_custom = failures = 0
for (package_url, package_name) in \
        re.findall(r'(https://pypi\.python\.org/pypi/([^/]+)/)', directory):
    print(custom + no_custom + failures,
          custom,
          no_custom,
          failures)
    try:
        package_page = get(package_url).text
        (download_url, file_type) = re.search(r'<a href="(.+)">.+(\.tar\.gz|\.zip)</a>',
                                              package_page).groups()
        print(package_name)
        archive = open('archive', 'wb')
        archive.write(get(download_url).content)
        archive.close()
        run('rm -r package_code')
        run('mkdir package_code')
        if file_type == '.tar.gz':
            run('tar -xzf archive -C package_code')
        if file_type == '.zip':
            run('unzip archive -d package_code')
        # grep exit status: 0 = match found, 1 = no match, >1 = error
        return_code = run('grep -Er "def __(le|lt|ge|gt)__" ./package_code').returncode
        if return_code == 0:
            custom += 1
        elif return_code == 1:
            no_custom += 1
        else:
            failures += 1
    except (KeyboardInterrupt, SystemExit):
        raise
    except Exception as exception:  # for when there is no .tar.gz or .zip on PyPI
                                    # or (rarely) when the connection is dropped
        print("FAILURE:", type(exception).__name__)
        failures += 1

print("""
Packages that define custom compare operators: %i
Packages that don't define custom operators: %i
Packages that didn't have source on PyPI: %i
""" % (custom, no_custom, failures))
Answer:
Here are some comments/notes about the code and potential improvements:
- Use the `with` context manager when opening files.
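  For example, the archive download from the question could be written like this (a minimal sketch reusing the question's `download_url` variable and `get` import):

      from requests import get

      # the file is closed automatically, even if get() raises an exception
      with open('archive', 'wb') as archive:
          archive.write(get(download_url).content)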
- Parsing HTML with regular expressions has always been a very controversial thing to do. I would switch to an HTML parser like `BeautifulSoup` or `lxml.html`. For example, getting all the PyPI links with `BeautifulSoup` can be as straightforward as:

      from bs4 import BeautifulSoup

      with open('list.html') as directory:
          soup = BeautifulSoup(directory, "html.parser")

      for link in soup.select("a[href*=pypi]"):
          print(link.get_text())
  where `a[href*=pypi]` is a CSS selector that matches all `a` elements that have the `pypi` substring inside the `href` attribute.
- Instead of using `requests.get()` directly, initialize a "session" to reuse the underlying TCP connection. As the requests documentation notes:

  > ..if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase..
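  Applied to this crawler, that could look like the following sketch (reusing the question's variable names):

      import requests

      session = requests.Session()

      # both requests go to pypi.python.org, so the pooled TCP
      # connection is reused instead of being re-established
      package_page = session.get(package_url).text
      archive_bytes = session.get(download_url).content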
- If you want to scale this up, you would need to switch from a synchronous, blocking approach to an asynchronous one - look into the Scrapy web-scraping framework, which is based on the twisted networking library.
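  To give an idea of the shape such a rewrite would take, here is a minimal sketch of a Scrapy spider (the spider name, start URL, and link filter are illustrative assumptions, not a drop-in replacement for the script above):

      import scrapy

      class PyPILinksSpider(scrapy.Spider):
          # hypothetical name and start page, for illustration only
          name = "pypi_links"
          start_urls = ["https://pypi.python.org/pypi"]

          def parse(self, response):
              # Scrapy schedules these requests asynchronously instead of
              # blocking on each download the way requests.get() does
              for href in response.css('a::attr(href)').extract():
                  if '/pypi/' in href:
                      yield scrapy.Request(response.urljoin(href),
                                           callback=self.parse_package)

          def parse_package(self, response):
              # a real spider would download and inspect the package here
              yield {'package_url': response.url}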