I am new to Python and to programming in general. I got to Chapter 13 of Automate the Boring Stuff with Python, which teaches web scraping. I succeeded in some of the projects, but I'm having trouble getting the HTML from pypi.org. The task is to fetch a search results page there and open the first few links. I can do it if I manually download the page as an .html file, but I can't do it through the URL.
Here is the code that's supposed to work:
import requests, sys, webbrowser, bs4
print('Searching...') # Display text while downloading the search results page.
res = requests.get('https://pypi.org/search/?q=' + ' '.join(sys.argv[1:]))
res.raise_for_status()
# Retrieve top search result links.
soup = bs4.BeautifulSoup(res.text, 'html.parser') # Error happens here
# Open a browser tab for each result.
link_elems = soup.select('.package-snippet') # Returns an empty list
num_open = min(5, len(link_elems))
for i in range(num_open):
    url_to_open = 'https://pypi.org' + link_elems[i].get('href')
    print('Opening', url_to_open)
    webbrowser.open(url_to_open)
Here is what the BeautifulSoup object contains: https://pastebin.com/nkfPuh9h
I turned off uBlock, so judging by the message `Failed to load script: ${src}, Please contact the service administrator.` and this text in the response:
A required part of this site couldn’t load. This may be due to a browser
extension, network issues, or browser settings.
Maybe it has something to do with JavaScript? That code was provided in full by Automate the Boring Stuff, so pypi.org could have changed something since then. I would just like to know whether the mistake is on my end, because at the end I will have to write a script to download images from Imgur or Flickr, and I won't be able to do that if I can't do simple web scraping.
- The implementation of pypi.org has almost certainly changed since that tutorial was written. (jackal, 2025-12-06 06:22 UTC)
- "This less than ideal outcome is indeed the result of needing to protect PyPI against automated/scripted access/scraping." - discuss.python.org/t/fastly-interfering-with-pypi-search/73597/… (jqurious, 2025-12-06 09:47 UTC)
2 Answers
For this sort of thing, your browser's network log is your friend. Observe that on the first request from the browser (and, indeed, from requests) you get the JavaScript boilerplate response. That first request carried only this cookie:
{
"Request Cookies": {
"session_id": "j0oYxE1-Qhzs6fBO-_1A3FDqi29ebG-_AJ_emhCNOmY.aTQpmg.pB-_QnxyQf1DD0AGT-dMtXF1LMe2uFIWEyqzvD0GjRmyWX7m3Dzdsmva7BXKJMcf3iPhGjbYFFNe_CvEcASCiQ"
}
}
Then the browser and server trade 11 different requests. After that process, there are two more cookies:
{
"Request Cookies": {
"_fs_ch_cp_79UUvfpJ5mWYtLQv": "AfbhJyysvvsbJtu8HXHh5H5sxF6I9kfNC0HMz5OD1EiIA_P4Sg43tHdjJURyqFfnHrKHnOSTJ81FEZ3xRCG1xJPSO8App_Fyp36mtYQJxblln_mY3Iyvbk5xGfdLGGOrpf0iUuDbNMGjpO-zLGKyfa4YNaLkMcRjODp2nig5eilUenoVyyejGPM1RMltCWCFOgQoNSVmEJpOi5tGGyQpOWOGSjQxjIZLn8rYUQAA_eU_J7zHDHuc9U8rQB15DUe0oe-u6bBwWv8A9SsY048qvDD3cf7NfjidtFv7uUf41Tq0tk4qUYgf1pme6DF1NXCSPNlB4at7-7Q8Q0E=",
"_fs_ch_st_FSBmUei20MqUiJb9": "Ae07IWGsUCGnAkUElmKBqsqkpkg0zeBmWVu3w8n8wi3U0w9ETq3TtIhaPNy8yqDP6PFbSIquGsq0xgGIQZHTOXxaMdRiRH9S0w9O7SsX0mlu8n3h_5bovvkAHf28KEfbviJcNpjnqnJj6QbACn3XoU5d0DHIkK0tAAnUT49o5YdfL1dsq-paCUgU0y7d-jmDBMgrHyGWcyH1O8WxK4ROsuJAsKOgP6IcizapL82yKo2PtjQJRcQDRzBtEG0KqUlTFFE78Gkd3q_LHke5mgL9ttgvsW8WO-iZRMzOzuuYdzulHlNUChA_w7eq4SalQrRlM8hQ4C1BXLBd6BOjpb8Gi-pN67q9vRAi_6iVqO62RvcvVXnUQCdOlwLUGPEUdgMEju3EoZJwWJS3vuJ9",
"session_id": "j0oYxE1-Qhzs6fBO-_1A3FDqi29ebG-_AJ_emhCNOmY.aTQpmg.pB-_QnxyQf1DD0AGT-dMtXF1LMe2uFIWEyqzvD0GjRmyWX7m3Dzdsmva7BXKJMcf3iPhGjbYFFNe_CvEcASCiQ"
}
}
Without those cookies, the server will not authorise actual content to be returned to you. This is probably by design, since PyPI uses Fastly; see "Protecting Against Scrapers with Fastly Bot Management".
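To see the same thing from a script rather than the browser, a rough sketch like the following lists which cookies a `requests` session ends up holding. It assumes the challenge cookies keep the `_fs_ch_` prefix seen in the names above, which may well change; the helper names (`challenge_cookie_names`, `report`) are made up for illustration.

```python
# Assumption: the prefix is copied from the cookie names shown above
# and is not a documented, stable identifier.
CHALLENGE_COOKIE_PREFIX = "_fs_ch_"


def challenge_cookie_names(cookie_names):
    """Return, sorted, the cookie names that look like challenge cookies."""
    return sorted(
        name for name in cookie_names if name.startswith(CHALLENGE_COOKIE_PREFIX)
    )


def report(cookie_names):
    """Summarise whether any challenge cookies are among the given names."""
    found = challenge_cookie_names(cookie_names)
    if found:
        return "challenge cookies present: " + ", ".join(found)
    return "no challenge cookies; expect the JavaScript boilerplate page"


def main():
    # Imported here so the helpers above work without requests installed.
    import requests

    session = requests.Session()
    response = session.get(
        "https://pypi.org/search/", params={"q": "requests"}, timeout=10
    )
    print(response.status_code)
    # A plain requests session never runs the page's JavaScript, so it is
    # not expected to pick up the extra cookies a browser obtains.
    print(report(session.cookies.get_dict().keys()))


if __name__ == "__main__":
    main()
```

Comparing this output with the cookies in the browser's network log should make the difference between the two clients visible.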
Long story short: if you care about PyPI specifically, scraping it is going to be difficult and counter to the intent of the web host. If all you care about is learning how to perform simple scraping, choose a different site that does not require dynamic auth.
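For practising the scraping logic itself, here is a minimal sketch that runs on static HTML with no network access. It uses the standard library's `html.parser` instead of bs4 so it works without extra installs. The sample markup, the `example.com` base URL, and the names `SnippetLinkParser` and `extract_links` are all invented for illustration; only the `package-snippet` class comes from the question.

```python
from html.parser import HTMLParser


class SnippetLinkParser(HTMLParser):
    """Collect href values from <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # An attribute without a value is reported as None, hence the "or".
        classes = (attr_map.get("class") or "").split()
        href = attr_map.get("href")
        if self.css_class in classes and href:
            self.links.append(href)


def extract_links(html, css_class="package-snippet", limit=5):
    """Return up to `limit` hrefs of matching links, in document order."""
    parser = SnippetLinkParser(css_class)
    parser.feed(html)
    return parser.links[:limit]


# Hypothetical static page standing in for a site that serves plain HTML.
SAMPLE_HTML = """
<ul>
  <li><a class="package-snippet" href="/project/alpha/">alpha</a></li>
  <li><a class="other" href="/ignored/">ignored</a></li>
  <li><a class="package-snippet featured" href="/project/beta/">beta</a></li>
</ul>
"""

if __name__ == "__main__":
    for href in extract_links(SAMPLE_HTML):
        print("Opening", "https://example.com" + href)
```

The same function can be fed the text of a page saved to disk, or `res.text` from any site that returns its content without a JavaScript step.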
1 Comment
It does have something to do with JavaScript.
In the response, you can see the following lines:
<span class="noscript-span">JavaScript is disabled in your browser.</span>
<p>Please enable JavaScript to proceed.</p>
This suggests that the PyPI website uses JavaScript to load its contents. The website expects you to send the request through a browser, which receives exactly the HTML response you are seeing and then runs some JavaScript code to retrieve the actual contents.
Since you are sending the request and receiving the response through `requests`, you get that HTML response but not the actual contents, because the JavaScript code is never executed.
Unfortunately, this is a limitation of using requests and bs4 for web scraping. To overcome it, you may want to look into Selenium.
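A quick way to notice this situation in a script is to look for the noscript text quoted above before parsing. This is only a sketch: the marker strings are copied from this particular response and may change, and `looks_like_js_challenge` is a made-up helper name.

```python
# Assumption: these strings are copied from the response quoted above;
# the site may change them at any time.
CHALLENGE_MARKERS = (
    "JavaScript is disabled in your browser",
    "Please enable JavaScript to proceed",
)


def looks_like_js_challenge(html):
    """Return True if the HTML contains any of the known noscript markers."""
    return any(marker in html for marker in CHALLENGE_MARKERS)


if __name__ == "__main__":
    sample = "<p>Please enable JavaScript to proceed.</p>"
    if looks_like_js_challenge(sample):
        print("Got the JavaScript placeholder page, not the real content.")
```

In the script from the question, calling this on `res.text` right after `res.raise_for_status()` would explain an empty `soup.select(...)` result instead of failing silently.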