I have a problem with the requests module in Automate the Boring Stuff with Python - Chapter 13

Question 1

I am new to Python and to programming in general. I have reached Chapter 13 of Automate the Boring Stuff with Python, which teaches web scraping. I succeeded with some of the projects, but I'm having trouble getting the HTML from pypi.org. The task is to fetch a search results page there and open the first few links. I can do it if I manually download the page as a .html file, but I can't do it through the URL.

Here is the code that's supposed to work:

import requests, sys, webbrowser, bs4
print('Searching...')  # Display text while downloading the search results page.
res = requests.get('https://pypi.org/search/?q=' + ' '.join(sys.argv[1:]))
res.raise_for_status()
# Retrieve top search result links.
soup = bs4.BeautifulSoup(res.text, 'html.parser')  # Error happens here
# Open a browser tab for each result.
link_elems = soup.select('.package-snippet')  # Returns an empty list
num_open = min(5, len(link_elems))
for i in range(num_open):
    url_to_open = 'https://pypi.org' + link_elems[i].get('href')
    print('Opening', url_to_open)
    webbrowser.open(url_to_open)

Here is what the beautiful soup object returns: https://pastebin.com/nkfPuh9h

I turned off uBlock, but the output still contains `Failed to load script: ${src}, Please contact the service administrator.` and:

A required part of this site couldn’t load. This may be due to a browser extension, network issues, or browser settings.

Judging by those messages, maybe it has something to do with JavaScript? The code was provided in full by Automate the Boring Stuff, so pypi.org could have changed something since then. I would just like to know whether the mistake is on my end so I can get this right. At the end of the chapter I will have to write a script to download images from Imgur or Flickr, and I won't be able to do that if I can't do simple web scraping.

Question 2

The implementation of pypi.org has almost certainly changed since that tutorial was written.

Question 3

"This less than ideal outcome is indeed the result of needing to protect PyPI against automated/scripted access/scraping." - discuss.python.org/t/fastly-interfering-with-pypi-search/73597/…

Question 4

For this sort of thing, your browser's network log is your friend. Observe that on the first request from the browser (and, indeed, from requests) you get the JavaScript boilerplate response. That was based on a request having these HTTP headers:

{
 "Request Cookies": {
 "session_id": "j0oYxE1-Qhzs6fBO-_1A3FDqi29ebG-_AJ_emhCNOmY.aTQpmg.pB-_QnxyQf1DD0AGT-dMtXF1LMe2uFIWEyqzvD0GjRmyWX7m3Dzdsmva7BXKJMcf3iPhGjbYFFNe_CvEcASCiQ"
 }
}

Then the browser and server trade 11 different requests. After that process, there are two more cookies:

{
 "Request Cookies": {
 "_fs_ch_cp_79UUvfpJ5mWYtLQv": "AfbhJyysvvsbJtu8HXHh5H5sxF6I9kfNC0HMz5OD1EiIA_P4Sg43tHdjJURyqFfnHrKHnOSTJ81FEZ3xRCG1xJPSO8App_Fyp36mtYQJxblln_mY3Iyvbk5xGfdLGGOrpf0iUuDbNMGjpO-zLGKyfa4YNaLkMcRjODp2nig5eilUenoVyyejGPM1RMltCWCFOgQoNSVmEJpOi5tGGyQpOWOGSjQxjIZLn8rYUQAA_eU_J7zHDHuc9U8rQB15DUe0oe-u6bBwWv8A9SsY048qvDD3cf7NfjidtFv7uUf41Tq0tk4qUYgf1pme6DF1NXCSPNlB4at7-7Q8Q0E=",
 "_fs_ch_st_FSBmUei20MqUiJb9": "Ae07IWGsUCGnAkUElmKBqsqkpkg0zeBmWVu3w8n8wi3U0w9ETq3TtIhaPNy8yqDP6PFbSIquGsq0xgGIQZHTOXxaMdRiRH9S0w9O7SsX0mlu8n3h_5bovvkAHf28KEfbviJcNpjnqnJj6QbACn3XoU5d0DHIkK0tAAnUT49o5YdfL1dsq-paCUgU0y7d-jmDBMgrHyGWcyH1O8WxK4ROsuJAsKOgP6IcizapL82yKo2PtjQJRcQDRzBtEG0KqUlTFFE78Gkd3q_LHke5mgL9ttgvsW8WO-iZRMzOzuuYdzulHlNUChA_w7eq4SalQrRlM8hQ4C1BXLBd6BOjpb8Gi-pN67q9vRAi_6iVqO62RvcvVXnUQCdOlwLUGPEUdgMEju3EoZJwWJS3vuJ9",
 "session_id": "j0oYxE1-Qhzs6fBO-_1A3FDqi29ebG-_AJ_emhCNOmY.aTQpmg.pB-_QnxyQf1DD0AGT-dMtXF1LMe2uFIWEyqzvD0GjRmyWX7m3Dzdsmva7BXKJMcf3iPhGjbYFFNe_CvEcASCiQ"
 }
}

Without those cookies, the server will not authorise actual content to be returned to you. This is probably by design, since PyPI uses Fastly; read "Protecting Against Scrapers with Fastly Bot Management".

Long story short, if you actually care about PyPI specifically, scraping it is going to be difficult and counter to the intent of the web host. If all you care about is learning how to perform simple scraping, choose a different site that does not require dynamic auth.
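
As a rough illustration of the cookie comparison described above, the sketch below reports which of the two extra cookies are absent from a collection of cookie names. It assumes that the `_fs_ch_cp_` and `_fs_ch_st_` prefixes seen in this one network log are stable, which is not confirmed anywhere; the helper names `missing_challenge_cookies` and `cookie_names_after_get` are hypothetical and exist only for this example.

```python
from typing import Iterable, List

# Assumption: these prefixes are copied from the single network log shown
# above. The random-looking suffixes are ignored, and the prefixes themselves
# may change at any time.
CHALLENGE_COOKIE_PREFIXES = ("_fs_ch_cp_", "_fs_ch_st_")


def missing_challenge_cookies(cookie_names: Iterable[str]) -> List[str]:
    """Return the challenge-cookie prefixes that no cookie name starts with."""
    names = list(cookie_names)
    return [
        prefix
        for prefix in CHALLENGE_COOKIE_PREFIXES
        if not any(name.startswith(prefix) for name in names)
    ]


def cookie_names_after_get(url: str) -> List[str]:
    """Fetch url with a requests.Session and list the cookie names it holds."""
    # Third-party import kept local so the pure helper above can be used
    # even where requests is not installed.
    import requests

    with requests.Session() as session:
        session.get(url, timeout=10)
        return list(session.cookies.get_dict())
```

If a plain requests session behaves like the browser's first request, `missing_challenge_cookies(cookie_names_after_get('https://pypi.org/search/?q=requests'))` would be expected to list both prefixes, which is one way to confirm that the script never completed the exchange the browser performs. That behaviour was not verified here.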

Question 5

Thank you for the clarification! They likely started using Fastly after the lesson was written, as there wasn't any mention of it. Since I just need to learn the basics, I'm OK with using any other website.

Question 6

It does have something to do with JavaScript.

In the response, you can see the following lines

<span class="noscript-span">JavaScript is disabled in your browser.</span>
<p>Please enable JavaScript to proceed.</p>

This suggests that the PyPI website uses JavaScript to load its contents. The website expects the request to come from a browser, which receives exactly the HTML response you are seeing and then runs some JavaScript code to retrieve the actual contents.

Since you are sending the request and receiving the response through requests, you get the HTML response but not the actual contents, because the JavaScript code is never executed.

Unfortunately, this is a limitation of using requests and bs4 for web scraping. To overcome it, you may want to look into Selenium.
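
One way to make a script notice this situation instead of silently getting an empty list is to check which marker classes the downloaded HTML contains. The sketch below is only an illustration: it relies on the `noscript-span` class from the response above and the `package-snippet` class from the question's selector, both of which the site may change, and the names `page_status` and `_ClassCollector` are hypothetical. It uses the standard library's `html.parser` so it runs without extra installs; with bs4 the same check could be written with `soup.select('.noscript-span')`.

```python
from html.parser import HTMLParser


class _ClassCollector(HTMLParser):
    """Collect every CSS class name that appears on a start tag."""

    def __init__(self):
        super().__init__()
        self.classes = set()  # class names seen so far

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def page_status(html):
    """Classify a downloaded page as 'results', 'javascript-challenge' or 'unknown'."""
    parser = _ClassCollector()
    parser.feed(html)
    parser.close()
    if "package-snippet" in parser.classes:
        return "results"
    if "noscript-span" in parser.classes:
        return "javascript-challenge"
    return "unknown"
```

In the question's script, calling `page_status(res.text)` before building the soup would let it print a clear message when the placeholder page comes back rather than opening no tabs without explanation.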

Question 7

Thanks, I'll now know to try Selenium when facing such issues.

Reinderien · Accepted Answer · 2025-12-06 13:12:20Z

