I am new to Python and programming in general, and I have reached Chapter 13 of Automate the Boring Stuff with Python, which teaches web scraping. I succeeded with some of the projects, but I'm having trouble getting the HTML from pypi.org. The task is to fetch a search results page from the site and open the first few links. I can do it if I manually download the page as a .html file, but I can't do it through the URL.

Here is the code that's supposed to work:

import requests, sys, webbrowser, bs4
print('Searching...') # Display text while downloading the search results page.
res = requests.get('https://pypi.org/search/?q=' + ' '.join(sys.argv[1:]))
res.raise_for_status()
# Retrieve top search result links.
soup = bs4.BeautifulSoup(res.text, 'html.parser') # Error happens here
# Open a browser tab for each result.
link_elems = soup.select('.package-snippet') # Returns an empty list
num_open = min(5, len(link_elems))
for i in range(num_open):
    url_to_open = 'https://pypi.org' + link_elems[i].get('href')
    print('Opening', url_to_open)
    webbrowser.open(url_to_open)

Here is what the beautiful soup object returns: https://pastebin.com/nkfPuh9h

I have turned off uBlock, so judging by `Failed to load script: ${src}, Please contact the service administrator.` and:

A required part of this site couldn’t load. This may be due to a browser
 extension, network issues, or browser settings.

Maybe it has something to do with JavaScript? That code was provided in full by Automate the Boring Stuff, so pypi.org could have changed something since then. I would just like to know whether the mistake is on my end, because at the end I will have to write a script to download images from Imgur or Flickr, and I won't be able to do that if I can't do simple web scraping.

asked Dec 6 at 1:50
  • The implementation of pypi.org has almost certainly changed since that tutorial was written. Commented Dec 6 at 6:22
  • "This less than ideal outcome is indeed the result of needing to protect PyPI against automated/scripted access/scraping." - discuss.python.org/t/fastly-interfering-with-pypi-search/73597/… Commented Dec 6 at 9:47

2 Answers

For this sort of thing, your browser's network log is your friend. Observe that the first request from the browser (and, indeed, from requests) gets the JavaScript boilerplate response. That response was for a request carrying only this cookie:

{
 "Request Cookies": {
 "session_id": "j0oYxE1-Qhzs6fBO-_1A3FDqi29ebG-_AJ_emhCNOmY.aTQpmg.pB-_QnxyQf1DD0AGT-dMtXF1LMe2uFIWEyqzvD0GjRmyWX7m3Dzdsmva7BXKJMcf3iPhGjbYFFNe_CvEcASCiQ"
 }
}

Then the browser and server trade 11 different requests. After that process, there are two more cookies:

{
 "Request Cookies": {
 "_fs_ch_cp_79UUvfpJ5mWYtLQv": "AfbhJyysvvsbJtu8HXHh5H5sxF6I9kfNC0HMz5OD1EiIA_P4Sg43tHdjJURyqFfnHrKHnOSTJ81FEZ3xRCG1xJPSO8App_Fyp36mtYQJxblln_mY3Iyvbk5xGfdLGGOrpf0iUuDbNMGjpO-zLGKyfa4YNaLkMcRjODp2nig5eilUenoVyyejGPM1RMltCWCFOgQoNSVmEJpOi5tGGyQpOWOGSjQxjIZLn8rYUQAA_eU_J7zHDHuc9U8rQB15DUe0oe-u6bBwWv8A9SsY048qvDD3cf7NfjidtFv7uUf41Tq0tk4qUYgf1pme6DF1NXCSPNlB4at7-7Q8Q0E=",
 "_fs_ch_st_FSBmUei20MqUiJb9": "Ae07IWGsUCGnAkUElmKBqsqkpkg0zeBmWVu3w8n8wi3U0w9ETq3TtIhaPNy8yqDP6PFbSIquGsq0xgGIQZHTOXxaMdRiRH9S0w9O7SsX0mlu8n3h_5bovvkAHf28KEfbviJcNpjnqnJj6QbACn3XoU5d0DHIkK0tAAnUT49o5YdfL1dsq-paCUgU0y7d-jmDBMgrHyGWcyH1O8WxK4ROsuJAsKOgP6IcizapL82yKo2PtjQJRcQDRzBtEG0KqUlTFFE78Gkd3q_LHke5mgL9ttgvsW8WO-iZRMzOzuuYdzulHlNUChA_w7eq4SalQrRlM8hQ4C1BXLBd6BOjpb8Gi-pN67q9vRAi_6iVqO62RvcvVXnUQCdOlwLUGPEUdgMEju3EoZJwWJS3vuJ9",
 "session_id": "j0oYxE1-Qhzs6fBO-_1A3FDqi29ebG-_AJ_emhCNOmY.aTQpmg.pB-_QnxyQf1DD0AGT-dMtXF1LMe2uFIWEyqzvD0GjRmyWX7m3Dzdsmva7BXKJMcf3iPhGjbYFFNe_CvEcASCiQ"
 }
}

Without those cookies, the server will not authorise actual content to be returned to you. This is probably by design, since PyPI uses Fastly; read "Protecting Against Scrapers with Fastly Bot Management".
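As a rough illustration of the difference between the two requests above, here is a minimal sketch that picks out the extra cookies by name. The `_fs_ch_` prefix is only an assumption read off the two cookie names shown above; it is not a documented Fastly or PyPI contract and may change. The helper name `challenge_cookie_names` is made up for this example.

```python
# Assumption: the challenge cookies share this prefix, as in the names above.
CHALLENGE_PREFIX = "_fs_ch_"


def challenge_cookie_names(cookie_names):
    """Return the cookie names that look like challenge cookies, sorted."""
    return sorted(name for name in cookie_names
                  if name.startswith(CHALLENGE_PREFIX))


# Cookie names from the first request and from the later request shown above.
first_request = ["session_id"]
later_request = [
    "_fs_ch_cp_79UUvfpJ5mWYtLQv",
    "_fs_ch_st_FSBmUei20MqUiJb9",
    "session_id",
]

print(challenge_cookie_names(first_request))  # []
print(challenge_cookie_names(later_request))
```

With requests, the same check could be applied to a session after a request, for example `challenge_cookie_names(session.cookies.keys())`. Since requests does not run JavaScript, I would expect that list to stay empty, which matches the boilerplate response you are getting.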

Long story short, if you care about PyPI specifically, scraping it is going to be difficult and counter to the intent of the web host. If all you want is to learn how to perform simple scraping, choose a different site that does not require dynamic auth.

answered Dec 6 at 13:12

1 Comment

Thank you for the clarification! They likely started using Fastly after the lesson was written, as there wasn't any mention of that. Since I just need to learn the basics, I'm OK with using any other website.

It does have something to do with JavaScript.

In the response, you can see the following lines:

<span class="noscript-span">JavaScript is disabled in your browser.</span>
<p>Please enable JavaScript to proceed.</p>

This suggests that the PyPI website uses JavaScript to load its contents. The website expects the request to come from a browser, which receives exactly the HTML response you are seeing and then runs some JavaScript code to retrieve the actual contents.

Since you are sending the request and receiving the response through requests, you get the HTML response but not the actual contents, because the JavaScript code is never executed.
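If you want your script to notice this situation instead of silently opening no tabs, one option is to check the downloaded HTML for the placeholder. This is only a sketch using the standard library: it assumes the stub page contains an element with the class `noscript-span` (as in the lines quoted above) and that a real results page contains elements with the class `package-snippet` (the selector from the book's code). Both class names could change at any time, and `ClassCollector` and `is_js_placeholder` are names invented for this example.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def is_js_placeholder(html):
    """Return True if the page looks like the 'enable JavaScript' stub.

    Assumption: the stub has a 'noscript-span' element and no
    'package-snippet' elements; a real results page has the latter.
    """
    collector = ClassCollector()
    collector.feed(html)
    return ("noscript-span" in collector.classes
            and "package-snippet" not in collector.classes)
```

In your script you could call `is_js_placeholder(res.text)` right after `res.raise_for_status()` and print a warning when it returns True.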

Unfortunately, this is a limitation of using requests and bs4 for web scraping. To overcome this, you may want to look into Selenium.

answered Dec 6 at 5:57

1 Comment

Thanks, I'll now know to try Selenium when facing such issues.
