
I am trying to learn Python 3.x so that I can scrape websites. People have recommended that I use Beautiful Soup 4 or lxml.html. Could someone point me in the right direction for tutorials or examples of BeautifulSoup with Python 3.x?

Thank you for your help.

jamylak
asked May 28, 2013 at 1:52
  • If you want to do web scraping, use Python 2. Scrapy is by far the best web scraping framework for Python and has no 3.x equivalent. Commented May 28, 2013 at 1:55

1 Answer


I've actually just written a full guide on web scraping that includes some sample code in Python. I wrote and tested it on Python 2.7, but both of the packages I used (requests and BeautifulSoup) are fully compatible with Python 3 according to the Wall of Shame.

Here's some code to get you started with web scraping in Python:

import sys

import requests
from bs4 import BeautifulSoup  # on Python 3, the package is "bs4", not "BeautifulSoup"


def scrape_google(keyword):
    # dynamically build the URL that we'll be making a request to
    url = "http://www.google.com/search?q={term}".format(
        term=keyword.strip().replace(" ", "+"),
    )
    # spoof some headers so the request appears to be coming from a browser, not a bot
    headers = {
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5)",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "accept-charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
        "accept-encoding": "gzip,deflate,sdch",
        "accept-language": "en-US,en;q=0.8",
    }
    # make the request to the search URL, passing in the spoofed headers
    r = requests.get(url, headers=headers)  # assign the response to a variable r
    # check the status code of the response to make sure the request went well
    if r.status_code != 200:
        print("request denied")
        return
    else:
        print("scraping " + url)
    # convert the plaintext HTML markup into a DOM-like structure that we can search
    soup = BeautifulSoup(r.text, "html.parser")
    # each result is an <li> element with class="g"; this is our wrapper
    results = soup.find_all("li", "g")  # find_all is the bs4 spelling of findAll
    # iterate over each of the result wrapper elements
    for result in results:
        # the main link is an <h3> element with class="r"
        result_anchor = result.find("h3", "r").find("a")
        # print out each link in the results
        print(result_anchor.contents)


if __name__ == "__main__":
    # you can pass in a keyword to search for when you run the script
    # by default, we'll search for the "web scraping" keyword
    try:
        keyword = sys.argv[1]
    except IndexError:
        keyword = "web scraping"
    scrape_google(keyword)
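If you want to try BeautifulSoup itself before hitting a live site, here's a minimal offline sketch of the same extraction logic. The HTML snippet and URLs below are made up for illustration; it just mimics the li class="g" / h3 class="r" structure the script above searches for:

```python
from bs4 import BeautifulSoup

# a hypothetical results page, stripped down to the structure we care about
html = """
<ul>
  <li class="g"><h3 class="r"><a href="https://example.com">Example</a></h3></li>
  <li class="g"><h3 class="r"><a href="https://example.org">Example Org</a></h3></li>
</ul>
"""

# "html.parser" is the parser bundled with Python, so no extra install is needed
soup = BeautifulSoup(html, "html.parser")

# pull the href out of the anchor inside each result wrapper
links = [li.find("h3", "r").find("a")["href"] for li in soup.find_all("li", "g")]
print(links)
```

Running this prints the two hrefs from the snippet, which is a quick way to check your selectors before pointing the scraper at real pages.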

If you just want to learn more about Python 3 in general and are already familiar with Python 2.x, then this article on transitioning from Python 2 to Python 3 might be helpful.

answered Aug 5, 2013 at 1:37