
I am trying to learn Python 3.x so that I can scrape websites. People have recommended that I use Beautiful Soup 4 or lxml.html. Could someone point me in the right direction for tutorials or examples of BeautifulSoup with Python 3.x?

Thank you for your help.

jamylak
asked May 28, 2013 at 1:52
  • If you want to do web scraping, use Python 2. Scrapy is by far the best web scraping framework for Python and has no 3.x equivalent. Commented May 28, 2013 at 1:55

1 Answer


I've actually just written a full guide on web scraping that includes some sample code in Python. I wrote and tested it on Python 2.7, but both of the packages I used (requests and BeautifulSoup) are fully compatible with Python 3 according to the Wall of Shame.

Here's some code to get you started with web scraping in Python:

import sys

import requests
from bs4 import BeautifulSoup  # on Python 3, the package is "bs4", not "BeautifulSoup"


def scrape_google(keyword):
    # dynamically build the URL that we'll be making a request to
    url = "http://www.google.com/search?q={term}".format(
        term=keyword.strip().replace(" ", "+"),
    )
    # spoof some headers so the request appears to be coming from a browser, not a bot
    headers = {
        "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5)",
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "accept-charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
        "accept-encoding": "gzip,deflate,sdch",
        "accept-language": "en-US,en;q=0.8",
    }
    # make the request to the search URL, passing in the spoofed headers
    r = requests.get(url, headers=headers)  # assign the response to a variable r
    # check the status code of the response to make sure the request went well
    if r.status_code != 200:
        print("request denied")
        return
    else:
        print("scraping " + url)
    # convert the plaintext HTML markup into a DOM-like structure that we can search
    soup = BeautifulSoup(r.text, "html.parser")
    # each result is an <li> element with class="g"; this is our wrapper
    results = soup.find_all("li", "g")  # find_all is the bs4 spelling of findAll
    # iterate over each of the result wrapper elements
    for result in results:
        # the main link is an <h3> element with class="r"
        result_anchor = result.find("h3", "r").find("a")
        # print out each link in the results
        print(result_anchor.contents)


if __name__ == "__main__":
    # you can pass in a keyword to search for when you run the script
    # by default, we'll search for the "web scraping" keyword
    try:
        keyword = sys.argv[1]
    except IndexError:
        keyword = "web scraping"
    scrape_google(keyword)
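If you want to try BeautifulSoup itself before hitting a live site, here's a minimal offline sketch of the same extraction logic. The HTML snippet and URLs below are made up for illustration; it just mimics the li class="g" / h3 class="r" structure the script above searches for:

```python
from bs4 import BeautifulSoup

# a hypothetical results page, stripped down to the structure we care about
html = """
<ul>
  <li class="g"><h3 class="r"><a href="https://example.com">Example</a></h3></li>
  <li class="g"><h3 class="r"><a href="https://example.org">Example Org</a></h3></li>
</ul>
"""

# "html.parser" is the parser bundled with Python, so no extra install is needed
soup = BeautifulSoup(html, "html.parser")

# pull the href out of the anchor inside each result wrapper
links = [li.find("h3", "r").find("a")["href"] for li in soup.find_all("li", "g")]
print(links)
```

Running this prints the two hrefs from the snippet, which is a quick way to check your selectors before pointing the scraper at real pages.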

If you just want to learn more about Python 3 in general and are already familiar with Python 2.x, then this article on transitioning from Python 2 to Python 3 might be helpful.

answered Aug 5, 2013 at 1:37