
I've written a script in Python to scrape the app name, developer and price from the iTunes site.

On the first page there are 10 recommended app links, which my scraper follows to scrape the aforementioned fields from each linked page. The total number of apps should be 11, but previously the scraper printed only 10 results, skipping the content of the first page itself. So I added a line (marked with a comment so you can identify it) to make sure the first page's content is no longer skipped.

With this extra line, the scraper looks clumsy, although performance is unaffected: it runs smoothly and returns complete results. How can I make it cleaner and otherwise improve it?

This is what I've written:

import requests
from bs4 import BeautifulSoup

main_link = "https://itunes.apple.com/us/app/candy-crush-saga/id553834731?mt=8"

def get_links(url):
    response = requests.get(url)
    get_content(url)  # This is the very line which lets the scraper harvest the first page as well
    soup = BeautifulSoup(response.text, "html.parser")
    for title in soup.select("a.name"):
        get_content(title.get("href"))

def get_content(new_links):
    req = requests.get(new_links)
    broth = BeautifulSoup(req.text, "html.parser")
    item = {
        "app_name": broth.select("h1[itemprop=name]")[0].text,
        "developer": broth.select("div.left h2")[0].text,
        "price": broth.select("div.price")[0].text
    }
    print(item)

get_links(main_link)
asked Aug 28, 2017 at 13:03

3 Answers


Instead of having a get_content function, you could add a function that only parses a source passed to it; it can then handle either the main page or the suggested apps. Note that you're requesting the main page's content twice even though you already have it.

In addition to the above, you could make a few more improvements:

  • Make the code PEP 8 compliant. Currently you will see a decent number of issues if you run it through http://pep8online.com/.

  • Use a session to re-use connections. Since we are making requests to the same host, this will speed up the requests. From the docs:

    The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance, and will use urllib3's connection pooling. So if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase (see HTTP persistent connection).

  • Use an if __name__ == '__main__' guard to prevent your code from running when it is imported as a module.

After making the changes your code may look like this:

import requests
from bs4 import BeautifulSoup

session = requests.Session()


def get_links(url):
    source = session.get(url).text
    main_app = parse_content(source)
    print(main_app)
    for linked_app in get_linked_app_links(source):
        print(linked_app)


def get_linked_app_links(source):
    soup = BeautifulSoup(source, "html.parser")
    for title in soup.select("a.name"):
        linked_app = get_app_data(title.get("href"))
        yield linked_app


def get_app_data(url):
    source = session.get(url).text
    return parse_content(source)


def parse_content(source):
    broth = BeautifulSoup(source, "html.parser")
    item = {
        "app_name": broth.select("h1[itemprop=name]")[0].text,
        "developer": broth.select("div.left h2")[0].text,
        "price": broth.select("div.price")[0].text
    }
    return item


if __name__ == '__main__':
    get_links("https://itunes.apple.com/us/app/candy-crush-saga/id553834731?mt=8")
answered Aug 28, 2017 at 14:45
  • Thanks a lot for your descriptive review, Ashwini. You write code with an analytical approach and it's not easy for me to understand everything at a glance. Sorry for the delayed response. Commented Aug 28, 2017 at 15:43

I don't see anything wrong with scraping the main page first and then, in the for loop, the linked ones.

Anyway, your code doesn't follow PEP 8 - Style Guide for Python Code or the DRY principle (Don't Repeat Yourself).

As it is not very long, I suggest changing it to something like:

import requests
from bs4 import BeautifulSoup

MAIN_LINK = "https://itunes.apple.com/us/app/candy-crush-saga/id553834731?mt=8"
TAGS = {"app_name": "h1[itemprop=name]",
        "developer": "div.left h2",
        "price": "div.price"}


def get_links(url):
    response = requests.get(url)
    get_content(url)
    soup = BeautifulSoup(response.text, "html.parser")
    for title in soup.select("a.name"):
        get_content(title.get("href"))


def get_content(url):
    req = requests.get(url)
    broth = BeautifulSoup(req.text, "html.parser")
    item = {key: broth.select(tag)[0].text for key, tag in TAGS.items()}
    print(item)


get_links(MAIN_LINK)

Notes:

  1. I changed the name of the parameter of your get_content() function to url.
  2. Constant names are written in all uppercase letters.
  3. Conformance with the already mentioned PEP 8 style guide (2 blank lines between function definitions, recommended formatting of multi-line dictionaries, and so on).
  4. Instead of repeating similar code 3 times (broth.select(...)[0].text), I created a (constant) dictionary TAGS and derive the dictionary item from it (with the help of a dict comprehension).
answered Aug 28, 2017 at 15:02

Adding to Ashwini's answer, here are some more things to improve:

The main performance bottleneck here is, of course, that your script operates in a blocking manner: it does not advance to the next page until the current one is done, so waiting on the network blocks everything else from executing. If performance is critical, look into asynchronous solutions such as Scrapy.
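Short of a full Scrapy rewrite, a thread pool is a lightweight way to stop one slow page from blocking the rest. A minimal sketch under my own naming (the fetch_all helper and the stand-in fetcher are illustrative, not part of the original script); in the real scraper, fetch would wrap session.get(url).text plus the BeautifulSoup parsing:

```python
# A thread-based middle ground between the blocking script and Scrapy:
# fetch the linked app pages concurrently instead of one at a time.
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=5):
    """Apply `fetch` to every URL on a small thread pool,
    preserving the input order in the returned list."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


# Stand-in fetcher for demonstration; a real one would do
# session.get(url).text and parse it with BeautifulSoup.
results = fetch_all(["/app/1", "/app/2"], lambda url: "fetched " + url)
print(results)  # ['fetched /app/1', 'fetched /app/2']
```

Executor.map keeps the results in the same order as the input URLs, so the printed output stays deterministic even though the requests overlap.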

answered Aug 28, 2017 at 14:51
  • Thanks, alecxe, for the input. You always come up with something new. .select_one() looks smarter as well. Commented Aug 28, 2017 at 15:12
  • By the way, alecxe, is there anything similar to select_one() if I go with lxml? So far I have used .cssselect() and [0]. Thanks. Commented Aug 28, 2017 at 15:59
  • @Mithu I don't recall anything like this, but it should be straightforward to implement something similar on your own. Thanks. Commented Aug 28, 2017 at 16:03
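Regarding the lxml question in the comments: a select_one() analogue is just a thin wrapper around cssselect() that returns the first match, or None when nothing matches, instead of raising IndexError on [0]. A sketch (the name cssselect_one is my own, not part of lxml; the stub class stands in for an lxml element so the example runs without lxml installed — with lxml you would pass html.fromstring(page_source) instead):

```python
# A select_one()-style helper for lxml, as discussed in the comments:
# return the first CSS match or None instead of indexing with [0].
def cssselect_one(tree, selector):
    """Return the first element matching `selector`, or None."""
    matches = tree.cssselect(selector)
    return matches[0] if matches else None


# Tiny stand-in for an lxml element, purely for demonstration;
# a real lxml HtmlElement exposes the same .cssselect(selector) method.
class FakeTree:
    def cssselect(self, selector):
        return ["<h1>"] if selector == "h1.name" else []


print(cssselect_one(FakeTree(), "h1.name"))    # <h1>
print(cssselect_one(FakeTree(), "div.price"))  # None
```

Returning None instead of raising lets the caller decide how to handle a missing field, e.g. `node.text if node is not None else "N/A"`.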
