
I've written a script in Python to scrape the app name, developer and price from the iTunes site.

On the first page there are 10 recommended app links, which my scraper follows to scrape the aforementioned fields from each linked page. The total number of apps should be 11, but previously the scraper printed only 10 results, skipping the content of the first page itself. So I added a line (marked with a comment so you can identify it) to make sure the first page's content is no longer skipped.

With this extra line, the scraper looks clumsy, although performance is unaffected: it runs smoothly and returns complete results. How can I make it cleaner and otherwise improve it?

This is what I've written:

import requests
from bs4 import BeautifulSoup

main_link = "https://itunes.apple.com/us/app/candy-crush-saga/id553834731?mt=8"

def get_links(url):
    response = requests.get(url)
    get_content(url)  # This is the very line which lets the scraper harvest the first page as well
    soup = BeautifulSoup(response.text, "html.parser")
    for title in soup.select("a.name"):
        get_content(title.get("href"))

def get_content(new_links):
    req = requests.get(new_links)
    broth = BeautifulSoup(req.text, "html.parser")
    item = {
        "app_name": broth.select("h1[itemprop=name]")[0].text,
        "developer": broth.select("div.left h2")[0].text,
        "price": broth.select("div.price")[0].text
    }
    print(item)

get_links(main_link)
asked Aug 28, 2017 at 13:03

3 Answers


Instead of having a get_content function, you could add a function that only parses a source passed to it; it can then handle either the main page or the suggested apps. Note that you're requesting the main page's content twice even though you already have it.

In addition to the above, you could make a few more improvements:

  • Make the code PEP 8 compliant. Currently you will see a decent number of issues if you run it through http://pep8online.com/.

  • Use a session to re-use connections. Since we are making requests to the same host, this will speed up the requests. From the docs:

    The Session object allows you to persist certain parameters across requests. It also persists cookies across all requests made from the Session instance, and will use urllib3's connection pooling. So if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase (see HTTP persistent connection).

  • Use an if __name__ == '__main__' guard to prevent your code from running when it is imported as a module.

After making the changes your code may look like this:

import requests
from bs4 import BeautifulSoup

session = requests.Session()


def get_links(url):
    source = session.get(url).text
    main_app = parse_content(source)
    print(main_app)
    for linked_app in get_linked_app_links(source):
        print(linked_app)


def get_linked_app_links(source):
    soup = BeautifulSoup(source, "html.parser")
    for title in soup.select("a.name"):
        linked_app = get_app_data(title.get("href"))
        yield linked_app


def get_app_data(url):
    source = session.get(url).text
    return parse_content(source)


def parse_content(source):
    broth = BeautifulSoup(source, "html.parser")
    item = {
        "app_name": broth.select("h1[itemprop=name]")[0].text,
        "developer": broth.select("div.left h2")[0].text,
        "price": broth.select("div.price")[0].text
    }
    return item


if __name__ == '__main__':
    get_links("https://itunes.apple.com/us/app/candy-crush-saga/id553834731?mt=8")
answered Aug 28, 2017 at 14:45
  • Thanks a lot for your descriptive review, Ashwini. You write code with an analytical approach and it's not easy for me to understand everything at a glance. Sorry for the delayed response. Commented Aug 28, 2017 at 15:43

I don't see anything wrong with scraping the main page first and then, in the for loop, the linked ones.

Anyway, your code doesn't follow PEP 8 - Style Guide for Python Code or the DRY principle (Don't Repeat Yourself).

As it is not very long, I suggest changing it to something like:

import requests
from bs4 import BeautifulSoup

MAIN_LINK = "https://itunes.apple.com/us/app/candy-crush-saga/id553834731?mt=8"
TAGS = {"app_name": "h1[itemprop=name]",
        "developer": "div.left h2",
        "price": "div.price"}


def get_links(url):
    response = requests.get(url)
    get_content(url)
    soup = BeautifulSoup(response.text, "html.parser")
    for title in soup.select("a.name"):
        get_content(title.get("href"))


def get_content(url):
    req = requests.get(url)
    broth = BeautifulSoup(req.text, "html.parser")
    item = {key: broth.select(tag)[0].text for key, tag in TAGS.items()}
    print(item)


get_links(MAIN_LINK)

Notes:

  1. I changed the name of the parameter of your get_content() function to url.
  2. Constant names are written in all uppercase letters.
  3. Conformance with the already mentioned PEP 8 style guide (2 blank lines between function definitions, recommended formatting of multi-line dictionaries, and so on).
  4. Instead of repeating similar code 3 times (broth.select(...)[0].text), I created a (constant) dictionary TAGS and derive the dictionary item from it (with the help of a dict comprehension).
answered Aug 28, 2017 at 15:02

Adding to Ashwini's answer, here are some more things to improve:

The main performance bottleneck here is, of course, that your script operates in a blocking manner: it does not advance to the next page until the current one is done, so waiting on the network blocks everything else from executing. If performance is critical, look into asynchronous solutions such as Scrapy.
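Short of a full Scrapy rewrite, a thread pool is a lightweight way to stop one slow page from blocking the rest. A minimal sketch under my own naming (the fetch_all helper and the stand-in fetcher are illustrative, not part of the original script); in the real scraper, fetch would wrap session.get(url).text plus the BeautifulSoup parsing:

```python
# A thread-based middle ground between the blocking script and Scrapy:
# fetch the linked app pages concurrently instead of one at a time.
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=5):
    """Apply `fetch` to every URL on a small thread pool,
    preserving the input order in the returned list."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


# Stand-in fetcher for demonstration; a real one would do
# session.get(url).text and parse it with BeautifulSoup.
results = fetch_all(["/app/1", "/app/2"], lambda url: "fetched " + url)
print(results)  # ['fetched /app/1', 'fetched /app/2']
```

Executor.map keeps the results in the same order as the input URLs, so the printed output stays deterministic even though the requests overlap.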

answered Aug 28, 2017 at 14:51
  • Thanks, alecxe, for the input. You always come up with something new. .select_one() looks smarter as well. Commented Aug 28, 2017 at 15:12
  • By the way, alecxe, is there anything similar to select_one() if I go with lxml? So far I have used .cssselect() and [0]. Thanks. Commented Aug 28, 2017 at 15:59
  • @Mithu I don't recall anything like this, but it should be straightforward to implement something similar on your own. Thanks. Commented Aug 28, 2017 at 16:03
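Regarding the lxml question in the comments: a select_one() analogue is just a thin wrapper around cssselect() that returns the first match, or None when nothing matches, instead of raising IndexError on [0]. A sketch (the name cssselect_one is my own, not part of lxml; the stub class stands in for an lxml element so the example runs without lxml installed — with lxml you would pass html.fromstring(page_source) instead):

```python
# A select_one()-style helper for lxml, as discussed in the comments:
# return the first CSS match or None instead of indexing with [0].
def cssselect_one(tree, selector):
    """Return the first element matching `selector`, or None."""
    matches = tree.cssselect(selector)
    return matches[0] if matches else None


# Tiny stand-in for an lxml element, purely for demonstration;
# a real lxml HtmlElement exposes the same .cssselect(selector) method.
class FakeTree:
    def cssselect(self, selector):
        return ["<h1>"] if selector == "h1.name" else []


print(cssselect_one(FakeTree(), "h1.name"))    # <h1>
print(cssselect_one(FakeTree(), "div.price"))  # None
```

Returning None instead of raising lets the caller decide how to handle a missing field, e.g. `node.text if node is not None else "N/A"`.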
