3
\$\begingroup\$

I've created a crawler that scrapes the name, phone number, and web address of each profile from the Houzz website. I hope I did it the right way. Here is what I've written:

import requests
from lxml import html

url = "https://www.houzz.com/professionals/"

def scraper_func(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//a[@class='sidebar-item-label']"):
        link = titles.xpath(".//@href")
        for item in link:
            paging_stuff(item)
# Done crawling links to the category from left-sided bar

def paging_stuff(process_links):
    response = requests.get(process_links).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//ul[@class='pagination']"):
        link = titles.xpath(".//a[@class='pageNumber']/@href")
        for item in link:
            processing_stuff(item)
# Going to each page to crawl the whole links spread through pagination connected to the profile page

def processing_stuff(procured_links):
    response = requests.get(procured_links).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//div[@class='name-info']"):
        links = titles.xpath(".//a[@class='pro-title']/@href")[0]
        main_stuff(links)
# Going to the profile page of each link

def main_stuff(main_links):
    response = requests.get(main_links).text
    tree = html.fromstring(response)

    def if_exist(titles, xpath):
        info = titles.xpath(xpath)
        if info:
            return info[0]
        return ""

    for titles in tree.xpath("//div[@class='profile-cover']"):
        name = if_exist(titles, ".//a[@class='profile-full-name']/text()")
        phone = if_exist(titles, ".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
        web = if_exist(titles, ".//a[@class='proWebsiteLink']/@href")
        print(name, phone, web)

scraper_func(url)
asked May 31, 2017 at 17:06
\$\endgroup\$

1 Answer

4
\$\begingroup\$

First of all, you should definitely reuse the same session for multiple requests to the same domain, as it should improve performance. From the requests documentation on session objects:

if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase
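
A minimal sketch of what that could look like for this crawler, assuming a single requests.Session is created up front and handed to each function. The helper names fetch_tree, category_links and crawl are hypothetical and not part of the original code; only the XPath expression comes from the question, and hrefs are assumed to be absolute URLs, as the original code already assumes.

```python
import requests
from lxml import html


def fetch_tree(session, url):
    """Download url through the shared session and parse the HTML."""
    response = session.get(url)
    return html.fromstring(response.text)


def category_links(session, url):
    """Return the category hrefs from the left sidebar of the page at url."""
    tree = fetch_tree(session, url)
    return tree.xpath("//a[@class='sidebar-item-label']/@href")


def crawl(start_url):
    """Create one session and reuse it for every request of the crawl."""
    with requests.Session() as session:
        pages = []
        for link in category_links(session, start_url):
            # Same session object, so the connection to the host can be reused.
            pages.append((link, fetch_tree(session, link)))
        return pages
```

Every function takes the session as its first argument and calls session.get() where the original code called requests.get(); nothing else about the parsing has to change.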

Other Improvements

  • improve naming: you are overusing the item and titles variable names. Instead, think of more appropriate and meaningful names. Also, I don't think the "_stuff" suffix contributes to the readability and ease of understanding of the program
  • put the main script execution logic under if __name__ == '__main__': to avoid executing it on import

  • you can avoid inner loops and iterate directly over hrefs here:

    for link in tree.xpath("//a[@class='sidebar-item-label']/@href"):
        paging_stuff(link)
    

    And here:

    for link in tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href"):
        processing_stuff(link)
    
  • instead of putting comments before the functions, put them into appropriate docstrings
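
Taken together, the naming, docstring and __main__ guard suggestions might look roughly like the sketch below. The function names and the inline SAMPLE_PAGE are made up for illustration so that the snippet runs without network access; a real script would download the page instead.

```python
from lxml import html

# Hypothetical stand-in for a downloaded category page.
SAMPLE_PAGE = """
<html><body>
<ul class="pagination">
  <li><a class="pageNumber" href="https://example.com/p/15">2</a></li>
  <li><a class="pageNumber" href="https://example.com/p/30">3</a></li>
</ul>
</body></html>
"""


def extract_page_links(tree):
    """Return the URL of every numbered page in the pagination bar."""
    return tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href")


def main():
    """Parse the sample page and print each pagination link."""
    tree = html.fromstring(SAMPLE_PAGE)
    for page_url in extract_page_links(tree):
        print(page_url)


if __name__ == '__main__':
    main()
```

Importing this module from another script only defines the functions; main() runs only when the file is executed directly.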

Notes

Note that your solution is synchronous: you are processing URLs sequentially, one by one. If performance matters, consider looking into Scrapy.
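
Scrapy takes care of concurrency for you. If pulling in a framework is not an option, a thread pool from the standard library is one lighter-weight way to overlap the waiting time of several requests. To be clear, this is a swapped-in alternative rather than Scrapy itself, and fetch_all and its fetch parameter are hypothetical names used only for this sketch:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(urls, fetch, max_workers=8):
    """Call fetch(url) for every URL concurrently; results keep the input order.

    fetch is any callable that takes a URL and returns the page content,
    for example a small wrapper around requests.get(url).text.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # executor.map yields results in the same order as urls.
        return list(executor.map(fetch, urls))
```

The crawl logic stays the same; only the loop that visits the links one by one is replaced by a single fetch_all call per level of the crawl.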

answered May 31, 2017 at 17:42
\$\endgroup\$
  • \$\begingroup\$ Thanks, sir alecxe, for your advice and suggestions. I've already used Scrapy to crawl sites like this. I wanted to make sure I can apply the approach I started here if need be. By the way, you once gave me a demo on how to use a session, and in that case the request was made only once. In this example it is hard for me to use a session, because when multiple requests are involved I don't know how to deploy session. Thanks again. \$\endgroup\$ Commented May 31, 2017 at 19:17
  • \$\begingroup\$ @SMth80 good. What do you mean by "deploy session"? Thanks. \$\endgroup\$ Commented May 31, 2017 at 20:53
  • \$\begingroup\$ Thanks, sir, for your concern. I meant applying or using a session across multiple requests. Please excuse my linguistic difficulty. \$\endgroup\$ Commented May 31, 2017 at 20:55
  • \$\begingroup\$ @SMth80 Sure, your English is great. In the simplest case, you can pass the session instance as an argument to every function and use session.get() instead of requests.get(). That said, having a class with the session as a class attribute would probably be better in terms of code organization. Thanks. \$\endgroup\$ Commented May 31, 2017 at 20:57
  • \$\begingroup\$ @SMth80 please consider posting this as a question on SO, with your current code and as much detail as possible. That way it would be easier to help, and more people could potentially help, not just me here in the comments. Thank you for understanding! \$\endgroup\$ Commented May 31, 2017 at 21:22
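
The class-based organization mentioned in the comments could be sketched roughly as follows. ProfileCrawler and its method names are hypothetical, the session is stored on the instance here, and the XPath is adapted from the question (it returns every matching profile link rather than the first one per block). Hrefs are assumed to be absolute URLs.

```python
import requests
from lxml import html


class ProfileCrawler:
    """Crawler that sends every request through one shared session."""

    def __init__(self, session=None):
        # A caller may inject a session (handy for testing); otherwise create one.
        self.session = session or requests.Session()

    def get_tree(self, url):
        """Fetch url with the shared session and return the parsed HTML tree."""
        return html.fromstring(self.session.get(url).text)

    def profile_links(self, listing_url):
        """Return the profile URLs found on one listing page."""
        tree = self.get_tree(listing_url)
        return tree.xpath("//div[@class='name-info']//a[@class='pro-title']/@href")
```

Each method reaches the session through self.session, so no function signature has to carry it around explicitly.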
