I've created a crawler that scrapes the name, phone number and web address of each profile from the Houzz website. I hope I did it the right way. Here is what I've written:
import requests
from lxml import html

url = "https://www.houzz.com/professionals/"

def scraper_func(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//a[@class='sidebar-item-label']"):
        link = titles.xpath(".//@href")
        for item in link:
            paging_stuff(item)
# Done crawling links to the category from left-sided bar

def paging_stuff(process_links):
    response = requests.get(process_links).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//ul[@class='pagination']"):
        link = titles.xpath(".//a[@class='pageNumber']/@href")
        for item in link:
            processing_stuff(item)
# Going to each page to crawl the whole links spread through pagination connected to the profile page

def processing_stuff(procured_links):
    response = requests.get(procured_links).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//div[@class='name-info']"):
        links = titles.xpath(".//a[@class='pro-title']/@href")[0]
        main_stuff(links)
# Going to the profile page of each link

def main_stuff(main_links):
    response = requests.get(main_links).text
    tree = html.fromstring(response)

    def if_exist(titles, xpath):
        info = titles.xpath(xpath)
        if info:
            return info[0]
        return ""

    for titles in tree.xpath("//div[@class='profile-cover']"):
        name = if_exist(titles, ".//a[@class='profile-full-name']/text()")
        phone = if_exist(titles, ".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
        web = if_exist(titles, ".//a[@class='proWebsiteLink']/@href")
        print(name, phone, web)

scraper_func(url)
1 Answer
First of all, you should definitely re-use the same session for multiple requests to the same domain - it should result in a performance improvement:
if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase
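A minimal sketch of what that could look like here, assuming you keep the current module-level structure (the get_tree() helper is mine, not part of the original code):

import requests
from lxml import html

# One Session shared by all requests to the same host,
# so the underlying TCP connection can be reused between calls.
session = requests.Session()

def get_tree(url):
    """Fetch a page through the shared session and return a parsed lxml tree."""
    return html.fromstring(session.get(url).text)

# Each of the existing functions would then start with
#     tree = get_tree(some_url)
# instead of calling requests.get() directly.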
Other Improvements
- improve naming: you are over-reusing the item and titles variables. Instead, think of more appropriate and meaningful variable names. Also, I don't think the "_stuff" suffix contributes to the readability and ease of understanding of the program
- put the main script execution logic under if __name__ == '__main__': to avoid executing it on import
- you can avoid the inner loops and iterate directly over the hrefs here:

  for link in tree.xpath("//a[@class='sidebar-item-label']/@href"):
      paging_stuff(link)
And here:
for link in tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href"): processing_stuff(link)
- instead of putting comments before the functions, put them into appropriate docstrings
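  A sketch of what that could look like for the first function (the docstring wording is adapted from the existing comment):

  def scraper_func(mainurl):
      """Crawl the links to each category from the left-side bar."""
      response = requests.get(mainurl).text
      tree = html.fromstring(response)
      for link in tree.xpath("//a[@class='sidebar-item-label']/@href"):
          paging_stuff(link)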
Notes
Note that your solution is synchronous - you are processing URLs sequentially, one by one. If performance matters, consider looking into Scrapy.
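A minimal Scrapy spider covering the same first step might look roughly like this (the spider name and callback layout are illustrative, not taken from the original code):

import scrapy

class HouzzSpider(scrapy.Spider):
    name = "houzz"
    start_urls = ["https://www.houzz.com/professionals/"]

    def parse(self, response):
        # Follow each category link from the sidebar; Scrapy schedules
        # and downloads these requests concurrently.
        for href in response.xpath("//a[@class='sidebar-item-label']/@href").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_category)

    def parse_category(self, response):
        # Pagination and profile extraction would continue from here.
        pass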
- SIM (May 31, 2017 at 19:17): Thanks, sir alecxe, for your advice and suggestions. I've already used Scrapy to crawl sites like this. I wanted to make sure whether I can apply the approach I started here if need be. Btw, you once gave me a demo on how to use a session, and in that case the request was made only once. If I consider this example, it is hard for me to use a session, because when multiple requests are concerned I don't know how to deploy the session. Thanks again.
- alecxe (May 31, 2017 at 20:53): @SMth80 good. What do you mean by "deploy session"? Thanks.
- SIM (May 31, 2017 at 20:55): Thanks, sir, for your concern. I meant apply or use a session in multiple requests. Don't get me wrong for my linguistic difficulty.
- alecxe (May 31, 2017 at 20:57): @SMth80 sure, your English is great. In the simplest case, you can pass the session instance as an argument to every function and use session.get() instead of requests.get(). Though, having a class and a session class attribute would probably be better in terms of code organization. Thanks.
- alecxe (May 31, 2017 at 21:22): @SMth80 please consider posting this as a question with your current code and as many details as possible on SO. That way it would be easier to help and more people may potentially help - not just me here in the comments. Thank you for understanding!
1\$\begingroup\$ @SMth80 please consider posting this as a question with your current code and as maximum details as possible on SO. This way it would be easier to help and more people may potentially help - not just me here in comments. Thank you for understanding! \$\endgroup\$alecxe– alecxe2017年05月31日 21:22:49 +00:00Commented May 31, 2017 at 21:22