I've created a crawler that scrapes the name, phone number and web address of each profile from the Houzz website. I hope I did it the right way. Here is what I've written:
import requests
from lxml import html

url = "https://www.houzz.com/professionals/"

def scraper_func(mainurl):
    response = requests.get(mainurl).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//a[@class='sidebar-item-label']"):
        link = titles.xpath(".//@href")
        for item in link:
            paging_stuff(item)
# Done crawling links to the category from left-sided bar

def paging_stuff(process_links):
    response = requests.get(process_links).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//ul[@class='pagination']"):
        link = titles.xpath(".//a[@class='pageNumber']/@href")
        for item in link:
            processing_stuff(item)
# Going to each page to crawl the whole links spread through pagination connected to the profile page

def processing_stuff(procured_links):
    response = requests.get(procured_links).text
    tree = html.fromstring(response)
    for titles in tree.xpath("//div[@class='name-info']"):
        links = titles.xpath(".//a[@class='pro-title']/@href")[0]
        main_stuff(links)
# Going to the profile page of each link

def main_stuff(main_links):
    response = requests.get(main_links).text
    tree = html.fromstring(response)

    def if_exist(titles, xpath):
        info = titles.xpath(xpath)
        if info:
            return info[0]
        return ""

    for titles in tree.xpath("//div[@class='profile-cover']"):
        name = if_exist(titles, ".//a[@class='profile-full-name']/text()")
        phone = if_exist(titles, ".//a[contains(concat(' ', @class, ' '), ' click-to-call-link ')]/@phone")
        web = if_exist(titles, ".//a[@class='proWebsiteLink']/@href")
        print(name, phone, web)

scraper_func(url)
1 Answer
First of all, you should definitely re-use the same session for multiple requests to the same domain - it should result in a performance improvement:
if you're making several requests to the same host, the underlying TCP connection will be reused, which can result in a significant performance increase
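A minimal sketch of what that could look like here, assuming you keep the current module-level structure (the get_tree() helper is mine, not part of the original code):

import requests
from lxml import html

# One Session shared by all requests to the same host,
# so the underlying TCP connection can be reused between calls.
session = requests.Session()

def get_tree(url):
    """Fetch a page through the shared session and return a parsed lxml tree."""
    return html.fromstring(session.get(url).text)

# Each of the existing functions would then start with
#     tree = get_tree(some_url)
# instead of calling requests.get() directly.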
Other Improvements
- improve naming: you are over-reusing the item and titles variables. Instead, think of more appropriate and meaningful variable names. Also, I don't think the "_stuff" suffix contributes to the readability and ease of understanding of the program
- put the main script execution logic under if __name__ == '__main__': to avoid executing it on import
- you can avoid the inner loops and iterate directly over the hrefs here:

  for link in tree.xpath("//a[@class='sidebar-item-label']/@href"):
      paging_stuff(link)
And here:
for link in tree.xpath("//ul[@class='pagination']//a[@class='pageNumber']/@href"): processing_stuff(link)
- instead of putting comments before the functions, put them into appropriate docstrings
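  A sketch of what that could look like for the first function (the docstring wording is adapted from the existing comment):

  def scraper_func(mainurl):
      """Crawl the links to each category from the left-side bar."""
      response = requests.get(mainurl).text
      tree = html.fromstring(response)
      for link in tree.xpath("//a[@class='sidebar-item-label']/@href"):
          paging_stuff(link)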
Notes
Note that your solution is synchronous - you are processing URLs sequentially, one by one. If performance matters, consider looking into Scrapy.
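A minimal Scrapy spider covering the same first step might look roughly like this (the spider name and callback layout are illustrative, not taken from the original code):

import scrapy

class HouzzSpider(scrapy.Spider):
    name = "houzz"
    start_urls = ["https://www.houzz.com/professionals/"]

    def parse(self, response):
        # Follow each category link from the sidebar; Scrapy schedules
        # and downloads these requests concurrently.
        for href in response.xpath("//a[@class='sidebar-item-label']/@href").extract():
            yield scrapy.Request(response.urljoin(href), callback=self.parse_category)

    def parse_category(self, response):
        # Pagination and profile extraction would continue from here.
        pass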
- SIM (May 31, 2017 at 19:17): Thanks, sir alecxe, for your advice and suggestions. I've already used Scrapy to crawl sites like this. I wanted to make sure whether I can apply the approach I started here if need be. Btw, you once gave me a demo on how to use a session, and in that case the request was made only once. If I consider this example, it is hard for me to use a session, because when multiple requests are concerned I don't know how to deploy the session. Thanks again.
- alecxe (May 31, 2017 at 20:53): @SMth80 good. What do you mean by "deploy session"? Thanks.
- SIM (May 31, 2017 at 20:55): Thanks, sir, for your concern. I meant apply or use a session in multiple requests. Don't get me wrong for my linguistic difficulty.
- alecxe (May 31, 2017 at 20:57): @SMth80 sure, your English is great. In the simplest case, you can pass the session instance as an argument to every function and use session.get() instead of requests.get(). Though, having a class and a session class attribute would probably be better in terms of code organization. Thanks.
- alecxe (May 31, 2017 at 21:22): @SMth80 please consider posting this as a question with your current code and as many details as possible on SO. That way it would be easier to help and more people may potentially help - not just me here in the comments. Thank you for understanding!
1\$\begingroup\$ @SMth80 please consider posting this as a question with your current code and as maximum details as possible on SO. This way it would be easier to help and more people may potentially help - not just me here in comments. Thank you for understanding! \$\endgroup\$alecxe– alecxe2017年05月31日 21:22:49 +00:00Commented May 31, 2017 at 21:22