
Today I came across a tutorial made by ScrapingHub on Scrapy, about how it usually deals with a webpage while scraping its content. I could see that the same logic applied in Scrapy can be applied with regular methods in Python.

So, I tried to make one myself. Here is what I did: my scraper opens a webpage, parses the 10 tags from its right-hand sidebar, then tracks down each tag and follows its pagination, parsing the whole content. I hope I did it the right way. Here is the code:

import requests ; from lxml import html

core_link = "http://quotes.toscrape.com/"

def quotes_scraper(base_link):
    response = requests.get(base_link)
    tree = html.fromstring(response.text)
    for titles in tree.cssselect("span.tag-item a.tag"):
        processing_docs(core_link + titles.attrib['href'])

def processing_docs(base_link):
    response = requests.get(base_link).text
    root = html.fromstring(response)
    for soups in root.cssselect("div.quote"):
        quote = soups.cssselect("span.text")[0].text
        author = soups.cssselect("small.author")[0].text
        print(quote, author)
    next_page = root.cssselect("li.next a")[0].attrib['href'] if root.cssselect("li.next a") else ""
    if next_page:
        page_link = core_link + next_page
        processing_docs(page_link)

quotes_scraper(core_link)
asked Aug 31, 2017 at 12:31

3 Answers


Here are some of the things I would improve in the code:

  • put each import on a separate line; there is not much point in saving space in this case:

    from lxml import html
    import requests
    
  • as usual, and I think we've discussed it on CR already: re-use a requests.Session() instance to make your requests - this will help improve the speed of downloading the pages
  • see if you can avoid using div or span container tag names in your selectors - they are usually not relevant for your locators. For example, use .tag-item a.tag instead of span.tag-item a.tag
  • execute your quotes_scraper from the if __name__ == '__main__': block
  • better variable names could be used - e.g. tag would probably be better than titles, soups could be quote, quote could be quote_text and author could be quote_author. Every time you create a new variable, think carefully about what it represents - make sure that the next time you stumble upon it, you immediately know what it is about. (A sketch putting these suggestions together follows below.)
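
Putting those points together, here is a rough sketch of how the scraper might look (same site and function names as in the question; this is just one way to apply the suggestions, not a definitive rewrite):

from lxml import html

import requests

BASE_URL = "http://quotes.toscrape.com/"


def quotes_scraper(session, base_link):
    response = session.get(base_link)
    tree = html.fromstring(response.text)
    for tag in tree.cssselect(".tag-item a.tag"):
        processing_docs(session, BASE_URL + tag.attrib['href'])


def processing_docs(session, base_link):
    root = html.fromstring(session.get(base_link).text)
    for quote in root.cssselect(".quote"):
        quote_text = quote.cssselect("span.text")[0].text
        quote_author = quote.cssselect("small.author")[0].text
        print(quote_text, quote_author)

    # follow the pagination, if there is a "next" link on the page
    next_page = root.cssselect("li.next a")
    if next_page:
        processing_docs(session, BASE_URL + next_page[0].attrib['href'])


if __name__ == '__main__':
    with requests.Session() as session:
        quotes_scraper(session, BASE_URL)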
answered Aug 31, 2017 at 14:16
  • That CSS selectors can be written like this was foreign to me. I never learnt it from the source; I try to learn by studying how others write them and implementing that in a scraper. Btw, if you have time, just bring me up to date on these two terms: synchronous and asynchronous. I know what they mean according to the dictionary, but what they mean in programming languages is unknown to me. Thanks sir, for your descriptive review. Commented Aug 31, 2017 at 14:39
  • @Shahin you'll develop a sense of which locators are more reliable and which are not, which are more concise and readable and which are not, as you write more and more scrapers - and I can see that the quality of your scrapers has already improved over time. As far as sync vs async, I suggest you look through this topic, the Scrapy architecture and this guy :) Thanks. Commented Aug 31, 2017 at 14:46

Even though you are trying to mimic what a Scrapy spider might look like, there is a major high-level difference between how your code is executed and how a Scrapy spider is.

Scrapy is entirely asynchronous since it is based on the Twisted networking library, which makes the code operate in a non-blocking fashion; to quote the documentation:

Requests are scheduled and processed asynchronously. This means that Scrapy doesn’t need to wait for a request to be finished and processed, it can send another request or do other things in the meantime. This also means that other requests can keep going even if some request fails or an error happens while handling it.

But, in your case, you do go through the whole process in a blocking mode, processing requests one by one:

  1. Extract tag links from the main page
  2. Make a request to the next tag in the list and wait for the response
  3. Extract quotes and print them
  4. Extract the link to the next page and go back to step 2

Every time requests makes a request, you wait. Scrapy would do other things while waiting; that's the main high-level difference.
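
To make the contrast concrete, here is a minimal sketch of the non-blocking idea using a plain thread pool. This is not how Scrapy works internally (Scrapy uses Twisted); it only illustrates that several downloads can be in flight while others are being processed. The site and selectors are the ones from the question, and pagination is omitted for brevity:

from concurrent.futures import ThreadPoolExecutor

import requests
from lxml import html

BASE_URL = "http://quotes.toscrape.com/"


def scrape_tag(href):
    # Each call still blocks its own worker thread, but the pool keeps
    # several downloads in flight at the same time.
    root = html.fromstring(requests.get(BASE_URL + href, timeout=10).text)
    return [quote.cssselect("span.text")[0].text
            for quote in root.cssselect("div.quote")]


tree = html.fromstring(requests.get(BASE_URL, timeout=10).text)
hrefs = [a.attrib['href'] for a in tree.cssselect("span.tag-item a.tag")]

with ThreadPoolExecutor(max_workers=5) as pool:
    for quotes in pool.map(scrape_tag, hrefs):
        for quote in quotes:
            print(quote)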

And, by the way, if you do not assign priority values to your Scrapy requests (by default all requests have the same priority), there is no enforced order in which the requests are going to be processed.
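
For example, a hypothetical spider could bump the priority of the "next page" requests if the crawl order mattered to you (the default priority is 0, and higher-priority requests are scheduled earlier) - a sketch, not something your current script needs:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote span.text::text").extract():
            yield {"quote": quote}

        # Ask the scheduler to handle pagination before other queued requests.
        for href in response.css("li.next a::attr(href)").extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse,
                                 priority=10)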

answered Aug 31, 2017 at 14:08
  • I needed to know how Scrapy applies its logic. Right you are: even if there is an error, Scrapy doesn't get stuck there; rather, it goes ahead until its execution is exhausted. I never analyzed its architectural pattern or the way it moves through a site, so I judged it perfunctorily. You helped me learn at least a bit. Thanks sir. Commented Aug 31, 2017 at 14:32

Things to improve with requests library usage:

  • timeouts: without a timeout, your code may hang for minutes or more. source
  • User-Agent: some websites do not accept requests made with a bad User-Agent; e.g. the default requests User-Agent is 'python-requests/2.13.0'

Here is an example (the user agent and timeout values below are placeholders for you to fill in):

import requests

USER_AGENT = "my-scraper/1.0 (example value)"  # placeholder; pick something descriptive
REQUESTS_TIMEOUT = 10                          # seconds; placeholder value

r = requests.get("http://quotes.toscrape.com/",
                 headers={"User-Agent": USER_AGENT},
                 timeout=REQUESTS_TIMEOUT)
answered Sep 4, 2017 at 18:16
