
Today I came across a tutorial made by ScrapingHub on Scrapy, about how it usually deals with a webpage while scraping its content. I could see that the same logic applied in Scrapy can be applied with regular methods in Python.

So, I tried to make one myself. Here is what I did: my scraper opens a webpage, parses the 10 tags from its right-hand sidebar, then tracks down each tag and follows its pagination, parsing the whole content. I hope I did it the right way. Here is the code:

import requests ; from lxml import html

core_link = "http://quotes.toscrape.com/"

def quotes_scraper(base_link):
    response = requests.get(base_link)
    tree = html.fromstring(response.text)
    for titles in tree.cssselect("span.tag-item a.tag"):
        processing_docs(core_link + titles.attrib['href'])

def processing_docs(base_link):
    response = requests.get(base_link).text
    root = html.fromstring(response)
    for soups in root.cssselect("div.quote"):
        quote = soups.cssselect("span.text")[0].text
        author = soups.cssselect("small.author")[0].text
        print(quote, author)
    next_page = root.cssselect("li.next a")[0].attrib['href'] if root.cssselect("li.next a") else ""
    if next_page:
        page_link = core_link + next_page
        processing_docs(page_link)

quotes_scraper(core_link)
asked Aug 31, 2017 at 12:31

3 Answers


Here are some of the things I would improve in the code:

  • put each import on a separate line; there is not much point in saving space in this case:

    from lxml import html
    import requests
    
  • as usual, and I think we've discussed it on CR already: re-use a requests.Session() instance to make your requests - this will help improve the speed of downloading the pages
  • see if you can avoid using div or span container tag names in your selectors - they are usually not relevant for your locators. For example, use .tag-item a.tag instead of span.tag-item a.tag
  • execute your quotes_scraper from the if __name__ == '__main__': block
  • better variable names could be used - e.g. tag would probably be better than titles, soups could be quote, quote could be quote_text and author could be quote_author. Every time you create a new variable, think carefully about what it represents - make sure that the next time you stumble upon it, you immediately know what it is about. (A sketch putting these suggestions together follows below.)
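
Putting those points together, here is a rough sketch of how the scraper might look (same site and function names as in the question; this is just one way to apply the suggestions, not a definitive rewrite):

from lxml import html

import requests

BASE_URL = "http://quotes.toscrape.com/"


def quotes_scraper(session, base_link):
    response = session.get(base_link)
    tree = html.fromstring(response.text)
    for tag in tree.cssselect(".tag-item a.tag"):
        processing_docs(session, BASE_URL + tag.attrib['href'])


def processing_docs(session, base_link):
    root = html.fromstring(session.get(base_link).text)
    for quote in root.cssselect(".quote"):
        quote_text = quote.cssselect("span.text")[0].text
        quote_author = quote.cssselect("small.author")[0].text
        print(quote_text, quote_author)

    # follow the pagination, if there is a "next" link on the page
    next_page = root.cssselect("li.next a")
    if next_page:
        processing_docs(session, BASE_URL + next_page[0].attrib['href'])


if __name__ == '__main__':
    with requests.Session() as session:
        quotes_scraper(session, BASE_URL)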
answered Aug 31, 2017 at 14:16
  • That CSS selectors can be written like this was foreign to me. I never learnt it from the source; I try to learn by studying how others write them and implementing that in a scraper. Btw, if you have time, just bring me up to date on these two terms: synchronous and asynchronous. I know what they mean according to the dictionary, but what they mean in programming languages is unknown to me. Thanks sir, for your descriptive review. Commented Aug 31, 2017 at 14:39
  • @Shahin you'll develop a sense of which locators are more reliable and which are not, which are more concise and readable and which are not, as you write more and more scrapers - and I can see that the quality of your scrapers has already improved over time. As far as sync vs async, I suggest you look through this topic, the Scrapy architecture and this guy :) Thanks. Commented Aug 31, 2017 at 14:46

Even though you are trying to mimic what a Scrapy spider might look like, there is a major high-level difference between how your code is executed and how a Scrapy spider is.

Scrapy is entirely asynchronous since it is based on the Twisted networking library, which makes the code operate in a non-blocking fashion; to quote the documentation:

Requests are scheduled and processed asynchronously. This means that Scrapy doesn’t need to wait for a request to be finished and processed, it can send another request or do other things in the meantime. This also means that other requests can keep going even if some request fails or an error happens while handling it.

But, in your case, you do go through the whole process in a blocking mode, processing requests one by one:

  1. Extract tag links from the main page
  2. Make a request to the next tag in the list and wait for the response
  3. Extract quotes and print them
  4. Extract the link to the next page and go back to step 2

Every time requests makes a request, you wait. Scrapy would do other things while waiting; that's the main high-level difference.
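
To make the contrast concrete, here is a minimal sketch of the non-blocking idea using a plain thread pool. This is not how Scrapy works internally (Scrapy uses Twisted); it only illustrates that several downloads can be in flight while others are being processed. The site and selectors are the ones from the question, and pagination is omitted for brevity:

from concurrent.futures import ThreadPoolExecutor

import requests
from lxml import html

BASE_URL = "http://quotes.toscrape.com/"


def scrape_tag(href):
    # Each call still blocks its own worker thread, but the pool keeps
    # several downloads in flight at the same time.
    root = html.fromstring(requests.get(BASE_URL + href, timeout=10).text)
    return [quote.cssselect("span.text")[0].text
            for quote in root.cssselect("div.quote")]


tree = html.fromstring(requests.get(BASE_URL, timeout=10).text)
hrefs = [a.attrib['href'] for a in tree.cssselect("span.tag-item a.tag")]

with ThreadPoolExecutor(max_workers=5) as pool:
    for quotes in pool.map(scrape_tag, hrefs):
        for quote in quotes:
            print(quote)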

And, by the way, if you do not assign priority values to your Scrapy requests (by default all requests have the same priority), there is no enforced order in which the requests are going to be processed.
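
For example, a hypothetical spider could bump the priority of the "next page" requests if the crawl order mattered to you (the default priority is 0, and higher-priority requests are scheduled earlier) - a sketch, not something your current script needs:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]

    def parse(self, response):
        for quote in response.css("div.quote span.text::text").extract():
            yield {"quote": quote}

        # Ask the scheduler to handle pagination before other queued requests.
        for href in response.css("li.next a::attr(href)").extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse,
                                 priority=10)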

answered Aug 31, 2017 at 14:08
  • I needed to know how Scrapy applies its logic. Right you are: even if there is an error, Scrapy doesn't get stuck there; rather, it goes ahead until its execution is exhausted. I never analyzed its architectural pattern or the way it moves through a site, so I judged it perfunctorily. You helped me learn at least a bit. Thanks sir. Commented Aug 31, 2017 at 14:32

Things to improve with requests library usage:

  • timeouts: without a timeout, your code may hang for minutes or more. source
  • User-Agent: some websites do not accept requests made with a bad User-Agent; e.g. the default requests User-Agent is 'python-requests/2.13.0'

Here is an example (the user agent and timeout values below are placeholders for you to fill in):

import requests

USER_AGENT = "my-scraper/1.0 (example value)"  # placeholder; pick something descriptive
REQUESTS_TIMEOUT = 10                          # seconds; placeholder value

r = requests.get("http://quotes.toscrape.com/",
                 headers={"User-Agent": USER_AGENT},
                 timeout=REQUESTS_TIMEOUT)
answered Sep 4, 2017 at 18:16
