I am trying to use Scrapy for one of the sites I've scraped before using Selenium over here.
Because the search field for this site is dynamically generated, and requires the user to hover the cursor over a button before it appears, I can't find a way to POST the query using Requests or a Scrapy spider alone.
In the scrapy shell, though, I can run:
fetch(FormRequest.from_response(response,
    formdata={'.search-left input': "尹至"},
    callback=self.search_result))
However, I have no way to tell whether the search query succeeded.
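One rough sanity check is to look for the search term in the body of the fetched response: a successful search echoes the keyword back in the result captions. A minimal sketch (the helper name is hypothetical, and the heuristic assumes the results page repeats the keyword):

```python
def query_succeeded(response_text: str, query: str) -> bool:
    # Rough heuristic: a successful search echoes the keyword
    # back somewhere in the result listing.
    return query in response_text


# In scrapy shell, after fetch(...), one could then check:
#     query_succeeded(response.text, "尹至")
```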
Here is a simple, working piece of code on which I will base my spider:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver import Firefox, FirefoxProfile
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time


def parse(url):
    with Firefox() as driver:
        driver.get(url)
        wait = WebDriverWait(driver, 100)
        xpath = "//form/button/input"
        element_to_hover_over = driver.find_element_by_xpath(xpath)
        hover = ActionChains(driver).move_to_element(element_to_hover_over)
        hover.perform()
        search = wait.until(
            EC.presence_of_element_located((By.ID, 'showkeycode1015273'))
        )
        search.send_keys("尹至")
        search.submit()
        time.sleep(5)
        rows = driver.find_elements_by_css_selector(".search_list > li")
        for row in rows:
            caption_elems = row.find_element_by_tag_name('a')
            yield {
                "caption": caption_elems.text,
                "date": row.find_element_by_class_name('time').text,
                "url": caption_elems.get_attribute('href')
            }


x = parse('https://www.ctwx.tsinghua.edu.cn')
for rslt in x:
    print(rslt)
The Scrapy spider below stops short of entering the search query when I run scrapy crawl qinghua.
import scrapy
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver import Firefox, FirefoxProfile
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time


class QinghuaSpider(scrapy.Spider):
    name = 'qinghua'
    allowed_domains = ['https://www.ctwx.tsinghua.edu.cn']
    start_urls = ['https://www.ctwx.tsinghua.edu.cn']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        with Firefox() as driver:
            driver.get(response.url)
            wait = WebDriverWait(self.driver, 100)
            xpath = "//form/button/input"
            element_to_hover_over = driver.find_element_by_xpath(xpath)
            hover = ActionChains(driver).move_to_element(element_to_hover_over)
            hover.perform()
            search = wait.until(
                EC.presence_of_element_located((By.ID, 'showkeycode1015273'))
            )
            search.send_keys("尹至")
            search.submit()
            time.sleep(5)
            rows = self.driver.find_elements_by_css_selector(".search_list > li")
            for row in rows:
                caption_elems = row.find_element_by_tag_name('a')
                yield {
                    "caption": caption_elems.text,
                    "date": row.find_element_by_class_name('time').text,
                    "url": caption_elems.get_attribute('href')
                }
        # return FormRequest.from_response(
        #     response,
        #     formdata={'.search-left input': "尹至"},
        #     callback=self.search_result)

    def search_result(self, response):
        pass
I would like to ask:
- Why the spider code doesn't work, and
- How to do this properly in Scrapy, with or (preferably) without the help of Selenium.
I suspect this website has a robust anti-bot infrastructure that can prevent spiders from operating properly.
1 Answer
the search field for this site is dynamically generated
That doesn't matter, since - if you bypass the UI - the form field name itself is not dynamic. Even if you were to keep using Selenium, it should be possible to write an element selector that does not need to rely on the dynamic attributes of the search field.
Why the spider code doesn't work
Non-working code is off-topic, so I'm ignoring that part.
I suspect this website has a robust anti-bot infrastructure that can prevent spiders from operating properly.
It actually doesn't (thankfully); my prior difficulties were due to a silly error on my part: I had omitted some form entries. There is no need to manipulate headers or cookies, or even to fill in a fake user agent.
So in terms of review, the usual applies: use Requests if you can; improve your type safety; avoid dictionaries for internal data.
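On the last point, a stdlib dataclasses sketch illustrates the idea (the suggested code below uses the attrs variant instead); the field values here are taken from the first search result:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SearchResult:
    # A typed container instead of an ad-hoc dict: attribute access,
    # a readable repr, and typo-safety for field names.
    caption: str
    when: date
    url: str


r = SearchResult(caption='出土文献研究与保护中心2020年报',
                 when=date(2021, 4, 9),
                 url='info/1041/2615.htm')
```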
Suggested
from base64 import b64encode
from datetime import date
from typing import Iterable, ClassVar

from attr import dataclass
from bs4 import BeautifulSoup, SoupStrainer, Tag
from requests import Session


@dataclass
class Result:
    caption: str
    when: date
    path: str

    @classmethod
    def from_list_item(cls, item: Tag) -> 'Result':
        return cls(
            caption=item.a.text,
            path=item.a['href'],
            when=date.fromisoformat(item.find('span', recursive=False).text),
        )


class TsinghuaSite:
    subdoc: ClassVar[SoupStrainer] = SoupStrainer(name='ul', class_='search_list')

    def __init__(self):
        self.session = Session()

    def __enter__(self) -> 'TsinghuaSite':
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.session.close()

    def search(self, query: str) -> Iterable[Result]:
        with self.session.post(
            'https://www.ctwx.tsinghua.edu.cn/search.jsp',
            params={'wbtreeid': 1001},
            data={
                'lucenenewssearchkey': b64encode(query.encode()),
                '_lucenesearchtype': '1',
                'searchScope': '0',
                'x': '0',
                'y': '0',
            },
        ) as resp:
            resp.raise_for_status()
            doc = BeautifulSoup(markup=resp.text, features='html.parser', parse_only=self.subdoc)

        for item in doc.find('ul', recursive=False).find_all('li', recursive=False):
            yield Result.from_list_item(item)


def main():
    with TsinghuaSite() as site:
        query = '尹至'
        results = tuple(site.search(query))

    assert any(query in r.caption for r in results)
    for result in results:
        print(result)


if __name__ == '__main__':
    main()
Output
Result(caption='出土文献研究与保护中心2020年报', when=datetime.date(2021, 4, 9), path='info/1041/2615.htm')
Result(caption='《战国秦汉文字与文献论稿》出版', when=datetime.date(2020, 7, 17), path='info/1012/1289.htm')
Result(caption='【光明日报】清华简十年:古书重现与古史新探', when=datetime.date(2018, 12, 25), path='info/1072/1551.htm')
Result(caption='《清華簡與古史探賾》出版', when=datetime.date(2018, 8, 30), path='info/1012/1436.htm')
Result(caption='【出土文獻第九輯】鄔可晶:《尹至》"惟(肉哉)虐德暴(身童)亡典"句試解', when=datetime.date(2018, 5, 24), path='info/1073/1952.htm')
Result(caption='【出土文獻第五輯】袁金平:從《尹至》篇"播"字的討論談文義對文字考釋的重要性', when=datetime.date(2018, 4, 26), path='info/1081/2378.htm')
Result(caption='【出土文獻第五輯】袁金平:從《尹至》篇"播"字的討論談文義對文字考釋的重要性', when=datetime.date(2018, 4, 26), path='info/1081/2378.htm')
Result(caption='【出土文獻第二輯】羅 琨:讀《尹至》"自夏徂亳"', when=datetime.date(2018, 4, 12), path='info/1081/2283.htm')
Result(caption='【出土文獻第二輯】羅 琨:讀《尹至》"自夏徂亳"', when=datetime.date(2018, 4, 12), path='info/1081/2283.htm')
Result(caption='《出土文獻》(第九輯)出版', when=datetime.date(2016, 10, 26), path='info/1012/1411.htm')
Result(caption='《出土文獻研究》第十三輯出版', when=datetime.date(2015, 4, 8), path='info/1012/1396.htm')
Result(caption='清華大學藏戰國竹簡第五冊相關研究論文', when=datetime.date(2015, 4, 8), path='info/1081/2215.htm')
Result(caption='清華大學藏戰國竹簡第五冊相關研究論文', when=datetime.date(2015, 4, 8), path='info/1081/2215.htm')
Result(caption='《出土文獻》(第五輯)出版', when=datetime.date(2014, 10, 13), path='info/1012/1393.htm')
Result(caption='清华简入选《国家珍贵古籍名录》', when=datetime.date(2013, 12, 11), path='info/1072/1496.htm')
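Since the question also asked how to do this in Scrapy itself, the same form fields shown in the requests code above can be sent with scrapy.FormRequest instead. A minimal sketch of the payload construction (pure standard library; the helper name is made up, and the field names are the ones from the suggested code):

```python
from base64 import b64encode


def build_search_formdata(query: str) -> dict:
    # The site expects the keyword base64-encoded in the
    # 'lucenenewssearchkey' field; the remaining fields mirror
    # what the browser form submits.
    return {
        'lucenenewssearchkey': b64encode(query.encode()).decode(),
        '_lucenesearchtype': '1',
        'searchScope': '0',
        'x': '0',
        'y': '0',
    }
```

In a spider's parse method one would then yield something like scrapy.FormRequest('https://www.ctwx.tsinghua.edu.cn/search.jsp?wbtreeid=1001', formdata=build_search_formdata('尹至'), callback=self.search_result) — no Selenium, hovering, or waiting required, since the UI is bypassed entirely.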
- Could you elaborate on what's wrong with using Selenium? Are you talking about it in general or about OP's case in particular? – Konstantin Kostanzhoglo, Aug 3, 2021
- @KonstantinKostanzhoglo Thanks; I added some nuance. The answer is "don't use Selenium, certainly in this case, but also usually in general it should be avoided for scraping unless there are no alternatives". – Reinderien, Aug 3, 2021
- @KonstantinKostanzhoglo That's kind of not the right question to ask. Instead, you should ask: "Do I really need to worry about scrolling and clicking, or can I bypass the UI entirely?" Bypassing the UI entirely is strongly preferred. – Reinderien, Aug 3, 2021
- @Sati I rewrote this answer. It turns out it actually is trivially easy to use requests after all; I had just forgotten to include some form fields. Regarding your question on reading: basically, search for "site reverse engineering"; there are many guides on this, e.g. on Medium. – Reinderien, Aug 4, 2021
- @AlexDotis Best practice for Python class member variables is to set them on the instance in __init__, rather than having them first appear in another function. So either the session would need to be constructed as an Optional[] equal to None and then written in __enter__, which is awkward; or just initialized in the constructor. One other difference is that if a caller instantiates TsinghuaSite but then does not use context management, the class will still work (whereas it would not if the session were constructed in __enter__). – Reinderien, Nov 6, 2021