I am trying to use Scrapy for one of the sites I've scraped before using Selenium over here.
Because the search field for this site is dynamically generated, and requires the user to hover the cursor over a button before it appears, I can't find a way to POST the query using Requests or a Scrapy spider alone.
In the scrapy shell, though, I can run:
fetch(FormRequest.from_response(response,
    formdata={'.search-left input': "尹至"},
    callback=self.search_result))
However, I have no way to tell whether the search query succeeded.
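One rough sanity check is to look for the search term in the body of the fetched response: a successful search echoes the keyword back in the result captions. A minimal sketch (the helper name is hypothetical, and the heuristic assumes the results page repeats the keyword):

```python
def query_succeeded(response_text: str, query: str) -> bool:
    # Rough heuristic: a successful search echoes the keyword
    # back somewhere in the result listing.
    return query in response_text


# In scrapy shell, after fetch(...), one could then check:
#     query_succeeded(response.text, "尹至")
```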
Here is a simple, working piece of code on which I will base my spider:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver import Firefox, FirefoxProfile
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time


def parse(url):
    with Firefox() as driver:
        driver.get(url)
        wait = WebDriverWait(driver, 100)
        xpath = "//form/button/input"
        element_to_hover_over = driver.find_element_by_xpath(xpath)
        hover = ActionChains(driver).move_to_element(element_to_hover_over)
        hover.perform()
        search = wait.until(
            EC.presence_of_element_located((By.ID, 'showkeycode1015273'))
        )
        search.send_keys("尹至")
        search.submit()
        time.sleep(5)
        rows = driver.find_elements_by_css_selector(".search_list > li")
        for row in rows:
            caption_elems = row.find_element_by_tag_name('a')
            yield {
                "caption": caption_elems.text,
                "date": row.find_element_by_class_name('time').text,
                "url": caption_elems.get_attribute('href')
            }


x = parse('https://www.ctwx.tsinghua.edu.cn')
for rslt in x:
    print(rslt)
The Scrapy spider below stops short of entering the search query when I run scrapy crawl qinghua.
import scrapy
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver import Firefox, FirefoxProfile
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
import time


class QinghuaSpider(scrapy.Spider):
    name = 'qinghua'
    allowed_domains = ['https://www.ctwx.tsinghua.edu.cn']
    start_urls = ['https://www.ctwx.tsinghua.edu.cn']

    def __init__(self):
        self.driver = webdriver.Firefox()

    def parse(self, response):
        with Firefox() as driver:
            driver.get(response.url)
            wait = WebDriverWait(self.driver, 100)
            xpath = "//form/button/input"
            element_to_hover_over = driver.find_element_by_xpath(xpath)
            hover = ActionChains(driver).move_to_element(element_to_hover_over)
            hover.perform()
            search = wait.until(
                EC.presence_of_element_located((By.ID, 'showkeycode1015273'))
            )
            search.send_keys("尹至")
            search.submit()
            time.sleep(5)
            rows = self.driver.find_elements_by_css_selector(".search_list > li")
            for row in rows:
                caption_elems = row.find_element_by_tag_name('a')
                yield {
                    "caption": caption_elems.text,
                    "date": row.find_element_by_class_name('time').text,
                    "url": caption_elems.get_attribute('href')
                }
        # return FormRequest.from_response(
        #     response,
        #     formdata={'.search-left input': "尹至"},
        #     callback=self.search_result)

    def search_result(self, response):
        pass
I would like to ask:
- Why the spider code doesn't work, and
- How to do this properly in Scrapy, with or (preferably) without the help of Selenium.
I suspect this website has a robust anti-bot infrastructure that can prevent spiders from operating properly.
1 Answer
the search field for this site is dynamically generated
That doesn't matter, since - if you bypass the UI - the form field name itself is not dynamic. Even if you were to keep using Selenium, it should be possible to write an element selector that does not need to rely on the dynamic attributes of the search field.
Why the spider code doesn't work
Non-working code is off-topic, so I'm ignoring that part.
I suspect this website has a robust anti-bot infrastructure that can prevent spiders from operating properly.
It actually doesn't (thankfully); my prior difficulties were due to a silly error on my part: I had omitted some form entries. There is no need to manipulate headers or cookies, or even to fill in a fake user agent.
So in terms of review, the usual applies: use Requests if you can; improve your type safety; avoid dictionaries for internal data.
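On the last point, a stdlib dataclasses sketch illustrates the idea (the suggested code below uses the attrs variant instead); the field values here are taken from the first search result:

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class SearchResult:
    # A typed container instead of an ad-hoc dict: attribute access,
    # a readable repr, and typo-safety for field names.
    caption: str
    when: date
    url: str


r = SearchResult(caption='出土文献研究与保护中心2020年报',
                 when=date(2021, 4, 9),
                 url='info/1041/2615.htm')
```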
Suggested
from base64 import b64encode
from datetime import date
from typing import Iterable, ClassVar

from attr import dataclass
from bs4 import BeautifulSoup, SoupStrainer, Tag
from requests import Session


@dataclass
class Result:
    caption: str
    when: date
    path: str

    @classmethod
    def from_list_item(cls, item: Tag) -> 'Result':
        return cls(
            caption=item.a.text,
            path=item.a['href'],
            when=date.fromisoformat(item.find('span', recursive=False).text),
        )


class TsinghuaSite:
    subdoc: ClassVar[SoupStrainer] = SoupStrainer(name='ul', class_='search_list')

    def __init__(self):
        self.session = Session()

    def __enter__(self) -> 'TsinghuaSite':
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self.session.close()

    def search(self, query: str) -> Iterable[Result]:
        with self.session.post(
            'https://www.ctwx.tsinghua.edu.cn/search.jsp',
            params={'wbtreeid': 1001},
            data={
                'lucenenewssearchkey': b64encode(query.encode()),
                '_lucenesearchtype': '1',
                'searchScope': '0',
                'x': '0',
                'y': '0',
            },
        ) as resp:
            resp.raise_for_status()
            doc = BeautifulSoup(markup=resp.text, features='html.parser', parse_only=self.subdoc)

        for item in doc.find('ul', recursive=False).find_all('li', recursive=False):
            yield Result.from_list_item(item)


def main():
    with TsinghuaSite() as site:
        query = '尹至'
        results = tuple(site.search(query))

    assert any(query in r.caption for r in results)
    for result in results:
        print(result)


if __name__ == '__main__':
    main()
Output
Result(caption='出土文献研究与保护中心2020年报', when=datetime.date(2021, 4, 9), path='info/1041/2615.htm')
Result(caption='《战国秦汉文字与文献论稿》出版', when=datetime.date(2020, 7, 17), path='info/1012/1289.htm')
Result(caption='【光明日报】清华简十年:古书重现与古史新探', when=datetime.date(2018, 12, 25), path='info/1072/1551.htm')
Result(caption='《清華簡與古史探賾》出版', when=datetime.date(2018, 8, 30), path='info/1012/1436.htm')
Result(caption='【出土文獻第九輯】鄔可晶:《尹至》"惟(肉哉)虐德暴(身童)亡典"句試解', when=datetime.date(2018, 5, 24), path='info/1073/1952.htm')
Result(caption='【出土文獻第五輯】袁金平:從《尹至》篇"播"字的討論談文義對文字考釋的重要性', when=datetime.date(2018, 4, 26), path='info/1081/2378.htm')
Result(caption='【出土文獻第五輯】袁金平:從《尹至》篇"播"字的討論談文義對文字考釋的重要性', when=datetime.date(2018, 4, 26), path='info/1081/2378.htm')
Result(caption='【出土文獻第二輯】羅 琨:讀《尹至》"自夏徂亳"', when=datetime.date(2018, 4, 12), path='info/1081/2283.htm')
Result(caption='【出土文獻第二輯】羅 琨:讀《尹至》"自夏徂亳"', when=datetime.date(2018, 4, 12), path='info/1081/2283.htm')
Result(caption='《出土文獻》(第九輯)出版', when=datetime.date(2016, 10, 26), path='info/1012/1411.htm')
Result(caption='《出土文獻研究》第十三輯出版', when=datetime.date(2015, 4, 8), path='info/1012/1396.htm')
Result(caption='清華大學藏戰國竹簡第五冊相關研究論文', when=datetime.date(2015, 4, 8), path='info/1081/2215.htm')
Result(caption='清華大學藏戰國竹簡第五冊相關研究論文', when=datetime.date(2015, 4, 8), path='info/1081/2215.htm')
Result(caption='《出土文獻》(第五輯)出版', when=datetime.date(2014, 10, 13), path='info/1012/1393.htm')
Result(caption='清华简入选《国家珍贵古籍名录》', when=datetime.date(2013, 12, 11), path='info/1072/1496.htm')
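Since the question also asked how to do this in Scrapy itself, the same form fields shown in the requests code above can be sent with scrapy.FormRequest instead. A minimal sketch of the payload construction (pure standard library; the helper name is made up, and the field names are the ones from the suggested code):

```python
from base64 import b64encode


def build_search_formdata(query: str) -> dict:
    # The site expects the keyword base64-encoded in the
    # 'lucenenewssearchkey' field; the remaining fields mirror
    # what the browser form submits.
    return {
        'lucenenewssearchkey': b64encode(query.encode()).decode(),
        '_lucenesearchtype': '1',
        'searchScope': '0',
        'x': '0',
        'y': '0',
    }
```

In a spider's parse method one would then yield something like scrapy.FormRequest('https://www.ctwx.tsinghua.edu.cn/search.jsp?wbtreeid=1001', formdata=build_search_formdata('尹至'), callback=self.search_result) — no Selenium, hovering, or waiting required, since the UI is bypassed entirely.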
- Could you elaborate on what's wrong with using Selenium? Are you talking about it in general or about OP's case in particular? – Konstantin Kostanzhoglo, Aug 3, 2021
- @KonstantinKostanzhoglo Thanks; I added some nuance. The answer is "don't use Selenium, certainly in this case, but also usually in general it should be avoided for scraping unless there are no alternatives". – Reinderien, Aug 3, 2021
- @KonstantinKostanzhoglo That's kind of not the right question to ask. Instead, you should ask: "Do I really need to worry about scrolling and clicking, or can I bypass the UI entirely?" Bypassing the UI entirely is strongly preferred. – Reinderien, Aug 3, 2021
- @Sati I rewrote this answer. It turns out it actually is trivially easy to use requests after all; I had just forgotten to include some form fields. Regarding your question on reading: basically, search for "site reverse engineering"; there are many guides on this, e.g. on Medium. – Reinderien, Aug 4, 2021
- @AlexDotis Best practice for Python class member variables is to set them on the instance in __init__, rather than having them first appear in another function. So either the session would need to be constructed as an Optional[] equal to None and then written in __enter__, which is awkward; or just initialized in the constructor. One other difference is that if a caller instantiates TsinghuaSite but then does not use context management, the class will still work (whereas it would not if the session were constructed in __enter__). – Reinderien, Nov 6, 2021