Scraping tables from the web gets complicated when there are two or more values in a cell. To preserve the table structure, I devised a way to count the row-number index in the xpath, building a nested list whenever the row number stays the same.
def get_structured_elements(self, name):
    """For target data that is nested and structured,
    such as a table with multiple values in a single cell."""
    driver = self.driver
    i = 2  # row counter; keep track of 'i' to retain the document structure.
    number_of_items = self.number_of_items_found()
    elements = [None] * number_of_items  # pre-allocate one slot per row.
    while i - 2 < number_of_items:
        # Re-fetch the anchors for the current row on every pass.
        target_data = driver.find_elements_by_xpath(
            "//table/tbody/tr[" + str(i) + "]/td[2]/a")
        for item in target_data:
            # print(item.text, i - 1)
            if elements[i - 2] is None:
                elements[i - 2] = item.text  # slot empty: store the value.
            elif isinstance(elements[i - 2], list):
                elements[i - 2].append(item.text)
            else:
                # slot occupied: nest the old value and append the new one.
                elements[i - 2] = [elements[i - 2]]
                elements[i - 2].append(item.text)
        i += 1
    return elements
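For illustration, a row whose cell holds two links ends up as a nested list in that row's slot (the values below are made up):

# elements == ['Title A', ['Name B1', 'Name B2'], 'Title C', ...]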
This simple logic was working fine, until I sought to manage all locator variables in one place to make the code more reusable: how do I store the expression "//table/tbody/tr[" + str(i) + "]/td[2]/a" in a list or dictionary so that it still works when plugged in?
The solution (i.e. hack) I came up with is a function that takes the front and back halves of the iterating xpath as arguments and returns front_half + str(i) + back_half if i is among the parent (iterator) function's local variables.
def split_xpath_at_i(front_half, back_half):
"""Splits xpath string at its counter index.
    The 'else' part is to avoid errors
when this function is called outside an indexed environment. """
if 'i' in locals():
string = front_half + str(i) + back_half
else:
string = front_half+"SPLIT_i"+back_half
return string
xpath = [split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"),
"//table/tbody/tr/td[3]/a[1]"
]
def xpath_index_iterator():
for i in range(10):
print(split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"))
xpath_index_iterator()
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
The problem is that split_xpath_at_i is blind to variables in its immediate environment. What I eventually came up with is to use a function attribute on the iterator to hold the counter i, so that the variable can be made available to split_xpath_at_i, like so:
def split_xpath_at_i(front_half, back_half):
"""Splits xpath string at its counter index.
    The 'else' part is to avoid errors
when this function is called outside an indexed environment. """
try:
i = xpath_index_iterator.i
except:
pass
if 'i' in locals():
string = front_half + str(i) + back_half
else:
string = front_half+"SPLIT_i"+back_half
return string
xpath = [split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"),
"//table/tbody/tr/td[3]/a[1]"
]
def xpath_index_iterator():
xpath_index_iterator.i = 0
lst = []
for xpath_index_iterator.i in range(10):
print(split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"))
xpath_index_iterator()
# //table/tbody/tr[0]/td[2]/a
# //table/tbody/tr[1]/td[2]/a
# //table/tbody/tr[2]/td[2]/a
# //table/tbody/tr[3]/td[2]/a
# //table/tbody/tr[4]/td[2]/a
# //table/tbody/tr[5]/td[2]/a
# //table/tbody/tr[6]/td[2]/a
# //table/tbody/tr[7]/td[2]/a
# //table/tbody/tr[8]/td[2]/a
# //table/tbody/tr[9]/td[2]/a
The problem gets more complicated when I try to invoke split_xpath_at_i
via a locator list:
def split_xpath_at_i(front_half, back_half):
"""Splits xpath string at its counter index.
    The 'else' part is to avoid errors
when this function is called outside an indexed environment. """
try:
i = xpath_index_iterator.i
except:
pass
if 'i' in locals():
string = front_half + str(i) + back_half
else:
string = front_half+"SPLIT_i"+back_half
return string
xpath = [split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"),
"//table/tbody/tr/td[3]/a[1]"
]
def xpath_index_iterator():
xpath_index_iterator.i = 0
lst = []
for xpath_index_iterator.i in range(10):
# print(split_xpath_at_i("//table/tbody/tr[","]/td[2]/a"))
lst.append(xpath[0])
return lst
xpath_index_iterator()
# ['//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a']
What would a professional approach to this problem look like?
The Entire Code:
The code below was modified from the Selenium manual.
I've asked a related question over here that concerns the general approach to Page Objects design.
test.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from query import Input
import page
cnki = Input()
driver = cnki.webpage('http://big5.oversea.cnki.net/kns55/')
current_page = page.MainPage(driver)
current_page.submit_search('禮學')
current_page.switch_to_frame()
result = page.SearchResults(driver)
structured = result.get_structured_elements('titles') # I couldn't get this to work.
simple = result.simple_get_structured_elements() # but this works fine.
query.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from selenium import webdriver
class Input:
"""This class provides a wrapper around actual working code."""
# CONSTANTS
URL = None
def __init__(self):
self.driver = webdriver.Chrome
def webpage(self, url):
driver = self.driver()
driver.get(url)
return driver
page.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from element import BasePageElement
from locators import InputLocators, OutputLocators
from selenium.common.exceptions import TimeoutException, WebDriverException
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support.ui import WebDriverWait
class SearchTextElement(BasePageElement):
"""This class gets the search text from the specified locator"""
#The locator for search box where search string is entered
locator = None
class BasePage:
"""Base class to initialize the base page that will be called from all
pages"""
def __init__(self, driver):
self.driver = driver
class MainPage(BasePage):
"""Home page action methods come here. I.e. Python.org"""
search_keyword = SearchTextElement()
def submit_search(self, keyword):
"""Submits keyword and triggers the search"""
SearchTextElement.locator = InputLocators.SEARCH_FIELD
self.search_keyword = keyword
def select_dropdown_item(self, item):
driver = self.driver
by, val = InputLocators.SEARCH_ATTR
driver.find_element(by, val + "/option[text()='" + item + "']").click()
def click_search_button(self):
driver = self.driver
element = driver.find_element(*InputLocators.SEARCH_BUTTON)
element.click()
def switch_to_frame(self):
"""Use this function to get access to hidden elements. """
driver = self.driver
driver.switch_to.default_content()
driver.switch_to.frame('iframeResult')
# Maximize the number of items on display in the search results.
def max_content(self):
driver = self.driver
max_content = driver.find_element_by_css_selector('#id_grid_display_num > a:nth-child(3)')
max_content.click()
def stop_loading_page_when_element_is_present(self, locator):
driver = self.driver
ignored_exceptions = (NoSuchElementException, StaleElementReferenceException)
wait = WebDriverWait(driver, 30, ignored_exceptions=ignored_exceptions)
wait.until(
EC.presence_of_element_located(locator))
driver.execute_script("window.stop();")
def next_page(self):
driver = self.driver
self.stop_loading_page_when_element_is_present(InputLocators.NEXT_PAGE)
driver.execute_script("window.stop();")
try:
driver.find_element(*InputLocators.NEXT_PAGE).click()
print("Navigating to Next Page")
except (TimeoutException, WebDriverException):
print("Last page reached")
class SearchResults(BasePage):
"""Search results page action methods come here"""
def __init__(self, driver):
self.driver = driver
i = None # get_structured_element counter
def wait_for_page_to_load(self):
driver = self.driver
wait = WebDriverWait(driver, 100)
wait.until(
            EC.presence_of_element_located(InputLocators.MAIN_BODY))
def get_single_element(self, name):
"""Returns a single value as target data."""
driver = self.driver
target_data = driver.find_element(*OutputLocators.CNKI[str(name.upper())])
# SearchTextElement.locator = OutputLocators.CNKI[str(name.upper())]
# target_data = SearchTextElement()
return target_data
def number_of_items_found(self):
"""Return the number of items found on a single page."""
driver = self.driver
target_data = driver.find_elements(*OutputLocators.CNKI['INDEX'])
return len(target_data)
def get_elements(self, name):
"""Returns simple list of values in specific data field in a table."""
driver = self.driver
target_data = driver.find_elements(*OutputLocators.CNKI[str(name.upper())])
elements = []
for item in target_data:
elements.append(item.text)
return elements
def get_structured_elements(self, name):
"""For target data that is nested and structured,
such as a table with multiple values in a single cell."""
driver = self.driver
i = 2 # keep track of 'i' to retain the document structure.
number_of_items = self.number_of_items_found()
elements = [None] * number_of_items
while i - 2 < number_of_items:
target_data = driver.find_elements(*OutputLocators.CNKI[str(name.upper())])
for item in target_data:
print(item.text, i - 1)
if elements[i - 2] == None:
elements[i - 2] = item.text
elif isinstance(elements[i - 2], list):
elements[i - 2].append(item.text)
else:
elements[i - 2] = [elements[i - 2]]
elements[i - 2].append(item.text)
i += 1
return elements
def simple_get_structured_elements(self):
"""Simple structured elements code with fixed xpath."""
driver = self.driver
i = 2 # keep track of 'i' to retain the document structure.
number_of_items = self.number_of_items_found()
elements = [None] * number_of_items
while i - 2 < number_of_items:
target_data = driver.find_elements_by_xpath\
('//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr['\
+ str(i) + ']/td[2]/a')
for item in target_data:
print(item.text, i-1)
if elements[i - 2] == None:
elements[i - 2] = item.text
elif isinstance(elements[i - 2], list):
elements[i - 2].append(item.text)
else:
elements[i - 2] = [elements[i - 2]]
elements[i - 2].append(item.text)
i += 1
return elements
element.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from selenium.webdriver.support.ui import WebDriverWait
class BasePageElement():
"""Base page class that is initialized on every page object class."""
def __set__(self, obj, value):
"""Sets the text to the value supplied"""
driver = obj.driver
text_field = WebDriverWait(driver, 100).until(
lambda driver: driver.find_element(*self.locator))
text_field.clear()
text_field.send_keys(value)
text_field.submit()
def __get__(self, obj, owner):
"""Gets the text of the specified object"""
driver = obj.driver
WebDriverWait(driver, 100).until(
lambda driver: driver.find_element(*self.locator))
element = driver.find_element(*self.locator)
return element.get_attribute("value")
locators.py
This is where split_xpath_at_i
sits.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from selenium.webdriver.common.by import By
# import page
class InputLocators():
"""A class for main page locators. All main page locators should come here"""
def dropdown_list_xpath(attribute, value):
string = "//select[@" + attribute + "='" + value + "']"
return string
MAIN_BODY = (By.XPATH, '//GridTableContent/tbody')
SEARCH_FIELD = (By.NAME, 'txt_1_value1') # (By.ID, 'search-content-box')
SEARCH_ATTR = (By.XPATH, dropdown_list_xpath('name', 'txt_1_sel'))
SEARCH_BUTTON = (By.ID, 'btnSearch')
NEXT_PAGE = (By.LINK_TEXT, "下頁")
class OutputLocators():
"""A class for search results locators. All search results locators should
come here"""
def split_xpath_at_i(front_half, back_half):
# try:
# i = page.SearchResults.g_s_elem
# except:
# pass
if 'i' in locals():
string = front_half + str(i) + back_half
else:
string = front_half+"SPLIT_i"+back_half
return string
CNKI = {
"TITLES": (By.XPATH, split_xpath_at_i('//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr[', ']/td[2]/a')),
"AUTHORS": (By.XPATH, split_xpath_at_i('//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr[', ']/td[3]/a')),
"JOURNALS": '//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[4]/a',
"YEAR_ISSUE": '//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[5]/a',
"DOWNLOAD_PATHS": '//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[1]/table/tbody/tr/td/a[1]',
"INDEX": (By.XPATH, '//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[1]/table/tbody/tr/td/a[2]')
}
# # Interim Data
# CAPTIONS =
# LINKS =
# Target Data
# TITLES =
# AUTHORS =
# JOURNALS =
# VOL =
# ISSUE =
# DATES =
# DOWNLOAD_PATHS =
1 Answer
First: I would typically recommend that you replace your use of Selenium with direct requests
calls. If it's possible, it's way more efficient than Selenium. It would look like the following, as a very rough start:
from time import time
from typing import Iterable
from urllib.parse import quote
from requests import Session
def js_encode(u: str) -> Iterable[str]:
for char in u:
code = ord(char)
if code < 128:
yield quote(char).lower()
else:
yield f'%u{code:04x}'
def search(query: str):
topic = '主题'
# China Academic Literature Online Publishing Database
catalog = '中国学术文献网络出版总库'
databases = (
'中国期刊全文数据库,' # China Academic Journals Full-text Database
'中国博士学位论文全文数据库,' # China Doctoral Dissertation Full-text Database
'中国优秀硕士学位论文全文数据库,' # China Master's Thesis Full-text Database
'中国重要会议论文全文数据库,' # China Proceedings of Conference Full-text Database
'国际会议论文全文数据库,' # International Proceedings of Conference Full-text Database
'中国重要报纸全文数据库,' # China Core Newspapers Full-text Database
'中国年鉴网络出版总库' # China Yearbook Full-text Database
)
with Session() as session:
session.headers = {
'Accept':
'text/html,'
'application/xhtml+xml,'
'application/xml;q=0.9,'
'image/webp,'
'*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-CA,en-GB;q=0.8,en;q=0.5,en-US;q=0.3',
'Cache-Control': 'no-cache',
'Connection': 'keep-alive',
'DNT': '1',
'Host': 'big5.oversea.cnki.net',
'Pragma': 'no-cache',
'Sec-GPC': '1',
'User-Agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) '
'Gecko/20100101 '
'Firefox/89.0',
'Upgrade-Insecure-Requests': '1',
}
with session.get(
'https://big5.oversea.cnki.net/kns55/brief/result.aspx',
params={
'txt_1_value1': query,
'txt_1_sel': topic,
'dbPrefix': 'SCDB',
'db_opt': catalog,
'db_value': databases,
'search-action': 'brief/result.aspx',
},
) as response:
response.raise_for_status()
search_url = response.url
search_page = response.text
encoded_query = ''.join(js_encode(',' + query))
# epoch milliseconds
timestamp = round(time()*1000)
# page_params = {
# 'curpage': 1,
# 'RecordsPerPage': 20,
# 'QueryID': 0,
# 'ID': '',
# 'turnpage': 1,
# 'tpagemode': 'L',
# 'Fields': '',
# 'DisplayMode': 'listmode',
# 'sKuaKuID': 0,
# }
with session.get(
'http://big5.oversea.cnki.net/kns55/brief/brief.aspx',
params={
'pagename': 'ASP.brief_result_aspx',
'dbPrefix': 'SCDB',
'dbCatalog': catalog,
'ConfigFile': 'SCDB.xml',
'research': 'off',
't': timestamp,
},
cookies={
'FileNameS': quote('cnki:'),
'KNS_DisplayModel': '',
'CurTop10KeyWord': encoded_query,
'RsPerPage': '20',
},
headers={
'Referer': search_url,
}
) as response:
response.raise_for_status()
results_iframe = response.text
def main():
etiquette = '禮學'
search(query=etiquette)
if __name__ == '__main__':
main()
Unfortunately, the design of this website is violently awful. State is passed around using a mix of query parameters, cookies, and server-only context that you can't see and relies on request history in a non-trivial way. So even though the above produces, to my knowledge, identical parameters, headers and cookies to those that you see in real life on the website, there's a failure where a couple of dynamically-generated <script>
sections in brief.aspx
are silently omitted. So I'm giving up on this recommendation.
Shifting gears:
The following recommendations are going to cover scope and class usage, and these should get you toward sanity:
- The code in test.py needs to be moved into a function.
- Only test.py should have a shebang and none of your other files, since only test.py is a meaningful entry point.
- Is Input.URL ever used? That probably needs to be deleted.
- Input.webpage should not be returning anything; driver is already a member on the class.
- Input as a whole is suspect. It provides such a thin wrapper around driver as to be basically useless on its own. I would expect the driver.get() to be moved to MainPage.__init__.
- InputLocators also does not deserve to be a class. Those constants can basically be distributed to the point of use, i.e.
wait.until(
    EC.presence_of_element_located(
        (By.XPATH, '//GridTableContent/tbody'),
    )
)
- Your search_keyword is strange - you start off initializing it as a static, and then change to using it as an instance variable in submit_search. Why? Also, what is keyword? You would benefit from using PEP 484 type hints (see the sketch after this list).
- switch_to_frame has timing issues and did not work for me at all until I added two waits:
WebDriverWait(driver, 100).until(
    lambda driver: driver.find_element(
        By.XPATH, '//iframe[@name="iframeResult"]',
    ))
driver.switch_to.frame('iframeResult')
WebDriverWait(driver, 100).until(
    lambda driver: driver.find_element(
        By.XPATH, '//table[@class="GridTableContent"]',
    ))
- Your () at the end of base-less classes can be dropped.
- OutputLocators.CNKI is a dictionary. Why? get_single_element indexes into it, but get_single_element is itself never called.
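As promised in the list above, a hedged sketch of the type-hint suggestion. Only keyword: str is actually implied by the call site in test.py; the rest is convention:

def submit_search(self, keyword: str) -> None:
    """Submits keyword and triggers the search"""
    SearchTextElement.locator = InputLocators.SEARCH_FIELD
    self.search_keyword = keyword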
This code:
elements = []
for item in target_data:
elements.append(item.text)
return elements
can be replaced with a generator:
for item in target_data:
yield item.text
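One caveat: this turns get_elements into a generator, so a caller that needs indexing or len() would have to materialize it, e.g.

titles = list(result.get_elements('titles'))  # only when a real list is needed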
This code:
i = None # get_structured_element counter
does nothing since all local variables are discarded at end of scope.
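If the intent was a counter that persists on the object, it would have to be an attribute assignment, e.g. (a minimal sketch):

class SearchResults(BasePage):
    def __init__(self, driver):
        self.driver = driver
        # A bare `i = None` vanishes when __init__ returns; only an
        # attribute assignment survives on the instance:
        self.i = None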
This code:
if 'i' in locals():
string = front_half + str(i) + back_half
else:
string = front_half+"SPLIT_i"+back_half
is never going to see its first branch evaluated, since i
is not defined locally. I really don't know what you intended here.
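A minimal demonstration of why that first branch is unreachable (the names here are made up): locals() inside a function sees only that function's own names, never its caller's.

def split_at(front, back):
    # Only 'front' and 'back' are local names here; the caller's i
    # is invisible to locals().
    return 'i' in locals()

def caller():
    i = 3  # local to caller only; never consulted by split_at
    return split_at('tr[', ']')

print(caller())  # False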
These long xpath tree traversals, such as
'//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[1]/table/tbody/tr/td/a[2]'
are both fragile and difficult to read. In most cases you should be able to condense them by a mix of inner //
to omit parts of the path, and judicious references to known attributes.
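For example, a hypothetical condensed form of the traversal quoted above, assuming the GridTableContent class mentioned earlier in this review sits on the inner results table (this must be verified against the real markup):

# Before:
'//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[1]/table/tbody/tr/td/a[2]'
# After (hypothetical, same target):
'//table[@class="GridTableContent"]//tr/td[1]//a[2]'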
You specifically ask about "split_xpath_at_i is blind to variables in its immediate environment".
If by "its immediate environment" you mean CNKI
(etc) that's because its immediate environment, the class static scope, has not yet been initialized. CNKI
can get a reference to it but not the opposite. If you want this to have some kind of state like a counter, then it needs to be promoted to an instance method with a self
parameter. I don't know how g_s_elem
factors into this because it's not defined anywhere.
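A minimal sketch of that promotion, assuming a plain template string is acceptable (the class and names below are hypothetical, not part of the original code):

from selenium.webdriver.common.by import By

class RowLocators:
    """Row-indexed locators; the counter lives on the instance,
    so no function has to guess at its caller's variables."""
    TITLE_TEMPLATE = '//table/tbody/tr[{row}]/td[2]/a'

    def __init__(self, first_row=2):
        self.row = first_row  # data rows start at tr[2] in this layout

    def title(self):
        # Substitute the current counter into the template.
        return (By.XPATH, self.TITLE_TEMPLATE.format(row=self.row))

A caller then advances locators.row in its loop and calls driver.find_elements(*locators.title()), with no reliance on locals() tricks or function attributes.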
You ask:
The SearchTextElement class with just one locator variable that is hardcoded - is that a good approach?
Not really. First of all, you've again conflated static and instance variables, because you first initialize a static variable to None
and then write an instance variable after construction. Why construct a class at all, if it only holds one member and has no methods?
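One hedged alternative: fold the descriptor's behavior into a plain helper function, reusing the body of BasePageElement.__set__ from element.py (the function name here is made up):

from selenium.webdriver.support.ui import WebDriverWait

def submit_search_text(driver, locator, value):
    """Waits for the field at `locator`, then types and submits `value`."""
    field = WebDriverWait(driver, 100).until(
        lambda d: d.find_element(*locator))
    field.clear()
    field.send_keys(value)
    field.submit()

Usage would be submit_search_text(driver, InputLocators.SEARCH_FIELD, keyword), which makes the intent explicit at the call site.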
- I've tried the requests call method before, too. It got stuck after submitting the search query. Selenium provides the driver.switch_to.frame('iframeResult') to get past that stage and provide access to search result elements. I wonder if there is an equivalent in the requests world. What you are describing seems like a different issue altogether. – Sati, Jun 8, 2021 at 1:30
- I see that you have results_iframe = response.text in your sample code as well. – Sati, Jun 8, 2021 at 1:42
- I suspect the "violently awful" design is a deliberate feature to discourage scraping. – Sati, Jun 8, 2021 at 1:45
- There are better ways to discourage scraping - I honestly think this is just a product of bad design. For fun, read some of the JS code. It's legacy 12-year-old ASP with a big, crazy mishmash of iframes, AJAX, commented-out code blocks, and developer notes. Bonus points if you can find the comment a developer left confessing to a bad, bad hack. In short, never attribute to malice that which you can attribute to ignorance. – Reinderien, Jun 8, 2021 at 1:50
- @Sati Updating your answer is discouraged. However, posting a new question with your incorporated changes is more than welcome. – N3buchadnezzar, Jun 8, 2021 at 4:27