
Scraping tables from the web gets complicated when there are two or more values in a single cell. To preserve the table structure, I devised a way to count the row-number index inside the XPath, switching to a nested list whenever the row number stays the same.

def get_structured_elements(self, name):
    """For target data that is nested and structured,
    such as a table with multiple values in a single cell."""
    driver = self.driver

    i = 2  # row index inside the XPath; tracking it preserves the document structure.
    number_of_items = self.number_of_items_found()
    elements = [None] * number_of_items  # one slot per row; a slot becomes a nested list if the row holds several values.

    while i - 2 < number_of_items:
        # Re-query the links in row i of the table.
        target_data = driver.find_elements_by_xpath("//table/tbody/tr[" + str(i) + "]/td[2]/a")
        for item in target_data:
            # print(item.text, i - 1)
            if elements[i - 2] == None:
                elements[i - 2] = item.text  # set to item.text if the position is empty.
            else:
                elements[i - 2] = [elements[i - 2]]
                elements[i - 2].append(item.text)  # make a nested list and append the new value if the position is occupied.
        i += 1

    return elements

This simple logic was working fine until I tried to manage all locator variables in one place to make the code more reusable: how do I store the expression "//table/tbody/tr[" + str(i) + "]/td[2]/a" in a list or dictionary so that it still works when plugged in?

The solution (i.e. hack) I came up with is a function that takes the front and back halves of the iterating XPath as arguments and returns front_half + str(i) + back_half if i is among the parent (iterator) function's local variables.

def split_xpath_at_i(front_half, back_half):
    """Splits the xpath string at its counter index.
    The 'else' part is to avoid errors
    when this function is called outside an indexed environment."""

    if 'i' in locals():
        string = front_half + str(i) + back_half
    else:
        string = front_half + "SPLIT_i" + back_half
    return string

xpath = [split_xpath_at_i("//table/tbody/tr[", "]/td[2]/a"),
         "//table/tbody/tr/td[3]/a[1]",
         ]

def xpath_index_iterator():
    for i in range(10):
        print(split_xpath_at_i("//table/tbody/tr[", "]/td[2]/a"))

xpath_index_iterator()
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a
# //table/tbody/tr[SPLIT_i]/td[2]/a

The problem is that split_xpath_at_i is blind to variables in its immediate environment. What I eventually came up with is to use a function attribute on the iterator to define the counter i, so that the variable becomes available to split_xpath_at_i, like so:

def split_xpath_at_i(front_half, back_half):
    """Splits the xpath string at its counter index.
    The 'else' part is to avoid errors
    when this function is called outside an indexed environment."""
    try:
        i = xpath_index_iterator.i
    except:
        pass

    if 'i' in locals():
        string = front_half + str(i) + back_half
    else:
        string = front_half + "SPLIT_i" + back_half
    return string

xpath = [split_xpath_at_i("//table/tbody/tr[", "]/td[2]/a"),
         "//table/tbody/tr/td[3]/a[1]",
         ]

def xpath_index_iterator():
    xpath_index_iterator.i = 0
    lst = []
    for xpath_index_iterator.i in range(10):
        print(split_xpath_at_i("//table/tbody/tr[", "]/td[2]/a"))

xpath_index_iterator()
# //table/tbody/tr[0]/td[2]/a
# //table/tbody/tr[1]/td[2]/a
# //table/tbody/tr[2]/td[2]/a
# //table/tbody/tr[3]/td[2]/a
# //table/tbody/tr[4]/td[2]/a
# //table/tbody/tr[5]/td[2]/a
# //table/tbody/tr[6]/td[2]/a
# //table/tbody/tr[7]/td[2]/a
# //table/tbody/tr[8]/td[2]/a
# //table/tbody/tr[9]/td[2]/a

The problem gets more complicated when I try to invoke split_xpath_at_i via a locator list:

def split_xpath_at_i(front_half, back_half):
    """Splits the xpath string at its counter index.
    The 'else' part is to avoid errors
    when this function is called outside an indexed environment."""
    try:
        i = xpath_index_iterator.i
    except:
        pass

    if 'i' in locals():
        string = front_half + str(i) + back_half
    else:
        string = front_half + "SPLIT_i" + back_half
    return string

xpath = [split_xpath_at_i("//table/tbody/tr[", "]/td[2]/a"),
         "//table/tbody/tr/td[3]/a[1]",
         ]

def xpath_index_iterator():
    xpath_index_iterator.i = 0
    lst = []
    for xpath_index_iterator.i in range(10):
        # print(split_xpath_at_i("//table/tbody/tr[", "]/td[2]/a"))
        lst.append(xpath[0])
    return lst

xpath_index_iterator()
# ['//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a',
# '//table/tbody/tr[9]/td[2]/a']

What would a professional approach to this problem look like?


The Entire Code:

The code below was modified from the Selenium manual.

I've asked a related question over here that concerns the general approach to Page Objects design.

test.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from query import Input
import page
cnki = Input()
driver = cnki.webpage('http://big5.oversea.cnki.net/kns55/')
current_page = page.MainPage(driver)
current_page.submit_search('禮學')
current_page.switch_to_frame()
result = page.SearchResults(driver)
structured = result.get_structured_elements('titles') # I couldn't get this to work.
simple = result.simple_get_structured_elements() # but this works fine.

query.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from selenium import webdriver
class Input:
    """This class provides a wrapper around actual working code."""

    # CONSTANTS
    URL = None

    def __init__(self):
        self.driver = webdriver.Chrome

    def webpage(self, url):
        driver = self.driver()
        driver.get(url)

        return driver

page.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from element import BasePageElement
from locators import InputLocators, OutputLocators
from selenium.common.exceptions import TimeoutException, WebDriverException
from selenium.common.exceptions import StaleElementReferenceException
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.support.ui import WebDriverWait
class SearchTextElement(BasePageElement):
    """This class gets the search text from the specified locator"""

    # The locator for the search box where the search string is entered
    locator = None


class BasePage:
    """Base class to initialize the base page that will be called from all
    pages"""

    def __init__(self, driver):
        self.driver = driver


class MainPage(BasePage):
    """Home page action methods come here. I.e. Python.org"""

    search_keyword = SearchTextElement()

    def submit_search(self, keyword):
        """Submits keyword and triggers the search"""
        SearchTextElement.locator = InputLocators.SEARCH_FIELD
        self.search_keyword = keyword

    def select_dropdown_item(self, item):
        driver = self.driver
        by, val = InputLocators.SEARCH_ATTR
        driver.find_element(by, val + "/option[text()='" + item + "']").click()

    def click_search_button(self):
        driver = self.driver
        element = driver.find_element(*InputLocators.SEARCH_BUTTON)
        element.click()

    def switch_to_frame(self):
        """Use this function to get access to hidden elements."""
        driver = self.driver
        driver.switch_to.default_content()
        driver.switch_to.frame('iframeResult')

    # Maximize the number of items on display in the search results.
    def max_content(self):
        driver = self.driver
        max_content = driver.find_element_by_css_selector('#id_grid_display_num > a:nth-child(3)')
        max_content.click()

    def stop_loading_page_when_element_is_present(self, locator):
        driver = self.driver

        ignored_exceptions = (NoSuchElementException, StaleElementReferenceException)
        wait = WebDriverWait(driver, 30, ignored_exceptions=ignored_exceptions)

        wait.until(
            EC.presence_of_element_located(locator))
        driver.execute_script("window.stop();")

    def next_page(self):
        driver = self.driver
        self.stop_loading_page_when_element_is_present(InputLocators.NEXT_PAGE)
        driver.execute_script("window.stop();")

        try:
            driver.find_element(*InputLocators.NEXT_PAGE).click()
            print("Navigating to Next Page")
        except (TimeoutException, WebDriverException):
            print("Last page reached")


class SearchResults(BasePage):
    """Search results page action methods come here"""

    def __init__(self, driver):
        self.driver = driver
        i = None  # get_structured_element counter

    def wait_for_page_to_load(self):
        driver = self.driver
        wait = WebDriverWait(driver, 100)
        wait.until(
            EC.presence_of_element_located(*InputLocators.MAIN_BODY))

    def get_single_element(self, name):
        """Returns a single value as target data."""
        driver = self.driver
        target_data = driver.find_element(*OutputLocators.CNKI[str(name.upper())])
        # SearchTextElement.locator = OutputLocators.CNKI[str(name.upper())]
        # target_data = SearchTextElement()
        return target_data

    def number_of_items_found(self):
        """Return the number of items found on a single page."""
        driver = self.driver
        target_data = driver.find_elements(*OutputLocators.CNKI['INDEX'])

        return len(target_data)

    def get_elements(self, name):
        """Returns a simple list of values in a specific data field in a table."""
        driver = self.driver
        target_data = driver.find_elements(*OutputLocators.CNKI[str(name.upper())])

        elements = []
        for item in target_data:
            elements.append(item.text)

        return elements

    def get_structured_elements(self, name):
        """For target data that is nested and structured,
        such as a table with multiple values in a single cell."""
        driver = self.driver
        i = 2  # keep track of 'i' to retain the document structure.
        number_of_items = self.number_of_items_found()
        elements = [None] * number_of_items

        while i - 2 < number_of_items:
            target_data = driver.find_elements(*OutputLocators.CNKI[str(name.upper())])
            for item in target_data:
                print(item.text, i - 1)
                if elements[i - 2] == None:
                    elements[i - 2] = item.text
                elif isinstance(elements[i - 2], list):
                    elements[i - 2].append(item.text)
                else:
                    elements[i - 2] = [elements[i - 2]]
                    elements[i - 2].append(item.text)
            i += 1

        return elements

    def simple_get_structured_elements(self):
        """Simple structured elements code with fixed xpath."""
        driver = self.driver
        i = 2  # keep track of 'i' to retain the document structure.
        number_of_items = self.number_of_items_found()
        elements = [None] * number_of_items

        while i - 2 < number_of_items:
            target_data = driver.find_elements_by_xpath(
                '//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr['
                + str(i) + ']/td[2]/a')
            for item in target_data:
                print(item.text, i - 1)
                if elements[i - 2] == None:
                    elements[i - 2] = item.text
                elif isinstance(elements[i - 2], list):
                    elements[i - 2].append(item.text)
                else:
                    elements[i - 2] = [elements[i - 2]]
                    elements[i - 2].append(item.text)
            i += 1
        return elements

element.py

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from selenium.webdriver.support.ui import WebDriverWait
class BasePageElement():
    """Base page class that is initialized on every page object class."""

    def __set__(self, obj, value):
        """Sets the text to the value supplied"""
        driver = obj.driver

        text_field = WebDriverWait(driver, 100).until(
            lambda driver: driver.find_element(*self.locator))
        text_field.clear()
        text_field.send_keys(value)
        text_field.submit()

    def __get__(self, obj, owner):
        """Gets the text of the specified object"""
        driver = obj.driver

        WebDriverWait(driver, 100).until(
            lambda driver: driver.find_element(*self.locator))
        element = driver.find_element(*self.locator)
        return element.get_attribute("value")

locators.py

This is where split_xpath_at_i sits.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from selenium.webdriver.common.by import By
# import page
class InputLocators():
    """A class for main page locators. All main page locators should come here"""

    def dropdown_list_xpath(attribute, value):
        string = "//select[@" + attribute + "='" + value + "']"

        return string

    MAIN_BODY = (By.XPATH, '//GridTableContent/tbody')
    SEARCH_FIELD = (By.NAME, 'txt_1_value1')  # (By.ID, 'search-content-box')
    SEARCH_ATTR = (By.XPATH, dropdown_list_xpath('name', 'txt_1_sel'))
    SEARCH_BUTTON = (By.ID, 'btnSearch')
    NEXT_PAGE = (By.LINK_TEXT, "下頁")


class OutputLocators():
    """A class for search results locators. All search results locators should
    come here"""

    def split_xpath_at_i(front_half, back_half):
        # try:
        #     i = page.SearchResults.g_s_elem
        # except:
        #     pass
        if 'i' in locals():
            string = front_half + str(i) + back_half
        else:
            string = front_half + "SPLIT_i" + back_half

        return string

    CNKI = {
        "TITLES": (By.XPATH, split_xpath_at_i('//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr[', ']/td[2]/a')),
        "AUTHORS": (By.XPATH, split_xpath_at_i('//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr[', ']/td[3]/a')),
        "JOURNALS": '//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[4]/a',
        "YEAR_ISSUE": '//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[5]/a',
        "DOWNLOAD_PATHS": '//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[1]/table/tbody/tr/td/a[1]',
        "INDEX": (By.XPATH, '//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[1]/table/tbody/tr/td/a[2]')
    }

    # # Interim Data
    # CAPTIONS =
    # LINKS =

    # Target Data
    # TITLES =
    # AUTHORS =
    # JOURNALS =
    # VOL =
    # ISSUE =
    # DATES =
    # DOWNLOAD_PATHS =
asked Jun 7, 2021 at 16:53

Comments:

  • Can you include all of your scraping code, or at least a representative sample of HTML that you're trying to parse? – Commented Jun 7, 2021 at 16:55
  • The whole code consists of a couple of .py files. Would it be too much to post them here? – Commented Jun 7, 2021 at 16:58
  • Nope :) The code length limit is quite high, and for the purposes of a question like this, unless the files are perhaps 1000+ lines each, posting them full-form can only help your question. – Commented Jun 7, 2021 at 17:08
  • Alright, it's done! – Commented Jun 7, 2021 at 17:30

1 Answer
First: I would typically recommend that you replace your use of Selenium with direct requests calls. If it's possible, it's way more efficient than Selenium. It would look like the following, as a very rough start:

from time import time
from typing import Iterable
from urllib.parse import quote
from requests import Session
def js_encode(u: str) -> Iterable[str]:
    for char in u:
        code = ord(char)
        if code < 128:
            yield quote(char).lower()
        else:
            yield f'%u{code:04x}'


def search(query: str):
    topic = '主题'
    # China Academic Literature Online Publishing Database
    catalog = '中国学术文献网络出版总库'
    databases = (
        '中国期刊全文数据库,'  # China Academic Journals Full-text Database
        '中国博士学位论文全文数据库,'  # China Doctoral Dissertation Full-text Database
        '中国优秀硕士学位论文全文数据库,'  # China Master's Thesis Full-text Database
        '中国重要会议论文全文数据库,'  # China Proceedings of Conference Full-text Database
        '国际会议论文全文数据库,'  # International Proceedings of Conference Full-text Database
        '中国重要报纸全文数据库,'  # China Core Newspapers Full-text Database
        '中国年鉴网络出版总库'  # China Yearbook Full-text Database
    )

    with Session() as session:
        session.headers = {
            'Accept':
                'text/html,'
                'application/xhtml+xml,'
                'application/xml;q=0.9,'
                'image/webp,'
                '*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate',
            'Accept-Language': 'en-CA,en-GB;q=0.8,en;q=0.5,en-US;q=0.3',
            'Cache-Control': 'no-cache',
            'Connection': 'keep-alive',
            'DNT': '1',
            'Host': 'big5.oversea.cnki.net',
            'Pragma': 'no-cache',
            'Sec-GPC': '1',
            'User-Agent':
                'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:89.0) '
                'Gecko/20100101 '
                'Firefox/89.0',
            'Upgrade-Insecure-Requests': '1',
        }

        with session.get(
            'https://big5.oversea.cnki.net/kns55/brief/result.aspx',
            params={
                'txt_1_value1': query,
                'txt_1_sel': topic,
                'dbPrefix': 'SCDB',
                'db_opt': catalog,
                'db_value': databases,
                'search-action': 'brief/result.aspx',
            },
        ) as response:
            response.raise_for_status()
            search_url = response.url
            search_page = response.text

        encoded_query = ''.join(js_encode(',' + query))
        # epoch milliseconds
        timestamp = round(time() * 1000)

        # page_params = {
        #     'curpage': 1,
        #     'RecordsPerPage': 20,
        #     'QueryID': 0,
        #     'ID': '',
        #     'turnpage': 1,
        #     'tpagemode': 'L',
        #     'Fields': '',
        #     'DisplayMode': 'listmode',
        #     'sKuaKuID': 0,
        # }

        with session.get(
            'http://big5.oversea.cnki.net/kns55/brief/brief.aspx',
            params={
                'pagename': 'ASP.brief_result_aspx',
                'dbPrefix': 'SCDB',
                'dbCatalog': catalog,
                'ConfigFile': 'SCDB.xml',
                'research': 'off',
                't': timestamp,
            },
            cookies={
                'FileNameS': quote('cnki:'),
                'KNS_DisplayModel': '',
                'CurTop10KeyWord': encoded_query,
                'RsPerPage': '20',
            },
            headers={
                'Referer': search_url,
            },
        ) as response:
            response.raise_for_status()
            results_iframe = response.text


def main():
    etiquette = '禮學'
    search(query=etiquette)


if __name__ == '__main__':
    main()

Unfortunately, the design of this website is violently awful. State is passed around using a mix of query parameters, cookies, and server-only context that you can't see and relies on request history in a non-trivial way. So even though the above produces, to my knowledge, identical parameters, headers and cookies to those that you see in real life on the website, there's a failure where a couple of dynamically-generated <script> sections in brief.aspx are silently omitted. So I'm giving up on this recommendation.

Shifting gears:

The following recommendations are going to cover scope and class usage, and these should get you toward sanity:

  • The code in test.py needs to be moved into a function (see the sketch after this list).

  • Only test.py should have a shebang and none of your other files, since only test.py is a meaningful entry point.

  • Is Input.URL ever used? That probably needs to be deleted

  • Input.webpage should not be returning anything; driver is already a member on the class.

  • Input as a whole is suspect. It provides such a thin wrapper around driver as to be basically useless on its own. I would expect the driver.get() to be moved to MainPage.__init__.

  • InputLocators also does not deserve to be a class. Those constants can basically be distributed to the point of use, i.e.

    wait.until(
        EC.presence_of_element_located(
            (By.XPATH, '//GridTableContent/tbody'),
        )
    )
  • Your search_keyword is strange - you start off initializing it as a static, and then change to using it as an instance variable in submit_search. Why? Also, what is keyword? You would benefit from using PEP484 type hints.

  • switch_to_frame has timing issues and did not work for me at all until I added two waits:

    WebDriverWait(driver, 100).until(
        lambda driver: driver.find_element(
            By.XPATH,
            '//iframe[@name="iframeResult"]',
        ))
    driver.switch_to.frame('iframeResult')
    WebDriverWait(driver, 100).until(
        lambda driver: driver.find_element(
            By.XPATH,
            '//table[@class="GridTableContent"]',
        ))
    
  • Your () at the end of base-less classes can be dropped

  • OutputLocators.CNKI is a dictionary. Why? get_single_element indexes into it, but get_single_element is itself never called.
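
As a rough sketch of the first point above, and using only the statements already in the posted test.py, the top-level code could be wrapped in a main() guarded by if __name__ == '__main__':

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from query import Input
import page


def main():
    cnki = Input()
    driver = cnki.webpage('http://big5.oversea.cnki.net/kns55/')

    current_page = page.MainPage(driver)
    current_page.submit_search('禮學')
    current_page.switch_to_frame()

    result = page.SearchResults(driver)
    structured = result.get_structured_elements('titles')
    simple = result.simple_get_structured_elements()


if __name__ == '__main__':
    main()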

This code:

    elements = []
    for item in target_data:
        elements.append(item.text)

    return elements
 

can be replaced with a generator:

    for item in target_data:
        yield item.text
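
With the generator version, any caller that really needs a list materializes it at the call site; a hypothetical usage, reusing the result object from the posted test.py:

    titles = list(result.get_elements('titles'))  # consume the generator into a concrete list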

This code:

 i = None # get_structured_element counter

does nothing since all local variables are discarded at end of scope.
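
If per-object state was the intent, it would have to be written as an attribute; a hypothetical version of that __init__:

    def __init__(self, driver):
        self.driver = driver
        self.i = None  # instance attribute, so it survives past __init__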

This code:

    if 'i' in locals():
        string = front_half + str(i) + back_half
    else:
        string = front_half + "SPLIT_i" + back_half

is never going to see its first branch evaluated, since i is not defined locally. I really don't know what you intended here.

These long xpath tree traversals, such as

'//*[@id="Form1"]/table/tbody/tr[2]/td/table/tbody/tr/td[1]/table/tbody/tr/td/a[2]'

are both fragile and difficult to read. In most cases you should be able to condense them by a mix of inner // to omit parts of the path, and judicious references to known attributes.
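
For instance (hypothetical rewrites, not verified against the live page), an inner // can skip the intermediate table wrappers, or the locator can be anchored on a known attribute such as the GridTableContent class used elsewhere on the page:

    # Illustrations only -- check them in the browser's DevTools before relying on them.
    '//*[@id="Form1"]//tr/td[1]//a[2]'             # inner // skips the nested table levels
    '//table[@class="GridTableContent"]//td[2]/a'  # anchor on a known class attribute instead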

You specifically ask about

split_xpath_at_i is blind to variables in its immediate environment

If by "its immediate environment" you mean CNKI (etc) that's because its immediate environment, the class static scope, has not yet been initialized. CNKI can get a reference to it but not the opposite. If you want this to have some kind of state like a counter, then it needs to be promoted to an instance method with a self parameter. I don't know how g_s_elem factors into this because it's not defined anywhere.

You ask:

The SearchTextElement class with just one locator variable that is hardcoded - is that a good approach?

Not really. First of all, you've again conflated static and instance variables, because you first initialize a static variable to None and then write an instance variable after construction. Why construct a class at all, if it only holds one member and has no methods?

answered Jun 8, 2021 at 1:17

Comments:

  • I've tried the requests call method before, too. It got stuck after submitting the search query. Selenium provides driver.switch_to.frame('iframeResult') to get past that stage and provide access to the search result elements. I wonder if there is an equivalent in the requests world. What you are describing seems like a different issue altogether. – Commented Jun 8, 2021 at 1:30
  • I see that you have results_iframe = response.text in your sample code as well. – Commented Jun 8, 2021 at 1:42
  • I suspect the "violently awful" design is a deliberate feature to discourage scraping. – Commented Jun 8, 2021 at 1:45
  • There are better ways to discourage scraping - I honestly think this is just a product of bad design. For fun, read some of the JS code. It's legacy 12-year-old ASP with a big, crazy mishmash of iframes, AJAX, commented-out code blocks, and developer notes. Bonus points if you can find the comment a developer left confessing to a bad, bad hack. In short, never attribute to malice that which you can attribute to ignorance. – Commented Jun 8, 2021 at 1:50
  • @Sati Updating your answer is discouraged. However, posting a new question with your incorporated changes is more than welcome. – Commented Jun 8, 2021 at 4:27
