I want to scrape data from a website (https://nextgenftl.com/leagues/ftl-main-2022/game-weeks/week-30/players) that is rendered with JavaScript. I want to get all the players, and the badge, price, and price change of each player. How do I get all the data from the website after it has been rendered?

I'm trying to render the full page (including the scripts) before I scrape.

from requests_html import HTMLSession
from bs4 import BeautifulSoup

# Assign the URL, create the HTMLSession object,
# and run the "get" method to retrieve the page
week = 30
url = f'https://nextgenftl.com/leagues/ftl-main-2022/game-weeks/week-{week}/players'
session = HTMLSession()
response = session.get(url)

# Check that the response code was 200 (successfully retrieved the URL)
res_code = response.status_code
print(res_code)

if res_code == 200:
    # This is the critical line. The render method runs the script tags to produce the final HTML
    response.html.render()
    # Parse the rendered HTML (not response.content, which is the raw, un-rendered page)
    soup = BeautifulSoup(response.html.html, 'lxml')
    print(soup.prettify())
else:
    print("Could not reach web page!")

I couldn't use BeautifulSoup alone because the page source does not contain the body (the body is all rendered by JavaScript). I also looked through the network tab to see which APIs serve the data, but that didn't work either. I tried Selenium as well, but I still don't know how to scrape the data from the website.
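For reference, the network-tab idea I tried would look roughly like the sketch below; the endpoint URL here is only a placeholder, since I never found a real one that returned the data.

import requests

# Sketch of the network-tab approach: if the site exposed a JSON endpoint
# for the player data, it could be requested directly.
# NOTE: this URL is a hypothetical placeholder, not a real endpoint.
api_url = 'https://nextgenftl.com/api/players?week=30'
resp = requests.get(api_url)
if resp.status_code == 200:
    print(resp.json())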

asked Nov 19, 2022 at 5:01
  • What did you try with Selenium? Commented Nov 19, 2022 at 10:11

1 Answer


Here is one way to get that info with Selenium. It's not fast, but it's reliable and returns all 725 players. The Selenium setup here is chromedriver on Linux; you can adapt it to your own setup, just keep the imports and the code after the driver is defined.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time as t
import pandas as pd

pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

chrome_options = Options()
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument('disable-notifications')
chrome_options.add_argument("window-size=1280,720")

webdriver_service = Service("chromedriver/chromedriver")  ## path to where you saved chromedriver binary
driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)
wait = WebDriverWait(driver, 25)

url = 'https://nextgenftl.com/leagues/ftl-main-2022/game-weeks/week-30/players'
big_list = []
driver.get(url)

# Scroll through the infinite-scroll list repeatedly so every player gets loaded
for x in range(10):
    players = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//ion-list[not(@id="menu-list")]//ion-item')))
    for p in players:
        p.location_once_scrolled_into_view
    wait.until(EC.presence_of_element_located((By.TAG_NAME, 'ion-infinite-scroll'))).location_once_scrolled_into_view
    t.sleep(1)

# Once everything is loaded, collect the data from each player row
players = wait.until(EC.presence_of_all_elements_located((By.XPATH, '//ion-list[not(@id="menu-list")]//ion-item')))
for p in players:
    try:
        p.location_once_scrolled_into_view
        badge = p.find_element(By.XPATH, './/ion-badge').text
        name = p.find_element(By.XPATH, './/ion-label').text
        current_price = p.find_element(By.XPATH, './/div[@title="Current Price"]').text
        price_change = p.find_element(By.XPATH, './/div[@title="Price Change"]').text
        average_points = p.find_element(By.XPATH, './/div[@title="3-Week Average Points"]').text
        events_played = p.find_element(By.XPATH, './/div[@title="Events Played"]').text
        big_list.append((badge, name, current_price, price_change, average_points, events_played))
    except Exception as e:
        print('error')
        continue

t.sleep(2)
print(len(big_list))
df = pd.DataFrame(big_list, columns=['badge', 'name', 'current_price', 'price_change', 'average_points', 'events_played'])
print(df)
df.to_csv('fantasy_tennis.csv')

This will display the dataframe/table in the terminal, and also save it as a CSV:

725
     badge                 name current_price price_change average_points events_played
0      ATP       Novak Djokovic      $19.864m           --         116.97             7
1      ATP         Rafael Nadal      $19.295m      ↓ 1.137          53.92             9
2      WTA          Iga Swiatek      $17.835m      ↓ 0.074          72.70            13
3      WTA       Ashleigh Barty      $16.800m           --         169.50             1
4      ATP       Carlos Alcaraz      $15.587m      ↑ 0.494          74.14            14
..     ...                  ...           ...          ...            ...           ...
720    WTA    Dayana Yastremska       $1.450m      ↓ 0.068           3.75            14
721    WTA           Xiaodi You       $1.450m           --           3.77             1
722    WTA            Eleana Yu       $1.450m           --           2.90             1
723    WTA  Anastasia Zakharova       $1.450m           --           1.77             1
724    ATP           Kacper Zuk       $1.450m           --           4.16             1
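If you then want to work with the prices as numbers, a small post-processing sketch could look like this (the column names are the ones used above; the cleaning steps are just one way to do it):

import pandas as pd

# Post-processing sketch: load the saved CSV and convert price strings
# like "$19.864m" into floats (in millions).
df = pd.read_csv('fantasy_tennis.csv', index_col=0)
df['current_price_m'] = (
    df['current_price']
    .str.replace('$', '', regex=False)
    .str.rstrip('m')
    .astype(float)
)
print(df[['name', 'current_price_m']].head())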

See Selenium documentation at https://www.selenium.dev/documentation/
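As a side note, if you prefer not to have a browser window open while this runs, you can add headless mode to the same Chrome options. This is a minimal sketch assuming a reasonably recent Chrome/chromedriver; older versions use the plain "--headless" flag instead of "--headless=new":

from selenium.webdriver.chrome.options import Options

# Headless variant of the options used above
chrome_options = Options()
chrome_options.add_argument("--headless=new")  # assumption: newer Chrome; use "--headless" on older versions
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("disable-notifications")
chrome_options.add_argument("window-size=1280,720")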

answered Nov 19, 2022 at 10:32

2 Comments

Thank you so much. I see what I need to add: I should call time.sleep so the page is fully rendered.
You're welcome @Lucas. If my answer solved your issue, don't forget to mark it as accepted (green checkmark under voting buttons). And no, it's not only calling time.sleep. It's also about the elements being scrolled into view... just follow my code.
