unable to parse html table with Beautiful Soup

Question 1

I am very new to Beautiful Soup, and I'm trying to import the table at the URL below into a pandas DataFrame. The result has the correct column names but no rows of data. What should I be doing instead?

Here is my code:

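from bs4 import BeautifulSoup
import pandas as pd  # was missing: pd.read_html is used below
import requests

def get_tables(html):
    # Parse the page and collect every <table> element
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find_all('table')
    # Let pandas parse the table markup and return the first table found
    return pd.read_html(str(table))[0]

url = 'https://www.cmegroup.com/trading/interest-rates/stir/eurodollar.html'
html = requests.get(url).content
get_tables(html)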

Question 2

Can you share the output you get when you run the current code, along with the output you want? That will help us give you some tips.

Question 3

The data you see in the table is loaded from another URL via JavaScript. You can use this example to save the data to csv:

import json
import requests 
import pandas as pd
data = requests.get('https://www.cmegroup.com/CmeWS/mvc/Quotes/Future/1/G').json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
df = pd.json_normalize(data['quotes'])
df.to_csv('data.csv')

This saves data.csv (the original post showed a screenshot of the file opened in LibreOffice).
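To see what pd.json_normalize does with a payload like this, here is a minimal offline sketch. The sample dictionary and its field names (code, last, priceChart) are made up for illustration and are not the real CME response schema; only the top-level 'quotes' key is taken from the answer above.

```python
import pandas as pd

# Hypothetical payload: field names are illustrative, not the real CME schema.
sample = {
    "quotes": [
        {"code": "AAA", "last": "99.5", "priceChart": {"enabled": True}},
        {"code": "BBB", "last": "99.4", "priceChart": {"enabled": False}},
    ]
}

# Each dict in the list becomes one row; nested dicts are flattened
# into dotted column names such as "priceChart.enabled".
df = pd.json_normalize(sample["quotes"])
columns = list(df.columns)

# index=False leaves the 0, 1, 2, ... row index out of the CSV text.
csv_text = df.to_csv(index=False)
print(columns)
print(csv_text)
```

Passing index=False is optional; the answer's df.to_csv('data.csv') also works and simply adds the row index as an extra first column.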

Question 4

How were you able to find cmegroup.com/CmeWS/mvc/Quotes/Future/1/G?

Question 5

@Jojo I looked in the Firefox developer tools, under the Network tab (Chrome has something similar). It lists every request the page makes, and one of those requests was for this JSON file.

Question 6

The website you're trying to scrape renders the table values dynamically, so requests.get only returns the HTML the server sends before any JavaScript runs. You will have to find another way to access the data, or render the webpage's JS (see this example).
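A quick way to check this for yourself is to count the data rows in the raw HTML. The sketch below uses only the standard library; the two HTML strings are hypothetical stand-ins for the server response and for the markup after a browser has run the scripts, and RowCounter and count_data_rows are illustrative names, not part of any library.

```python
from html.parser import HTMLParser


class RowCounter(HTMLParser):
    """Count <tr> elements that contain at least one <td> cell."""

    def __init__(self):
        super().__init__()
        self.data_rows = 0
        self._in_row = False
        self._row_has_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._in_row = True
            self._row_has_cell = False
        elif tag == "td" and self._in_row:
            self._row_has_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._in_row:
            if self._row_has_cell:
                self.data_rows += 1
            self._in_row = False


def count_data_rows(html):
    parser = RowCounter()
    parser.feed(html)
    return parser.data_rows


# Hypothetical server response: header present, body left empty for JavaScript.
raw_html = """
<table>
  <thead><tr><th>Month</th><th>Last</th></tr></thead>
  <tbody></tbody>
</table>
"""

# Hypothetical markup after a browser has run the page's scripts.
rendered_html = raw_html.replace(
    "<tbody></tbody>",
    "<tbody><tr><td>DEC</td><td>99.5</td></tr></tbody>",
)

print(count_data_rows(raw_html))       # header only, no data rows
print(count_data_rows(rendered_html))  # one data row after rendering
```

If the count for the real response is zero while the browser shows a full table, the rows are being filled in by JavaScript and the page has to be rendered first.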

A common way of doing this is to use selenium to automate a browser, which renders the JavaScript so you can get the resulting page source.

Here is a quick example:

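import time
from io import StringIO
import pandas as pd
from selenium.webdriver import Chrome
from selenium.webdriver.chrome.service import Service

# Request the dynamically loaded page source
# (Selenium 4 takes the driver path via Service, not as a positional argument)
c = Chrome(service=Service(r'/path/to/webdriver.exe'))
c.get('https://www.cmegroup.com/trading/interest-rates/stir/eurodollar.html')
# Wait for it to render in the browser
time.sleep(5)
html_data = c.page_source
c.quit()  # close the browser once the source is captured
# Load into pd.DataFrame (StringIO avoids the literal-HTML deprecation in recent pandas)
tables = pd.read_html(StringIO(html_data))
df = tables[0]
df.columns = df.columns.droplevel()  # Convert the MultiIndex to an Index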

Note that I didn't use BeautifulSoup; you can pass the HTML directly to pd.read_html. You'll have to do some more cleaning from there, but that's the gist.
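As an illustration of the droplevel() cleaning step, here is a small sketch on a hand-built DataFrame. The column labels ("Quotes", "Month", "Last") and the row values are invented for the example; the real table's headers may differ.

```python
import pandas as pd

# Hypothetical two-level header, like the one read_html produces
# when a table has two header rows.
columns = pd.MultiIndex.from_tuples([("Quotes", "Month"), ("Quotes", "Last")])
df = pd.DataFrame([["DEC", 99.5]], columns=columns)

# droplevel() with no argument removes the outermost level (level 0),
# leaving a flat Index of the inner labels.
df.columns = df.columns.droplevel()
flat = list(df.columns)
print(flat)
```

After this, columns can be selected with plain names such as df["Last"] instead of tuples.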

Alternatively, you can take a peek at requests-html, a library that offers JavaScript rendering and might help, or look for a way to access the data as JSON or .csv from elsewhere and use that.

Andrej Kesely · Accepted Answer (the JSON endpoint approach) · 2020-10-04 19:15:21Z

