I am very new to Beautiful Soup and I'm trying to import the data from the URL below into a pandas DataFrame. The final result has the correct column names, but the rows contain no numbers. What should I be doing instead?
Here is my code:
from bs4 import BeautifulSoup
import pandas as pd
import requests

def get_tables(html):
    soup = BeautifulSoup(html, 'html.parser')
    table = soup.find_all('table')
    return pd.read_html(str(table))[0]

url = 'https://www.cmegroup.com/trading/interest-rates/stir/eurodollar.html'
html = requests.get(url).content
get_tables(html)
Can you show the output you get when you run the current code, and also share what your desired output should be? That will help us give you some tips. – Joe Ferndz, Oct 4, 2020 at 19:08
2 Answers
The data you see in the table is loaded from another URL via JavaScript. You can use this example to save the data to csv:
import json
import requests
import pandas as pd
data = requests.get('https://www.cmegroup.com/CmeWS/mvc/Quotes/Future/1/G').json()
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
df = pd.json_normalize(data['quotes'])
df.to_csv('data.csv')
This saves the data to data.csv, which can be opened in a spreadsheet program such as LibreOffice.
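To make the pd.json_normalize step easier to follow without hitting the network, here is a minimal sketch on a hand-written payload. The field names and values in `sample` are made up for illustration and are not taken from the real endpoint; the only thing assumed from the answer is that the response holds a list of records under the 'quotes' key. It also shows one possible cleanup step, since values that arrive as strings have to be converted before they behave as numbers.

```python
import pandas as pd

# Hypothetical payload: these field names and values are illustrative only,
# not the actual schema returned by the CME endpoint.
sample = {
    "quotes": [
        {"expirationMonth": "DEC 2020", "last": "99.750", "volume": "1,234",
         "priceChart": {"code": "GEZ0"}},
        {"expirationMonth": "MAR 2021", "last": "99.800", "volume": "987",
         "priceChart": {"code": "GEH1"}},
    ]
}

# One row per record; nested dicts become dotted column names
# such as "priceChart.code".
df = pd.json_normalize(sample["quotes"])

# Convert string values to numbers. Thousands separators are stripped first;
# anything that still fails to parse becomes NaN instead of raising.
df["last"] = pd.to_numeric(df["last"], errors="coerce")
df["volume"] = pd.to_numeric(
    df["volume"].str.replace(",", "", regex=False), errors="coerce"
)

print(df.dtypes)
```

With the real response you would inspect df.columns first and apply the same conversion only to the columns you actually need.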
The website you're trying to scrape renders the table values dynamically, so requests.get only returns the HTML the server sends before any JavaScript has run. That is why you get the column headers but no row values.
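A quick way to check whether a page falls into this category is to count the header and data cells in the raw HTML before involving pandas at all. The sketch below uses only the standard library; `CellCounter`, `count_cells`, and the `static_html` string are hypothetical names and sample data standing in for the response body of requests.get, not the actual markup of the CME page.

```python
from html.parser import HTMLParser


class CellCounter(HTMLParser):
    """Counts <th> and <td> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.header_cells = 0
        self.data_cells = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == "th":
            self.header_cells += 1
        elif tag == "td":
            self.data_cells += 1


def count_cells(html):
    """Return (number of header cells, number of data cells)."""
    parser = CellCounter()
    parser.feed(html)
    return parser.header_cells, parser.data_cells


# Hypothetical stand-in for server-sent HTML: headers are present,
# but the body is empty because a script would fill it in later.
static_html = """
<table>
  <thead><tr><th>Month</th><th>Last</th></tr></thead>
  <tbody></tbody>
</table>
"""

print(count_cells(static_html))  # (2, 0)
```

Headers with zero data cells is a strong hint that the rows are filled in client-side, in which case the browser's network tab is the place to look for the request that actually carries the data.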
You will have to either find an alternative way of accessing the data or render the page's JavaScript (see this example).
A common way of doing the latter is to use Selenium to automate a browser, which renders the JavaScript so that you can read the resulting page source.
Here is a quick example:
import time
import pandas as pd
from selenium.webdriver import Chrome
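# Request the dynamically loaded page source.
# Note: passing the driver path positionally worked in Selenium 3;
# newer Selenium 4 releases expect Chrome(service=Service(path)) instead.
c = Chrome(r'/path/to/webdriver.exe')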
c.get('https://www.cmegroup.com/trading/interest-rates/stir/eurodollar.html')
# Wait for it to render in the browser
time.sleep(5)
html_data = c.page_source
# Load into a pd.DataFrame
tables = pd.read_html(html_data)
df = tables[0]
df.columns = df.columns.droplevel()  # Convert the MultiIndex to an Index
Note that I didn't use BeautifulSoup: you can pass the HTML directly to pd.read_html. You'll have to do some more cleaning from there, but that's the gist.
Alternatively, you can take a peek at requests-html, a library that offers JavaScript rendering and might be able to help, or look for a way to access the data as JSON or .csv from elsewhere and use that instead.