I'm trying to get hold of the data under the columns with the code "SVENYXX", where "XX" are the numbers that follow (e.g. 01, 02, etc.), on the site http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html using Python. I am currently following the approach described at http://docs.python-guide.org/en/latest/scenarios/scrape/ . However, I don't know how to determine the divs for this page, so I am unable to proceed, and was hoping to get some help with this.
This is what I have so far:
    from lxml import html
    import requests

    page = requests.get('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html')
    tree = html.fromstring(page.text)
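As a next step from the tree above, one way forward is to query for tables rather than hunting for divs. A minimal sketch of the XPath calls involved, run here on a small stand-in snippet rather than the live page (the real page's table position and markup may differ):

```python
from lxml import html

# Stand-in HTML with the same shape as the Fed page's yield table:
sample = """
<html><body>
  <table>
    <tr><th>Date</th><th>SVENY01</th></tr>
    <tr><th scope="row">2015-06-05</th><td>0.3487</td></tr>
  </table>
</body></html>
"""

tree = html.fromstring(sample)
table = tree.xpath('//table')[0]                    # grab a table element directly
headers = table.xpath('.//tr')[0].xpath('./th/text()')  # header labels from the first row
values = table.xpath('.//td/text()')                # the data cells
```

Inspecting `headers` tells you which column positions carry the SVENY codes, which is the same idea the answer below implements with BeautifulSoup.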
Thank You
- What do you have so far? – Patrick Roberts, Jun 9, 2015 at 17:12
- @PatrickRoberts Sorry, just added that. – user131983, Jun 9, 2015 at 17:16
- Does it need to be Python? The page seems to be static, and if you simply copy/paste the table into a spreadsheet, you can easily extract the columns manually. That might be easier; processing HTML with XPath is not the easiest thing to conquer. – Gerard van Helden, Jun 9, 2015 at 17:22
- @GerardvanHelden Thank you. However, if the page is updated, can't I then simply re-download the data through my code? Is there an easier way to process HTML than XPath? – user131983, Jun 9, 2015 at 17:31
- If it is indeed dynamic then you do need some kind of scripting :) BeautifulSoup would have been my next recommendation. – Gerard van Helden, Jun 12, 2015 at 18:23
1 Answer
Have you tried using BeautifulSoup? I'm a pretty big fan. Using that you can easily iterate through all of the info you want, searching by tag.
Here's something I threw together that prints out the values in each column you're looking at. Not sure what you want to do with the data, but hopefully it helps.
    from bs4 import BeautifulSoup
    from urllib import request

    page = request.urlopen('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html').read()
    soup = BeautifulSoup(page, 'html.parser')
    desired_table = soup.findAll('table')[2]

    # Find the columns you want data from
    headers = desired_table.findAll('th')
    desired_columns = []
    for th in headers:
        if 'SVENY' in th.string:
            desired_columns.append(headers.index(th))

    # Iterate through each row, grabbing the data from the desired columns
    rows = desired_table.findAll('tr')
    for row in rows[1:]:
        cells = row.findAll('td')
        for column in desired_columns:
            print(cells[column].text)
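The snippet above uses Python 3's urllib.request; on Python 2 the equivalent call lives in urllib2 instead. If you need the fetch to work under either version, a small compatibility sketch:

```python
# Pick whichever urlopen exists: urllib.request on Python 3,
# urllib2 on Python 2. The call itself is identical afterwards.
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen         # Python 2

url = 'http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html'
# page = urlopen(url).read()  # same call under either version
```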
In response to your second request:
    from bs4 import BeautifulSoup
    from urllib import request

    page = request.urlopen('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html').read()
    soup = BeautifulSoup(page, 'html.parser')
    desired_table = soup.findAll('table')[2]
    data = {}

    # Find the columns you want data from
    headers = desired_table.findAll('th')
    for th in headers:
        if 'SVENY' in th.string:
            data[th.string] = {'column': headers.index(th), 'data': []}

    # Iterate through each row, grabbing the date and the data
    # from the desired columns
    rows = desired_table.findAll('tr')
    for row in rows[1:]:
        date = row.findAll('th')[0].text
        cells = row.findAll('td')
        for header, info in data.items():
            column_number = info['column']
            info['data'].append([date, cells[column_number].text])
This returns a dictionary where each key is the header of a column, and each value is another dictionary holding 1) the index of that column on the site, and 2) the actual data you want, as a list of [date, value] lists.
As an example:
    for year_number in data['SVENY01']['data']:
        print(year_number)

    ['2015-06-05', '0.3487']
    ['2015-06-04', '0.3124']
    ['2015-06-03', '0.3238']
    ['2015-06-02', '0.3040']
    ['2015-06-01', '0.3009']
    ['2015-05-29', '0.2957']

etc.
You can fiddle around with this to get the info how and where you want it, but hopefully this is helpful.
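For instance, if you later want the scraped values in a file, the dictionary above maps directly onto CSV rows. A sketch using the standard csv module, with a small stand-in dict (same shape, two sample rows) in place of the scraped data:

```python
import csv
import io

# Stand-in for the dict the scraper builds above:
data = {'SVENY01': {'column': 1,
                    'data': [['2015-06-05', '0.3487'],
                             ['2015-06-04', '0.3124']]}}

buf = io.StringIO()                        # swap in open('sveny01.csv', 'w', newline='') to write a file
writer = csv.writer(buf)
writer.writerow(['date', 'SVENY01'])       # header row
writer.writerows(data['SVENY01']['data'])  # one row per [date, value] pair
csv_text = buf.getvalue()
```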
4 Comments

- I get an IndexError: list index out of range on the line date = row.findAll('th')[0].text. I am using Python 2.7 though, and am hence using import urllib2; content = urllib2.urlopen(url).read(); soup = BeautifulSoup(content) instead of using request. Could this be the issue?
- What the date line is doing is, for the row of the table it's on, pulling the data from the th tag, like this one on the site: <th scope="row">2015-06-05</th>. The index out of range error implies to me that row.findAll('th') is returning an empty list, which is strange. Does this occur on the first iteration?