I'm trying to get hold of the data under the columns with the code "SVENYXX", where "XX" are the numbers that follow (e.g. 01, 02, etc.), on the site http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html using Python. I am currently following the approach described at http://docs.python-guide.org/en/latest/scenarios/scrape/ . However, I don't know how to determine the divs for this page, so I am unable to proceed, and was hoping to get some help with this.
This is what I have so far:
    from lxml import html
    import requests

    page = requests.get('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html')
    tree = html.fromstring(page.text)
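As a next step from the tree above, one way forward is to query for tables rather than hunting for divs. A minimal sketch of the XPath calls involved, run here on a small stand-in snippet rather than the live page (the real page's table position and markup may differ):

```python
from lxml import html

# Stand-in HTML with the same shape as the Fed page's yield table:
sample = """
<html><body>
  <table>
    <tr><th>Date</th><th>SVENY01</th></tr>
    <tr><th scope="row">2015-06-05</th><td>0.3487</td></tr>
  </table>
</body></html>
"""

tree = html.fromstring(sample)
table = tree.xpath('//table')[0]                    # grab a table element directly
headers = table.xpath('.//tr')[0].xpath('./th/text()')  # header labels from the first row
values = table.xpath('.//td/text()')                # the data cells
```

Inspecting `headers` tells you which column positions carry the SVENY codes, which is the same idea the answer below implements with BeautifulSoup.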
Thank You
- What do you have so far? – Patrick Roberts, Jun 9, 2015 at 17:12
- @PatrickRoberts Sorry, just added that. – user131983, Jun 9, 2015 at 17:16
- Does it need to be Python? The page seems to be static, and if you simply copy/paste the table into a spreadsheet, you can easily extract the columns manually. That might be easier; processing HTML with XPath is not the easiest thing to conquer. – Gerard van Helden, Jun 9, 2015 at 17:22
- @GerardvanHelden Thank you. However, if the page is updated, can't I then simply re-download the data through my code? Is there an easier way to process HTML than XPath? – user131983, Jun 9, 2015 at 17:31
- If it is indeed dynamic then you do need some kind of scripting :) BeautifulSoup would have been my next recommendation. – Gerard van Helden, Jun 12, 2015 at 18:23
1 Answer
Have you tried using BeautifulSoup? I'm a pretty big fan. Using that you can easily iterate through all of the info you want, searching by tag.
Here's something I threw together that prints out the values in each column you're looking at. Not sure what you want to do with the data, but hopefully it helps.
    from bs4 import BeautifulSoup
    from urllib import request

    page = request.urlopen('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html').read()
    soup = BeautifulSoup(page, 'html.parser')
    desired_table = soup.findAll('table')[2]

    # Find the columns you want data from
    headers = desired_table.findAll('th')
    desired_columns = []
    for th in headers:
        if 'SVENY' in th.string:
            desired_columns.append(headers.index(th))

    # Iterate through each row, grabbing the data from the desired columns
    rows = desired_table.findAll('tr')
    for row in rows[1:]:
        cells = row.findAll('td')
        for column in desired_columns:
            print(cells[column].text)
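The snippet above uses Python 3's urllib.request; on Python 2 the equivalent call lives in urllib2 instead. If you need the fetch to work under either version, a small compatibility sketch:

```python
# Pick whichever urlopen exists: urllib.request on Python 3,
# urllib2 on Python 2. The call itself is identical afterwards.
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen         # Python 2

url = 'http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html'
# page = urlopen(url).read()  # same call under either version
```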
In response to your second request:
    from bs4 import BeautifulSoup
    from urllib import request

    page = request.urlopen('http://www.federalreserve.gov/econresdata/researchdata/feds200628_1.html').read()
    soup = BeautifulSoup(page, 'html.parser')
    desired_table = soup.findAll('table')[2]
    data = {}

    # Find the columns you want data from
    headers = desired_table.findAll('th')
    for th in headers:
        if 'SVENY' in th.string:
            data[th.string] = {'column': headers.index(th), 'data': []}

    # Iterate through each row, grabbing the date and the data
    # from the desired columns
    rows = desired_table.findAll('tr')
    for row in rows[1:]:
        date = row.findAll('th')[0].text
        cells = row.findAll('td')
        for header, info in data.items():
            column_number = info['column']
            info['data'].append([date, cells[column_number].text])
This returns a dictionary where each key is the header of a column, and each value is another dictionary holding 1) the index of that column on the site, and 2) the actual data you want, as a list of [date, value] lists.
As an example:
    for year_number in data['SVENY01']['data']:
        print(year_number)

    ['2015-06-05', '0.3487']
    ['2015-06-04', '0.3124']
    ['2015-06-03', '0.3238']
    ['2015-06-02', '0.3040']
    ['2015-06-01', '0.3009']
    ['2015-05-29', '0.2957']

etc.
You can fiddle around with this to get the info how and where you want it, but hopefully this is helpful.
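For instance, if you later want the scraped values in a file, the dictionary above maps directly onto CSV rows. A sketch using the standard csv module, with a small stand-in dict (same shape, two sample rows) in place of the scraped data:

```python
import csv
import io

# Stand-in for the dict the scraper builds above:
data = {'SVENY01': {'column': 1,
                    'data': [['2015-06-05', '0.3487'],
                             ['2015-06-04', '0.3124']]}}

buf = io.StringIO()                        # swap in open('sveny01.csv', 'w', newline='') to write a file
writer = csv.writer(buf)
writer.writerow(['date', 'SVENY01'])       # header row
writer.writerows(data['SVENY01']['data'])  # one row per [date, value] pair
csv_text = buf.getvalue()
```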
4 Comments

- I get an IndexError: list index out of range on the line date = row.findAll('th')[0].text. I am using Python 2.7 though, and am hence using import urllib2; content = urllib2.urlopen(url).read(); soup = BeautifulSoup(content) instead of using request. Could this be the issue?
- What the date line is doing is, for the row of the table it's on, pulling the data from the th tag, like this one on the site: <th scope="row">2015-06-05</th>. The index out of range error implies to me that row.findAll('th') is returning an empty list, which is strange. Does this occur on the first iteration?