I am trying to extract a table from a web page. Below are the HTML and the Python code, which uses BeautifulSoup. This approach has always worked for me, but in this case I get blank output. Thanks in advance.
<table>
<thead>
<tr>
<th>Period Ending:</th>
<th class="TalignL">Trend</th>
<th>9/27/2014</th>
<th>9/28/2013</th>
<th>9/29/2012</th>
<th>9/24/2011</th>
</tr>
</thead>
<tr>
<th bgcolor="#E6E6E6">Total Revenue</th>
<td class="td_genTable"><table border="0" align="center" width="*" cellspacing="0" cellpadding="0"><tr><td align="bottom"><table border="0" height="100%" cellspacing="0" cellpadding="0"><tr><td><table cellspacing="0" cellpadding="0" border="0"><tr><td height="15" bgcolor="#47C3D3" width="6"></td><td height="15" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="1" colspan="2" bgcolor="#D1D1D1"></td></tr></table></td><td><table cellspacing="0" cellpadding="0" border="0"><tr><td height="1" bgcolor="#FFFFFF" width="6"></td><td height="1" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="14" bgcolor="#47C3D3" width="6"></td><td height="14" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="1" colspan="2" bgcolor="#D1D1D1"></td></tr></table></td><td><table cellspacing="0" cellpadding="0" border="0"><tr><td height="2" bgcolor="#FFFFFF" width="6"></td><td height="2" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="13" bgcolor="#47C3D3" width="6"></td><td height="13" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="1" colspan="2" bgcolor="#D1D1D1"></td></tr></table></td><td><table cellspacing="0" cellpadding="0" border="0"><tr><td height="7" bgcolor="#FFFFFF" width="6"></td><td height="7" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="8" bgcolor="#47C3D3" width="6"></td><td height="8" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="1" colspan="1" bgcolor="#D1D1D1"></td></tr></table></td></tr></table></td></tr></table></td>
<td>$182,795,000</td>
<td>$170,910,000</td>
<td>$156,508,000</td>
<td>$108,249,000</td>
rows = table.findAll('tr')
for row in rows:
    cols = row.findAll('td')
    col1 = [ele.text.strip().replace(',','') for ele in cols]
    account = col1[0:1]
    period1 = col1[2:3]
    period2 = col1[3:4]
    period3 = col1[4:5]
    record = (stock, account,period1,period3,period3)
    print record
2 Answers
Adding to what @abarnert pointed out. I would get all the td elements with text starting with $:
for row in soup.table.find_all('tr', recursive=False):
    record = [td.text.replace(",", "") for td in row.find_all("td", text=lambda x: x and x.startswith("$"))]
    print record
For the input you've provided, it prints:
[u'$182795000', u'$170910000', u'$156508000', u'$108249000']
which you can "unpack" into separate variables:
account, period1, period2, period3 = record
Note that I'm explicitly passing recursive=False to avoid going deeper in the tree and get only direct tr children of the table element.
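For readers on Python 3 and a recent Beautiful Soup 4, the same idea can be sketched as below. This is only an illustration under a few assumptions: the markup is a hypothetical cut-down version of the table in the question, is_dollar is a name made up here, and string= is used because it is the newer spelling of the text= argument in current BS4 releases.

```python
from bs4 import BeautifulSoup

# Hypothetical cut-down markup: one outer row whose first cell
# holds a nested layout table, followed by two dollar cells.
html = (
    "<table>"
    "<tr>"
    "<th>Total Revenue</th>"
    "<td><table><tr><td></td></tr><tr><td></td></tr></table></td>"
    "<td>$182,795,000</td>"
    "<td>$170,910,000</td>"
    "</tr>"
    "</table>"
)

soup = BeautifulSoup(html, "html.parser")


def is_dollar(s):
    # s is the cell's .string; it is None for cells without a single
    # text child, such as the chart cell and the empty nested cells.
    return s is not None and s.startswith("$")


records = []
# recursive=False keeps the loop on the outer table's own rows.
for row in soup.table.find_all("tr", recursive=False):
    record = [td.get_text().replace(",", "")
              for td in row.find_all("td", string=is_dollar)]
    records.append(record)
    print(record)
```

Because the filter only accepts cells whose text starts with "$", the inner find_all can stay recursive here without picking up the empty cells of the nested table.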
Comments
Your first problem is that find_all (or findAll, which is just a deprecated synonym for the same thing) doesn't just find the rows in the table, it finds the rows in the table and in every subtable within it. You almost certainly don't want to iterate over both kinds of rows and run the same code on each one. If you don't want that, as the recursive argument docs say, pass recursive=False.
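The difference is easy to see by counting rows both ways. The snippet below is a rough sketch on a hypothetical cut-down version of the question's markup (one outer row, with a nested table of two rows inside its first cell), parsed with the built-in html.parser; it is not the asker's real page.

```python
from bs4 import BeautifulSoup

# Hypothetical cut-down markup, not the full page from the question.
html = (
    "<table>"
    "<tr>"
    "<th>Total Revenue</th>"
    "<td><table><tr><td></td></tr><tr><td></td></tr></table></td>"
    "<td>$182,795,000</td>"
    "<td>$170,910,000</td>"
    "</tr>"
    "</table>"
)

soup = BeautifulSoup(html, "html.parser")
outer = soup.table  # first <table> in document order, i.e. the outer one

# Default search: every descendant <tr>, including nested-table rows.
all_rows = outer.find_all("tr")

# recursive=False: only <tr> elements that are direct children.
direct_rows = outer.find_all("tr", recursive=False)

print(len(all_rows))     # outer row plus the two nested rows
print(len(direct_rows))  # just the outer row
```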
So, now you get back only one row. If you do row.find_all('td'), that's going to have the same problem again—you're going to find all of the columns of this row, and all of the columns of every row in every subtable within one of those columns. Again, that's not what you want, so use recursive=False.
And now you get back only 5 columns. The first one is just a big table with a bunch of empty cells in it; the others, on the other hand, have dollar values in them, which seem to be the ones you want.
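The same comparison at the cell level might look like the following sketch. It again uses a hypothetical cut-down row, so it yields three direct cells rather than the five in the question's data, but the pattern is the same: an empty chart cell first, then the dollar values.

```python
from bs4 import BeautifulSoup

# Hypothetical cut-down markup, not the full page from the question.
html = (
    "<table>"
    "<tr>"
    "<th>Total Revenue</th>"
    "<td><table><tr><td></td></tr><tr><td></td></tr></table></td>"
    "<td>$182,795,000</td>"
    "<td>$170,910,000</td>"
    "</tr>"
    "</table>"
)

soup = BeautifulSoup(html, "html.parser")
row = soup.table.find_all("tr", recursive=False)[0]

# Default search also returns the <td> cells of the nested table.
all_cells = row.find_all("td")

# recursive=False returns only this row's own cells.
direct_cells = row.find_all("td", recursive=False)

texts = [td.get_text(strip=True) for td in direct_cells]

print(len(all_cells), len(direct_cells))
print(texts)  # first entry is the empty chart cell
```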
So, just adding recursive=False to both calls, and setting stock to something (I don't know where it's supposed to come from in your code, but without it you're obviously going to just get a NameError):
stock = 'spam'
rows = table.find_all('tr', recursive=False)
for row in rows:
    cols = row.findAll('td', recursive=False)
    col1 = [ele.text.strip().replace(',','') for ele in cols]
    account = col1[0:1]
    period1 = col1[2:3]
    period2 = col1[3:4]
    period3 = col1[4:5]
    record = (stock, account,period1,period3,period3)
    print record
This will print:
('spam', [''], ['$170910000'], ['$108249000'], ['$108249000'])
I'm not sure why you used period3 twice and never used period2, why you skipped over column 1 entirely, or why you sliced 1-element lists instead of just indexing the values, but anyway, this seems to be what you were trying to do.
As a side note, if you actually want to break out the list into 5 values, rather than into 4 1-element lists skipping one of the values, you can write:
account, whatever, period1, period2, period3 = col1
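As a small illustration of that unpacking, here is a sketch that starts from the five cell texts of the "Total Revenue" row (copied from the question's data) and also turns the dollar strings into integers. The variable names and the to_int helper are made up for this example; adjust them to whatever the columns mean in your data.

```python
# Cell texts for the "Total Revenue" row, taken from the question's data.
# The first entry is the empty chart cell.
col1 = ["", "$182795000", "$170910000", "$156508000", "$108249000"]

# Unpack all five cells in one statement instead of slicing.
trend, period1, period2, period3, period4 = col1


def to_int(amount):
    """Hypothetical helper: strip the leading '$' and parse the digits."""
    return int(amount.lstrip("$"))


amounts = [to_int(p) for p in (period1, period2, period3, period4)]
print(amounts)
```

Unpacking raises a ValueError if the row does not have exactly five cells, which can be a useful early warning when the page layout changes.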
findAll? Are you learning from sample code written for BS3 instead of from updated samples or documentation for BS4?
find_all (or findAll) searches through all descendants, not just the top-level children. So, unless you want to iterate through both the rows of the outer table and the rows of every subtable embedded inside a column of that table and treat them the same, you shouldn't be using it here.