Extract HTML table using Python BeautifulSoup

Question 1

I am trying to extract a table from a web page. Below are the HTML and my Python code, which uses BeautifulSoup. This code has always worked for me, but in this case I get blank output. Thanks in advance.

<table>
<thead>
<tr>
<th>Period Ending:</th>
<th class="TalignL">Trend</th>
<th>9/27/2014</th>
<th>9/28/2013</th>
<th>9/29/2012</th>
<th>9/24/2011</th>
</tr>
</thead>
<tr>
<th bgcolor="#E6E6E6">Total Revenue</th>
<td class="td_genTable"><table border="0" align="center" width="*" cellspacing="0" cellpadding="0"><tr><td align="bottom"><table border="0" height="100%" cellspacing="0" cellpadding="0"><tr><td><table cellspacing="0" cellpadding="0" border="0"><tr><td height="15" bgcolor="#47C3D3" width="6"></td><td height="15" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="1" colspan="2" bgcolor="#D1D1D1"></td></tr></table></td><td><table cellspacing="0" cellpadding="0" border="0"><tr><td height="1" bgcolor="#FFFFFF" width="6"></td><td height="1" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="14" bgcolor="#47C3D3" width="6"></td><td height="14" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="1" colspan="2" bgcolor="#D1D1D1"></td></tr></table></td><td><table cellspacing="0" cellpadding="0" border="0"><tr><td height="2" bgcolor="#FFFFFF" width="6"></td><td height="2" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="13" bgcolor="#47C3D3" width="6"></td><td height="13" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="1" colspan="2" bgcolor="#D1D1D1"></td></tr></table></td><td><table cellspacing="0" cellpadding="0" border="0"><tr><td height="7" bgcolor="#FFFFFF" width="6"></td><td height="7" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="8" bgcolor="#47C3D3" width="6"></td><td height="8" bgcolor="#FFFFFF" width="1px"></td></tr><tr><td height="1" colspan="1" bgcolor="#D1D1D1"></td></tr></table></td></tr></table></td></tr></table></td>
<td>$182,795,000</td>
<td>$170,910,000</td>
<td>$156,508,000</td>
<td>$108,249,000</td>

rows = table.findAll('tr')
for row in rows:
    cols = row.findAll('td')
    col1 = [ele.text.strip().replace(',','') for ele in cols]
    account = col1[0:1]
    period1 = col1[2:3]
    period2 = col1[3:4]
    period3 = col1[4:5]
    record = (stock, account,period1,period3,period3)
    print record

Question 2

Your first column of your first non-header row contains a table full of empty cells with no text in them. Your code is correctly finding that no text. I'm not sure what you wanted it to do instead.

Question 3

Meanwhile, why are you using the deprecated name findAll? Are you learning from sample code written for BS3 instead of from updated samples or documentation for BS4?

Question 4

Finally, find_all (or findAll) searches through all descendants, not just the top-level children. So, unless you want to iterate through both the rows of the outer table and the rows of every subtable embedded inside a column of that table and treat them the same, you shouldn't be using it here.
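
To make the difference concrete, here is a minimal sketch on a made-up, trimmed-down table (the markup and the `outer` id are hypothetical, not the asker's page). It assumes Python 3, BS4, and the built-in `html.parser` builder; a builder such as html5lib inserts a `<tbody>`, in which case the rows would no longer be direct children of the table.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one outer row whose first cell holds a nested table.
HTML = (
    "<table id='outer'>"
    "<tr>"
    "<td><table><tr><td></td><td></td></tr></table></td>"
    "<td>$182,795,000</td>"
    "<td>$170,910,000</td>"
    "</tr>"
    "</table>"
)

soup = BeautifulSoup(HTML, "html.parser")
outer = soup.find("table", id="outer")

# Default search: every descendant <tr>, including the nested table's row.
all_rows = outer.find_all("tr")
# recursive=False: only <tr> tags that are direct children of the outer table.
direct_rows = outer.find_all("tr", recursive=False)

# The same distinction applies to the cells of the outer row.
all_cells = direct_rows[0].find_all("td")
direct_cells = direct_rows[0].find_all("td", recursive=False)

print(len(all_rows), len(direct_rows))    # 2 1
print(len(all_cells), len(direct_cells))  # 5 3
```

The counts differ because the nested table contributes one extra row and two extra cells to the default, descendant-wide search.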

Question 5

Adding to what @abarnert pointed out. I would get all the td elements with text starting with $:

for row in soup.table.find_all('tr', recursive=False):
    record = [td.text.replace(",", "") for td in row.find_all("td", text=lambda x: x and x.startswith("$"))]
    print record

For the input you've provided, it prints:

[u'$182795000', u'$170910000', u'$156508000', u'$108249000']

which you can "unpack" into separate variables:

account, period1, period2, period3 = record

Note that I'm explicitly passing recursive=False to avoid going deeper in the tree and get only direct tr children of the table element.
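
For anyone on Python 3, the same idea can be sketched as below. This is a variant, not the code above: it filters on each cell's text instead of passing a function to `find_all`, and it converts the amounts to integers. The `dollar_rows` helper and the shortened markup are made up for illustration, and the `html.parser` builder is assumed.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the question's markup (nested "Trend" table shortened).
HTML = """
<table>
<thead><tr><th>Period Ending:</th><th>Trend</th><th>9/27/2014</th></tr></thead>
<tr>
<th>Total Revenue</th>
<td><table><tr><td></td><td></td></tr></table></td>
<td>$182,795,000</td>
<td>$170,910,000</td>
<td>$156,508,000</td>
<td>$108,249,000</td>
</tr>
</table>
"""


def dollar_rows(table):
    """Return one list of ints per direct <tr> child that has dollar cells."""
    rows = []
    for row in table.find_all("tr", recursive=False):
        amounts = []
        for cell in row.find_all("td", recursive=False):
            text = cell.get_text(strip=True)
            if text.startswith("$"):
                # Drop the leading "$" and the thousands separators.
                amounts.append(int(text[1:].replace(",", "")))
        if amounts:
            rows.append(amounts)
    return rows


soup = BeautifulSoup(HTML, "html.parser")
records = dollar_rows(soup.table)
print(records)  # [[182795000, 170910000, 156508000, 108249000]]
```

Converting to `int` is optional; it is only there because the commas were already being stripped, which suggests the values are meant to be used as numbers.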

Question 6

Your first problem is that find_all (or findAll, which is just a deprecated synonym for the same thing) doesn't just find the rows in the table, it finds the rows in the table and in every subtable within it. You almost certainly don't want to iterate over both kinds of rows and run the same code on each one. If you don't want that, as the recursive argument docs say, pass recursive=False.

So, now you get back only one row. If you do row.find_all('td'), that's going to have the same problem again—you're going to find all of the columns of this row, and all of the columns of every row in every subtable within one of those columns. Again, that's not what you want, so use recursive=False.

And now you get back only 5 columns. The first one is just a big table with a bunch of empty cells in it; the others, on the other hand, have dollar values in them, which seem to be the ones you want.

So, just adding recursive=False to both calls, and setting stock to something (I don't know where it's supposed to come from in your code, but without it you're obviously going to just get a NameError):

stock = 'spam'
rows = table.find_all('tr', recursive=False)
for row in rows:
    cols = row.findAll('td', recursive=False)
    col1 = [ele.text.strip().replace(',','') for ele in cols]
    account = col1[0:1]
    period1 = col1[2:3]
    period2 = col1[3:4]
    period3 = col1[4:5]
    record = (stock, account,period1,period3,period3)
    print record

This will print:

('spam', [''], ['$170910000'], ['$108249000'], ['$108249000'])

I'm not sure why you used period3 twice and never used period2, why you skipped over column 1 entirely, or why you sliced 1-element lists instead of just indexing the values, but anyway, this seems to be what you were trying to do.

As a side note, if you actually want to break out the list into 5 values, rather than into 4 1-element lists skipping one of the values, you can write:

account, whatever, period1, period2, period3 = col1
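
A small sketch of how slicing, indexing, and unpacking differ, using a hand-written list in place of the parsed cells (the values are what the loop above should produce for the Total Revenue row; `short_row` is a made-up example of a row with too few cells). Note that a slice past the end of a list is silently empty, which may be one reason a short row can look blank instead of raising an error.

```python
# Cell texts for the "Total Revenue" row; the first entry is the empty "Trend" cell.
col1 = ["", "$182795000", "$170910000", "$156508000", "$108249000"]

sliced = col1[2:3]   # a 1-element list: ['$170910000']
indexed = col1[2]    # the string itself: '$170910000'

# A slice past the end of a short row is just an empty list ...
short_row = [""]
missing = short_row[2:3]   # []

# ... whereas indexing the same position fails loudly.
try:
    short_row[2]
    failed = False
except IndexError:
    failed = True

# Unpacking names all five values at once (and raises ValueError on a short row).
account, whatever, period1, period2, period3 = col1

print(sliced, indexed, missing, failed)
```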

Question 7

@alecxe: To write books, you have to be able to edit things down. My chapter 1 would be 1300 pages. (That might work for a novel, but Tristram Shandy was already written 250 years ago...)

Question 8

This worked!! Thank you. You are right. I am using stock as variable to loop through multiple stocks. Thanks again.

alecxe · Accepted Answer · 2015-05-10 17:32:25Z

