How to parse an HTML file with a table using Python

Question 1

I have an HTML file containing a table (it's a large one, so only a sample is given). I want to retrieve the values in the table. I tried the HTMLParser library from Python.

I started coding as shown below. Then I found that the attribute name "class" is the same as a Python reserved keyword, so it gives me an error.

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            for class in attrs:
                if class == 'Table_row'
p = MyHTMLParser()
p.feed(ht)

HTML code for table

<table class="Table_rows" cellspacing="0" rules="all" border="1" id="MyDataGrid" style="width:700px;border-collapse:collapse;">
 <tr class="Table_Heading">
 <td>STATION CODE</td><td>STATION NAME</td><td>SCHEDULED ARRIVAL</td><td>SCHEDULED DEPARTURE</td><td>ACTUAL/ EXPECTED ARRIVAL</td><td>ACTUAL/ EXPECTED DEPARTURE</td>
 </tr><tr class="Table_row">
 <td>TVC </td><td style="width:160px;">ORIGON</td><td>Starting Station </td><td>05:00, 07 May 2011</td><td>Starting Station</td><td>05:00, 07 May 2011</td>
 </tr><tr class="alternat_table_row">
 <td>TVP </td><td>NEY YORK</td><td>05:04, 07 May 2011</td><td>05:05, 07 May 2011</td><td>05:04, 07 May 2011</td><td>05:05, 07 May 2011</td>
</tr> 
</table>

UPDATE

How could I get data between the tags?

Question 2

I wrote a small and simple HTML table parser not requiring any external module: github.com/schmijos/html-table-parser-python3/blob/master/…

Question 3

Note that the documentation of the handle_starttag method states:

The tag argument is the name of the tag converted to lower case. The attrs argument is a list of (name, value) pairs containing the attributes found inside the tag’s <> brackets.

So, you're probably looking for something like:

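from html.parser import HTMLParser  # Python 3; in Python 2 the module was named HTMLParser

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, so unpack each pair
        if tag == 'tr':
            for name, value in attrs:
                if name == 'class':
                    print('Found class', value)

p = MyHTMLParser()
p.feed(ht)  # ht is the HTML string from the question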

Prints:

Found class Table_Heading
Found class Table_row
Found class alternat_table_row

P.S. I also recommend BeautifulSoup for parsing HTML with Python.

Question 4

How to print the values like STATION CODE STATION NAME ORIGON ...?

Question 5

@user567879: you can find td tags and process them
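A minimal sketch of that suggestion using only the standard library, assuming Python 3. The class name `CellTextParser` and the shortened `sample` string are hypothetical stand-ins, not part of the original answer. The idea is to remember whether the parser is currently inside a `td`, and to collect text in `handle_data`, which is the method called for the data between tags.

```python
from html.parser import HTMLParser


class CellTextParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row being read, or None outside <tr>
        self._cell = None   # text fragments of the current cell, or None outside <td>

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag == 'td' and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Called for the text between tags; keep it only while inside a <td>.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == 'td' and self._cell is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed markup


# Shortened stand-in for the table in the question (hypothetical sample input).
sample = (
    '<table><tr class="Table_Heading"><td>STATION CODE</td><td>STATION NAME</td></tr>'
    '<tr class="Table_row"><td>TVC </td><td style="width:160px;">ORIGON</td></tr></table>'
)

parser = CellTextParser()
parser.feed(sample)
for row in parser.rows:
    print(' | '.join(row))
```

Tracking the open row and cell as state is what handles the nesting: HTMLParser reports tags one at a time and does not build a tree, so the subclass has to remember where it is.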

Question 6

Sorry for the dumb question. What I need is to process the td tags inside the table tag (I couldn't find a method to print the value between tags). How do I handle that nesting?

Question 7

@user567879: not sure what you're asking exactly... Still, I recommend taking a look at BeautifulSoup: it provides a much higher-level API for HTML processing. HTMLParser is quite awkward to use compared to it.

Question 8

I want to print the data in a tr tag only when class=Table_Heading or class=Table_row or class=alternate_table_row. Can I use an "and" clause to make it work?
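One possible way to do this with the standard library, sketched under the assumption of Python 3. The names `WANTED_CLASSES` and `RowFilterParser` and the `sample` string are hypothetical. Note that the sample HTML in the question spells the third class `alternat_table_row`, and that a row carries a single class value here, so the check is a membership test (an "or" over the allowed names) rather than an "and".

```python
from html.parser import HTMLParser

# Class names as they appear in the question's sample HTML.
WANTED_CLASSES = {'Table_Heading', 'Table_row', 'alternat_table_row'}


class RowFilterParser(HTMLParser):
    """Collect cell text only from <tr> rows whose class is in WANTED_CLASSES."""

    def __init__(self):
        super().__init__()
        self.cells = []         # text of every cell found in a wanted row
        self._in_row = False    # inside a <tr> with a wanted class
        self._in_cell = False   # inside a <td> of such a row

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            # attrs is a list of (name, value) pairs; dict() makes lookup easy.
            self._in_row = dict(attrs).get('class') in WANTED_CLASSES
        elif tag == 'td':
            self._in_cell = self._in_row

    def handle_endtag(self, tag):
        if tag == 'td':
            self._in_cell = False
        elif tag == 'tr':
            self._in_row = self._in_cell = False

    def handle_data(self, data):
        # Skip whitespace-only text between tags.
        if self._in_cell and data.strip():
            self.cells.append(data.strip())


# Hypothetical sample with one row that should be skipped.
sample = (
    '<table><tr class="Table_Heading"><td>STATION CODE</td></tr>'
    '<tr class="other"><td>skipped</td></tr>'
    '<tr class="alternat_table_row"><td>TVP </td><td>NEY YORK</td></tr></table>'
)

parser = RowFilterParser()
parser.feed(sample)
print(parser.cells)
```

This sketch compares the whole class attribute against the set, so a row with several space-separated classes would not match; splitting the value first would be needed in that case.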

Question 9

How to print the values like STATION CODE STATION NAME ORIGON ...?

You can do it like this with BeautifulSoup.

from bs4 import BeautifulSoup  # BeautifulSoup 4; version 3 used "from BeautifulSoup import BeautifulSoup"

html = '''\
<td>STATION CODE</td><td>STATION NAME</td><td>SCHEDULED ARRIVAL</td><td>SCHEDULED DEPARTURE</td><td>ACTUAL/ EXPECTED ARRIVAL</td><td>ACTUAL/ EXPECTED DEPARTURE</td>
</tr><tr class="Table_row">
<td>TVC </td><td style="width:160px;">ORIGON</td><td>Starting Station </td><td>05:00, 07 May 2011</td><td>Starting Station</td><td>05:00, 07 May 2011</td>
'''
soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all('td', limit=2)  # the first two cells (the first two headings)
tag_o = soup.find_all('td')[7]       # the eighth cell overall, which holds ORIGON
for tag in tags:
    print(tag.string)
print(tag_o.string)
'''Output-->
STATION CODE
STATION NAME
ORIGON
'''

Question 10

I would highly recommend using the BeautifulSoup library. It handles even broken HTML with ease.

http://www.crummy.com/software/BeautifulSoup/

Eli Bendersky · Accepted Answer · 2011-05-07 11:40:20Z


