How to parse an HTML table in Python

Question 1

I'm a newbie at parsing tables and regular expressions. Can you help me parse this in Python?

<table callspacing="0" cellpadding="0">
 <tbody><tr>
 <td>1text&nbsp;2text</td>
 <td>3text&nbsp;</td>
 </tr>
 <tr>
 <td>4text&nbsp;5text</td>
 <td>6text&nbsp;</td>
 </tr>
</tbody></table>

I need the "3text" and "6text"

Question 2

There is also a standalone package, html-table-parser-python3. It works on table 5 of the Wikipedia page Windturbines_in_Nederland, where BeautifulSoup doesn't.

Question 3

You can use the CSS selector methods select() and select_one() to get "3text" and "6text", like below:

from bs4 import BeautifulSoup
html_doc='''
<table callspacing="0" cellpadding="0">
 <tbody><tr>
 <td>1text&nbsp;2text</td>
 <td>3text&nbsp;</td>
 </tr>
 <tr>
 <td>4text&nbsp;5text</td>
 <td>6text&nbsp;</td>
 </tr>
</tbody></table>
'''
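# The 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(html_doc, 'lxml')
rows = soup.select('tr')
for row in rows:
    # td:nth-child(2) matches the second cell of the row
    print(row.select_one('td:nth-child(2)').text)

You can also use the find_all() method:

trs = soup.find('table').find_all('tr')
for tr in trs:
    tds = tr.find_all('td')
    # Index 1 is the second cell of the row
    print(tds[1].text)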

Result:

3text 
6text

Question 4

The best way is to use BeautifulSoup:

from bs4 import BeautifulSoup
html_doc='''
<table callspacing="0" cellpadding="0">
 <tbody><tr>
 <td>1text&nbsp;2text</td>
 <td>3text&nbsp;</td>
 </tr>
 <tr>
 <td>4text&nbsp;5text</td>
 <td>6text&nbsp;</td>
 </tr>
</tbody></table>
'''
soup = BeautifulSoup(html_doc, "html.parser")
# Find every <tr> (row) tag
for row in soup.find_all("tr"):
    # Find every <td> (cell) tag inside the row
    for cell in row.find_all("td"):
        # Print the text content of the cell
        print(cell.text)

In this case it prints:

1text 2text
3text 
4text 5text
6text

But you can grab just the text you want by indexing. In this case you could simply use:

# Find every <tr> (row) tag
for row in soup.find_all("tr"):
    # Print the text of the second <td> (index 1) in the row
    print(row.find_all("td")[1].text)

Question 5

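Using pandas (here html_table is assumed to be a string holding the table HTML from the question; recent pandas versions expect such a string to be wrapped in io.StringIO):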

In [8]: import pandas as pd
In [9]: df = pd.read_html(html_table)[0]
In [10]: df[1]
Out[10]:
0 3text
1 6text
Name: 1, dtype: object

Question 6

You could use Python's html.parser: https://docs.python.org/3/library/html.parser.html

The custom parser class below keeps track of a little parsing state. Since you want the second cell of each row, each new row resets the cell counter (index) and each cell increments it.

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_cell = False   # True while inside a <td>
        self.cell_index = -1   # index of the current cell within its row

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            # A new row starts: reset the cell counter
            self.cell_index = -1
        if tag == 'td':
            self.in_cell = True
            self.cell_index += 1

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False

    def handle_data(self, data):
        # Only print text found inside the second cell (index 1) of a row
        if self.in_cell and self.cell_index == 1:
            print(data.strip())

parser = MyHTMLParser()
parser.feed('''<table callspacing="0" cellpadding="0">
 <tbody><tr>
 <td>1text&nbsp;2text</td>
 <td>3text&nbsp;</td>
 </tr>
 <tr>
 <td>4text&nbsp;5text</td>
 <td>6text&nbsp;</td>
 </tr>
</tbody></table>''')

outputs:

> python -u "html_parser_test.py"
3text
6text
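
If you want the cells as data rather than printed output, the same state-tracking idea can be extended to collect every row into a list. This is only a sketch: the class name TableRowParser and its rows attribute are made up for illustration, and it assumes a simple table without nested tables.

```python
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Hypothetical helper: collect cell text grouped by <tr> row."""

    def __init__(self):
        # convert_charrefs defaults to True, so &nbsp; arrives as '\xa0'
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row being parsed, None outside <tr>
        self._cell = None  # text fragments of the current cell, None outside <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # str.strip() also removes the non-breaking space at the edges
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


html_doc = """<table callspacing="0" cellpadding="0">
 <tbody><tr>
 <td>1text&nbsp;2text</td>
 <td>3text&nbsp;</td>
 </tr>
 <tr>
 <td>4text&nbsp;5text</td>
 <td>6text&nbsp;</td>
 </tr>
</tbody></table>"""

parser = TableRowParser()
parser.feed(html_doc)
parser.close()

# Pick the second cell of every row that has at least two cells
second_column = [row[1] for row in parser.rows if len(row) > 1]
print(second_column)
```

Keeping the rows in a list means any column can be picked afterwards, and the len(row) > 1 check skips rows that are too short instead of raising an IndexError.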

Question 7

Since your question has the beautifulsoup tag attached, I am going to assume that you are happy using this module to tackle the problem you are having. My solution also makes use of the built-in unicodedata module to normalize any special characters that come from escaped entities in the HTML (e.g. &nbsp;).

To parse the table so that you have access to the second field from each row within the table (as per your question), please see the below code/comments.

from bs4 import BeautifulSoup
import unicodedata
table = '''<table callspacing="0" cellpadding="0">
 <tbody><tr>
 <td>1text&nbsp;2text</td>
 <td>3text&nbsp;</td>
 </tr>
 <tr>
 <td>4text&nbsp;5text</td>
 <td>6text&nbsp;</td>
 </tr>
</tbody></table>'''
soup = BeautifulSoup(table, 'html.parser')  # Parse the HTML table
tableData = soup.find_all('td')  # Get a list of all <td> tags from the table
# Store the normalized text of every 2nd <td> tag in a list. NFKC turns the
# non-breaking space into a regular space. Taking every 2nd tag assumes that
# each row has exactly two cells.
output = [unicodedata.normalize('NFKC', d.text) for i, d in enumerate(tableData) if i % 2 != 0]
print(output)
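
To see what the normalization step is doing, here is a small standard-library sketch. It assumes the cell text reaches you with &nbsp; already decoded to the non-breaking space character (U+00A0), which html.unescape reproduces here; the variable names are only for illustration.

```python
import html
import unicodedata

raw = "3text&nbsp;"

# Decoding the entity gives a non-breaking space, not a regular space
decoded = html.unescape(raw)
print(repr(decoded))

# NFKC normalization maps the non-breaking space to a regular space
normalized = unicodedata.normalize("NFKC", decoded)
print(repr(normalized))

# str.strip() treats the non-breaking space as whitespace and removes it
stripped = decoded.strip()
print(repr(stripped))
```

So if you only need to drop the trailing character, strip() is enough; normalization matters more when the non-breaking space sits in the middle of the text, as in "1text&nbsp;2text".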

Question 8

Try this:

from bs4 import BeautifulSoup
html="""
<table callspacing="0" cellpadding="0">
 <tbody><tr>
 <td>1text&nbsp;2text</td>
 <td>3text&nbsp;</td>
 </tr>
 <tr>
 <td>4text&nbsp;5text</td>
 <td>6text&nbsp;</td>
 </tr>
</tbody></table>"""
soup = BeautifulSoup(html, 'html.parser')
for tr_soup in soup.find_all('tr'):
    td_soup = tr_soup.find_all('td')
    # Second cell of the row, with surrounding whitespace removed
    print(td_soup[1].text.strip())

Accepted Answer (the select() / select_one() answer above) · Humayun Ahmad Rajib · 2020-07-22 09:01:55Z
