I'm new to parsing tables and regular expressions. Can you help me parse this in Python?

<table callspacing="0" cellpadding="0">
 <tbody><tr>
 <td>1text&nbsp;2text</td>
 <td>3text&nbsp;</td>
 </tr>
 <tr>
 <td>4text&nbsp;5text</td>
 <td>6text&nbsp;</td>
 </tr>
</tbody></table>

I need "3text" and "6text".

denis
asked Jul 22, 2020 at 8:39

6 Answers

You can use the CSS selector methods select() and select_one() to get "3text" and "6text", as shown below:

from bs4 import BeautifulSoup

html_doc = '''
<table callspacing="0" cellpadding="0">
 <tbody><tr>
 <td>1text&nbsp;2text</td>
 <td>3text&nbsp;</td>
 </tr>
 <tr>
 <td>4text&nbsp;5text</td>
 <td>6text&nbsp;</td>
 </tr>
</tbody></table>
'''

# 'lxml' requires the third-party lxml package to be installed
soup = BeautifulSoup(html_doc, 'lxml')
rows = soup.select('tr')
for row in rows:
    # td:nth-child(2) matches the second cell of the row
    print(row.select_one('td:nth-child(2)').text)

You can also use the find_all() method:

trs = soup.find('table').find_all('tr')
for i in trs:
    tds = i.find_all('td')
    print(tds[1].text)

Result:

3text 
6text 
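
The trailing character on each printed line is the non-breaking space that &nbsp; decodes to. Below is a minimal sketch of one way to drop it, assuming the same table as in the question. It uses the built-in 'html.parser' backend so that lxml is not needed, and the names second_cells, row and cells are only illustrative:

```python
from bs4 import BeautifulSoup

html_doc = '''
<table callspacing="0" cellpadding="0">
 <tbody><tr>
 <td>1text&nbsp;2text</td>
 <td>3text&nbsp;</td>
 </tr>
 <tr>
 <td>4text&nbsp;5text</td>
 <td>6text&nbsp;</td>
 </tr>
</tbody></table>
'''

soup = BeautifulSoup(html_doc, 'html.parser')

second_cells = []
for row in soup.find_all('tr'):
    cells = row.find_all('td')
    # Skip rows with fewer than two cells instead of raising IndexError
    if len(cells) > 1:
        # get_text(strip=True) trims surrounding whitespace, including the
        # non-breaking space (\xa0) that &nbsp; decodes to
        second_cells.append(cells[1].get_text(strip=True))

print(second_cells)  # ['3text', '6text']
```

Collecting the values in a list rather than printing them directly makes them easier to reuse later in a script.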
answered Jul 22, 2020 at 9:01

The best way is to use BeautifulSoup:

from bs4 import BeautifulSoup

html_doc = '''
<table callspacing="0" cellpadding="0">
 <tbody><tr>
 <td>1text&nbsp;2text</td>
 <td>3text&nbsp;</td>
 </tr>
 <tr>
 <td>4text&nbsp;5text</td>
 <td>6text&nbsp;</td>
 </tr>
</tbody></table>
'''

soup = BeautifulSoup(html_doc, "html.parser")
# find all tr tags
for i in soup.find_all("tr"):
    # find all td tags inside the current tr tag
    for k in i.find_all("td"):
        # print the text of each td tag
        print(k.text)

In this case it prints:

1text 2text
3text 
4text 5text
6text 

You can pick out the text you want by indexing. In this case you could simply use:

# find all tr tags
for i in soup.find_all("tr"):
    # take the second td tag of each tr tag
    print(i.find_all("td")[1].text)
answered Jul 22, 2020 at 9:10

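Using pandas (html_table is assumed to hold the HTML string from the question; recent pandas versions expect literal HTML to be wrapped in io.StringIO, and read_html needs a parser such as lxml installed):

In [7]: from io import StringIO
In [8]: import pandas as pd
In [9]: df = pd.read_html(StringIO(html_table))[0]
In [10]: df[1]
Out[10]:
0    3text
1    6text
Name: 1, dtype: object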
answered Jul 22, 2020 at 9:02

You could use Python's html.parser: https://docs.python.org/3/library/html.parser.html

The custom parser class keeps track of a little parsing state. Since you want the second cell of each row, every new row resets the cell counter (index) and every cell increments it.

from html.parser import HTMLParser

class MyHTMLParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.cell_index = -1

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            # a new row starts: reset the cell counter
            self.cell_index = -1
        if tag == 'td':
            self.in_cell = True
            self.cell_index += 1
        # print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        if tag == 'td':
            self.in_cell = False
        # print("Encountered an end tag :", tag)

    def handle_data(self, data):
        # only print text found inside the second cell of a row
        if self.in_cell and self.cell_index == 1:
            print(data.strip())

parser = MyHTMLParser()
parser.feed('''<table callspacing="0" cellpadding="0">
 <tbody><tr>
 <td>1text&nbsp;2text</td>
 <td>3text&nbsp;</td>
 </tr>
 <tr>
 <td>4text&nbsp;5text</td>
 <td>6text&nbsp;</td>
 </tr>
</tbody></table>''')

Output:

> python -u "html_parser_test.py"
3text
6text
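
If you would rather get the values back as a list than print them, the same state-tracking idea can be sketched as follows. The class name SecondCellParser and the helper second_cells are hypothetical names chosen for illustration, and the sketch assumes plain, non-nested tables like the one in the question:

```python
from html.parser import HTMLParser


class SecondCellParser(HTMLParser):
    """Collects the text of the second <td> of every <tr> (illustrative helper)."""

    def __init__(self):
        # convert_charrefs defaults to True, so &nbsp; arrives as '\xa0'
        super().__init__()
        self.cell_index = -1   # index of the current cell within its row
        self.in_cell = False   # True while between <td> and </td>
        self.parts = []        # text chunks of the current cell
        self.results = []      # collected second-cell texts

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.cell_index = -1
        elif tag == 'td':
            self.cell_index += 1
            self.in_cell = True
            self.parts = []

    def handle_endtag(self, tag):
        if tag == 'td' and self.in_cell:
            self.in_cell = False
            if self.cell_index == 1:
                # str.strip() also removes the non-breaking space
                self.results.append(''.join(self.parts).strip())

    def handle_data(self, data):
        if self.in_cell:
            self.parts.append(data)


def second_cells(html):
    """Return the stripped text of the second cell of each table row."""
    parser = SecondCellParser()
    parser.feed(html)
    parser.close()
    return parser.results


sample = ('<table><tr><td>1text&nbsp;2text</td><td>3text&nbsp;</td></tr>'
          '<tr><td>4text&nbsp;5text</td><td>6text&nbsp;</td></tr></table>')
print(second_cells(sample))  # ['3text', '6text']
```

Joining the chunks at the closing tag, instead of printing inside handle_data, keeps a cell's text together even if the parser delivers it in several pieces.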
answered Jul 22, 2020 at 9:08

Since your question has the beautifulsoup tag attached, I am going to assume that you are happy to use this module for your problem. My solution also uses the built-in unicodedata module to normalize the characters that HTML entities decode to (e.g. &nbsp; becomes a non-breaking space, which NFKC normalization turns into a regular space).

To get the second field from each row of the table (as per your question), see the code and comments below.

from bs4 import BeautifulSoup
import unicodedata

table = '''<table callspacing="0" cellpadding="0">
 <tbody><tr>
 <td>1text&nbsp;2text</td>
 <td>3text&nbsp;</td>
 </tr>
 <tr>
 <td>4text&nbsp;5text</td>
 <td>6text&nbsp;</td>
 </tr>
</tbody></table>'''

soup = BeautifulSoup(table, 'html.parser')  # Parse the HTML table
tableData = soup.find_all('td')  # List of all <td> tags in the table

# Keep every second <td> (odd index); this assumes each row has exactly two cells.
# NFKC normalization turns the non-breaking space from &nbsp; into a regular space.
output = [unicodedata.normalize('NFKC', d.text) for i, d in enumerate(tableData) if i % 2 != 0]
print(output)  # ['3text ', '6text ']
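
To make the normalization step concrete, here is a small standard-library sketch. It uses html.unescape as a stand-in for the entity decoding that BeautifulSoup performs while parsing; that substitution is an assumption made only to keep the example free of dependencies:

```python
import html
import unicodedata

raw = '3text&nbsp;'

# Decode the entity: &nbsp; becomes U+00A0 (non-breaking space)
decoded = html.unescape(raw)

# NFKC replaces compatibility characters such as U+00A0 with their
# plain equivalents, here an ordinary space (U+0020)
normalized = unicodedata.normalize('NFKC', decoded)

print(repr(decoded))             # '3text\xa0'
print(repr(normalized))          # '3text '
print(repr(normalized.strip()))  # '3text'
```

Normalization only swaps the non-breaking space for a regular one, so a final strip() is still needed if you want exactly "3text" and "6text".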
answered Jul 22, 2020 at 9:12

Try this:

from bs4 import BeautifulSoup
html="""
<table callspacing="0" cellpadding="0">
 <tbody><tr>
 <td>1text&nbsp;2text</td>
 <td>3text&nbsp;</td>
 </tr>
 <tr>
 <td>4text&nbsp;5text</td>
 <td>6text&nbsp;</td>
 </tr>
</tbody></table>"""
soup = BeautifulSoup(html, 'html.parser')
for tr_soup in soup.find_all('tr'):
    td_soup = tr_soup.find_all('td')
    print(td_soup[1].text.strip())
Saba
answered Jul 22, 2020 at 9:07
