I'm scraping a table and am trying to get the nested td tags down the tbody tree, but the code seems kind of verbose. Is there a more Pythonic way to do this?
import urllib.request
from bs4 import BeautifulSoup

def get_top_subreddits(url):
    r = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(r, "lxml")
    body = soup.find_all('tbody')
    top_subreddits = []
    for i in body:
        trs = i.find_all('tr')
        for tr in trs:
            tds = tr.find_all('td')
            for td in tds:
                text = td.get_text()
                if '/r/' in text:
                    top_subreddits.append(text)
    return top_subreddits
Yes, there is a more concise way to do it - using CSS selectors and a list comprehension:
top_subreddits = [
    td.get_text()
    for td in soup.select("tbody tr td")
    if '/r/' in td.text
]
tbody tr td would locate all td elements under tr elements which are under tbody.
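To see the selector in action, here is a minimal sketch on a hypothetical table snippet (using the stdlib html.parser instead of lxml so nothing extra needs to be installed):

```python
from bs4 import BeautifulSoup

# Hypothetical minimal table, just for illustration.
html = """
<table>
  <tbody>
    <tr><td>/r/python</td><td>12345</td></tr>
    <tr><td>/r/learnpython</td><td>6789</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# "tbody tr td" is a descendant selector: any <td> inside a <tr>
# that is itself inside a <tbody>.
cells = [td.get_text() for td in soup.select("tbody tr td")]
print(cells)  # ['/r/python', '12345', '/r/learnpython', '6789']

# The '/r/' filter from the list comprehension keeps only the subreddit cells.
subs = [text for text in cells if '/r/' in text]
print(subs)  # ['/r/python', '/r/learnpython']
```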
I don't really like getting the text of the td elements twice here; there is possibly a way to filter the desired information directly. E.g., if you were after the subreddit links, we could apply the /r/ check inside the selector:
top_subreddits = [
    a.get_text()
    for a in soup.select('tbody tr td a[href^="/r/"]')
]
^= here means "starts with".
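A quick sketch of the attribute-prefix selector, again on hypothetical markup with the stdlib html.parser, shows that only anchors whose href starts with "/r/" are matched:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mixing a subreddit link with an external link.
html = """
<table>
  <tbody>
    <tr><td><a href="/r/python">/r/python</a></td></tr>
    <tr><td><a href="https://example.com">elsewhere</a></td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# a[href^="/r/"] matches only <a> tags whose href attribute begins with "/r/",
# so no separate '/r/' check on the text is needed.
subs = [a.get_text() for a in soup.select('tbody tr td a[href^="/r/"]')]
print(subs)  # ['/r/python']
```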