\$\begingroup\$

I'm scraping a table and am trying to get nested td tags down the tbody tree but the code seems kind of verbose. Is there a more Pythonic way to do this?

import urllib.request

from bs4 import BeautifulSoup


def get_top_subreddits(url):
    r = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(r, "lxml")
    body = soup.find_all('tbody')
    top_subreddits = []
    for i in body:
        trs = i.find_all('tr')
        for tr in trs:
            tds = tr.find_all('td')
            for td in tds:
                texts = td.get_text()
                # Keep only cells whose text mentions a subreddit
                if '/r/' in texts:
                    top_subreddits.append(texts)
    return top_subreddits
asked Sep 19, 2017 at 3:32
\$\endgroup\$
  • What does the HTML look like? Are there any <td> that appear outside a <tr> in a <tbody>? Commented Sep 19, 2017 at 4:47

1 Answer

\$\begingroup\$

Yes, there is a more concise way to do it - using CSS selectors and a list comprehension:

top_subreddits = [
    td.get_text()
    for td in soup.select("tbody tr td")
    if '/r/' in td.text
]

The selector tbody tr td matches every td element that is a descendant of a tr element, which in turn is a descendant of a tbody.
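
To check that the selector really collects the same cells as the nested loops, here is a minimal sketch run against made-up markup. The SAMPLE_HTML string and the helper names nested_loops and css_selector are hypothetical, since the question does not show the real page, and "html.parser" is used only so the sketch does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page's structure is not shown in the question.
SAMPLE_HTML = """
<table>
  <thead><tr><td>/r/header_only</td></tr></thead>
  <tbody>
    <tr><td>1</td><td>/r/python</td></tr>
    <tr><td>2</td><td>/r/learnpython</td></tr>
  </tbody>
</table>
"""


def nested_loops(soup):
    """The question's approach: walk tbody -> tr -> td explicitly."""
    found = []
    for body in soup.find_all("tbody"):
        for tr in body.find_all("tr"):
            for td in tr.find_all("td"):
                text = td.get_text()
                if "/r/" in text:
                    found.append(text)
    return found


def css_selector(soup):
    """The same filter expressed with a descendant selector."""
    return [
        td.get_text()
        for td in soup.select("tbody tr td")
        if "/r/" in td.get_text()
    ]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
loop_result = nested_loops(soup)
select_result = css_selector(soup)
print(loop_result)
print(select_result)
```

On this sample both functions skip the cell inside thead, because it has no tbody ancestor. Results may differ on pages with nested tables: find_all searches all descendants by default, so the loops can visit the same inner cell more than once.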

I don't really like getting the text of each td element twice here; there may be a way to filter for the desired information directly. For example, if you are after the subreddit links, the /r/ check can move into the selector itself:

top_subreddits = [
    a.get_text()
    for a in soup.select('tbody tr td a[href^="/r/"]')
]

^= here means "starts with": the selector matches only a elements whose href attribute value begins with /r/.
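
A small sketch of how the "starts with" match behaves, again on made-up markup (SAMPLE_HTML and its links are assumptions, not the real page). It also shows one possible way to keep the text-based filter while reading each cell's text only once, by feeding a generator of texts into the comprehension.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two subreddit links, one user link, one external link.
SAMPLE_HTML = """
<table><tbody>
  <tr>
    <td><a href="/r/python">/r/python</a></td>
    <td><a href="/user/someone">someone</a></td>
  </tr>
  <tr>
    <td><a href="https://example.com/r/elsewhere">external</a></td>
    <td><a href="/r/learnpython">/r/learnpython</a></td>
  </tr>
</tbody></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# href^="/r/" keeps only links whose href begins with /r/, so the external
# URL that merely contains /r/ later in the string is not matched.
links = [a.get_text() for a in soup.select('tbody tr td a[href^="/r/"]')]

# Text-based alternative: extract each cell's text once, then filter.
texts = (td.get_text() for td in soup.select("tbody tr td"))
by_text = [text for text in texts if "/r/" in text]

print(links)
print(by_text)
```

Which variant fits depends on the page: the attribute selector needs the subreddit names to be links with relative hrefs, while the text filter works on plain cells too.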

answered Sep 19, 2017 at 13:09
\$\endgroup\$
