I'm scraping a table and am trying to get the nested td tags down the tbody tree, but the code seems kind of verbose. Is there a more Pythonic way to do this?
import urllib.request
from bs4 import BeautifulSoup

def get_top_subreddits(url):
    r = urllib.request.urlopen(url).read()
    soup = BeautifulSoup(r, "lxml")
    body = soup.find_all('tbody')
    top_subreddits = []
    for i in body:
        trs = i.find_all('tr')
        for tr in trs:
            tds = tr.find_all('td')
            for td in tds:
                text = td.get_text()
                if '/r/' in text:
                    top_subreddits.append(text)
    return top_subreddits
Yes, there is a more concise way to do it - using CSS selectors and a list comprehension:
top_subreddits = [
    td.get_text()
    for td in soup.select("tbody tr td")
    if '/r/' in td.text
]
tbody tr td would locate all td elements under tr elements which are under tbody.
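To see the selector in action, here is a minimal sketch on a hypothetical table snippet (using the stdlib html.parser instead of lxml so nothing extra needs to be installed):

```python
from bs4 import BeautifulSoup

# Hypothetical minimal table, just for illustration.
html = """
<table>
  <tbody>
    <tr><td>/r/python</td><td>12345</td></tr>
    <tr><td>/r/learnpython</td><td>6789</td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# "tbody tr td" is a descendant selector: any <td> inside a <tr>
# that is itself inside a <tbody>.
cells = [td.get_text() for td in soup.select("tbody tr td")]
print(cells)  # ['/r/python', '12345', '/r/learnpython', '6789']

# The '/r/' filter from the list comprehension keeps only the subreddit cells.
subs = [text for text in cells if '/r/' in text]
print(subs)  # ['/r/python', '/r/learnpython']
```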
I don't really like getting the text of the td elements twice here; there is possibly a way to filter the desired information directly. E.g., if you were after the subreddit links, we could apply the /r/ check inside the selector:
top_subreddits = [
    a.get_text()
    for a in soup.select('tbody tr td a[href^="/r/"]')
]
^= here means "starts with".
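A quick sketch of the attribute-prefix selector, again on hypothetical markup with the stdlib html.parser, shows that only anchors whose href starts with "/r/" are matched:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mixing a subreddit link with an external link.
html = """
<table>
  <tbody>
    <tr><td><a href="/r/python">/r/python</a></td></tr>
    <tr><td><a href="https://example.com">elsewhere</a></td></tr>
  </tbody>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# a[href^="/r/"] matches only <a> tags whose href attribute begins with "/r/",
# so no separate '/r/' check on the text is needed.
subs = [a.get_text() for a in soup.select('tbody tr td a[href^="/r/"]')]
print(subs)  # ['/r/python']
```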