I am very new to BeautifulSoup, so my question may rest on a misunderstanding, but here goes.
I am trying to build a synonym dictionary, because the ones I can currently find are not very good. I am building on someone else's work (the author of PyDictionary), so I am pulling synonyms from http://www.thesaurus.com/
In this example I am trying to pull only the noun synonyms from view-source:http://www.thesaurus.com/browse/animal?s=t
I have found this piece, which indicates that the synonyms in the following relevancy block are nouns:
<div class="synonym-description">
<em class="txt">noun</em>
<strong class="ttl">animate being; mammal</strong>
</div>
<div class="relevancy-block">
<div class="relevancy-list">
My question is essentially: how do I specify that I only want to look in the "relevancy-list" block directly after the <em class="txt">noun</em> element?
After this I want to look for the line
<li><a href="http://www.thesaurus.com/browse/pet" class="common-word" data-id="1" data-category="{"name": "relevant-3", "color": "#fcbb45"}" data-complexity="1" data-length="1"><span class="text">pet</span><span class="star inactive">star</span></a></li>
And pull out the text inside the span with class="text"
Currently I am loading it into an object via:
BeautifulSoup(requests.get(url).text)
Now I am at a loss as to where to go next. I have tried googling, but to no real avail.
3 Answers
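import requests, bs4

url = "http://www.thesaurus.com/browse/animal?s=t"
r = requests.get(url)
# The 'lxml' parser requires the lxml package to be installed
soup = bs4.BeautifulSoup(r.text, 'lxml')
# For every part-of-speech label, grab the first relevancy list that follows it
for txt in soup.find_all(class_="txt"):
    relevancy_list = txt.find_next(class_="relevancy-list")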
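The loop above locates a list after every label but does not yet filter by part of speech or read the words out. Here is a minimal sketch of that next step, run against an inline fragment modelled on the markup quoted in the question rather than the live page. The `HTML` string, its sample words and the `noun_synonyms` helper are made up for illustration, and the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the markup quoted in the question.
HTML = """
<div class="synonym-description">
  <em class="txt">noun</em>
  <strong class="ttl">animate being; mammal</strong>
</div>
<div class="relevancy-block">
  <div class="relevancy-list">
    <ul>
      <li><a href="#" class="common-word"><span class="text">pet</span><span class="star inactive">star</span></a></li>
      <li><a href="#" class="common-word"><span class="text">creature</span><span class="star inactive">star</span></a></li>
    </ul>
  </div>
</div>
<div class="synonym-description">
  <em class="txt">adj</em>
  <strong class="ttl">beastly</strong>
</div>
<div class="relevancy-block">
  <div class="relevancy-list">
    <ul>
      <li><a href="#" class="common-word"><span class="text">wild</span><span class="star inactive">star</span></a></li>
    </ul>
  </div>
</div>
"""


def noun_synonyms(html):
    """Return the words in the relevancy list that follows each 'noun' label."""
    soup = BeautifulSoup(html, "html.parser")
    words = []
    for label in soup.find_all("em", class_="txt"):
        if label.get_text(strip=True) != "noun":
            continue
        # find_next searches forward in document order from the label.
        relevancy_list = label.find_next(class_="relevancy-list")
        if relevancy_list is None:
            continue
        # Read the inner "text" span so the "star" span is not included.
        for span in relevancy_list.find_all("span", class_="text"):
            words.append(span.get_text(strip=True))
    return words


print(noun_synonyms(HTML))
```

Note that `find_next` is not scoped to a parent element: if a noun label had no list of its own, it would return the next list in the document, whichever label that list belongs to.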
You can use the find_all function, where the first argument is the tag name ('div', 'a', etc.) and the second argument is a dictionary of attributes to filter on, such as the class.
soup.find_all('em', {'class':"txt"})
This returns every 'em' element with the class 'txt'.
soup.find_all('div', {'class':"relevancy-block"})
This returns every 'div' element with the class 'relevancy-block'.
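For comparison, here is a small sketch showing that the attribute-dictionary form, the `class_` keyword and a CSS selector passed to `select` all match the same elements. The `HTML` fragment is hypothetical; only the tag and class names are taken from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment; only the tag and class names come from the question.
HTML = """
<div class="synonym-description"><em class="txt">noun</em></div>
<div class="relevancy-block"><div class="relevancy-list"></div></div>
<div class="synonym-description"><em class="txt">adj</em></div>
<div class="relevancy-block"><div class="relevancy-list"></div></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# Attribute dictionary, as in the answer above.
by_dict = soup.find_all("em", {"class": "txt"})
# The class_ keyword (class is a reserved word in Python, hence the underscore).
by_keyword = soup.find_all("em", class_="txt")
# A CSS selector via select().
by_selector = soup.select("em.txt")

labels = [em.get_text() for em in by_dict]
blocks = soup.find_all("div", {"class": "relevancy-block"})
print(labels, len(blocks))
```

Which form to use is mostly a matter of taste; the dictionary form is handy when filtering on several attributes at once.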
I found a way of doing this thanks to both comments I received.
The following code looks at each "filters" block and checks whether its label is an adjective or a noun; if it is a noun, it prints every word in that block classed as common-word:
import requests
from bs4 import BeautifulSoup

def _get_soup_object(url):
    # Name the parser explicitly so bs4 does not have to guess one
    return BeautifulSoup(requests.get(url).text, "html.parser")

term = "animal"
data = _get_soup_object("http://www.thesaurus.com/browse/{0}".format(term))
for selector_var in data.find_all(class_="filters"):
    word_type = selector_var.find_all(class_="txt")
    if word_type[0].text == "adj":
        print("This is an adjective, which we don't want")
    elif word_type[0].text == "noun":
        print("This is a noun, which we do want")
        word_list = selector_var.find_all(class_="common-word")
        for indv_word in word_list:
            # Drop the trailing "star" contributed by the nested star span
            print(indv_word.text[:-4])
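The `[:-4]` slice works only while every link ends in the four-character "star" text. A possible variation, sketched here under assumptions, reads the inner `span` with class "text" instead and returns a list rather than printing. The `HTML` fragment and the `common_words` helper are hypothetical, and the fragment assumes each "filters" element wraps one part-of-speech label together with its word links, as the code above implies.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: assumes each "filters" element wraps one
# part-of-speech label and its word links, as the code above implies.
HTML = """
<div class="filters">
  <em class="txt">noun</em>
  <a class="common-word"><span class="text">pet</span><span class="star inactive">star</span></a>
  <a class="common-word"><span class="text">beast</span><span class="star inactive">star</span></a>
</div>
<div class="filters">
  <em class="txt">adj</em>
  <a class="common-word"><span class="text">wild</span><span class="star inactive">star</span></a>
</div>
"""


def common_words(html, part_of_speech="noun"):
    """Collect common-word synonyms for one part of speech."""
    soup = BeautifulSoup(html, "html.parser")
    words = []
    for block in soup.find_all(class_="filters"):
        label = block.find(class_="txt")
        if label is None or label.get_text(strip=True) != part_of_speech:
            continue
        for link in block.find_all(class_="common-word"):
            # Read the inner span so the trailing "star" text is never included.
            span = link.find("span", class_="text")
            if span is not None:
                words.append(span.get_text(strip=True))
    return words


print(common_words(HTML))
print(common_words(HTML, "adj"))
```

Searching inside each block with `block.find_all` keeps the match scoped to that part of speech, so words from a neighbouring block cannot leak in.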