While using BeautifulSoup to parse an email and extract all the URLs in it,
I noticed that the href value returned for an <a> tag differs from the value shown in the tag itself.
Sample code:
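import bs4

html_urls = []  # collected absolute URLs
# html_code holds the HTML body of the email
soup = bs4.BeautifulSoup(html_code, "html.parser")
for link in soup.find_all("a"):
    print(link)
    url = link.get("href")
    print(url)
    if url and "http" in url:
        html_urls.append(url)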
link
<a class="email-link email-textGray email-underline" href="https://medium.com/me/email-settings/276173762aee/75e6c9e76dd0?source=email-276173762aee-1573712256115-digest.reader-------------------------785f89d2_b50d_45eb_8b5e_7392fe13f6cf&type=social" style="color: #8e8e8e; text-decoration: underline;">Unsubscribe</a>
url Click-Here
Type: <class 'bs4.element.Tag'>
Notice the replacement of &amp; with & in the extracted href value.
Could someone please explain why this happens and what exactly is going on? I traced through the bs4 code as well but couldn't find any leads.
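For anyone trying to reproduce this without an email at hand, here is a minimal sketch that uses only the standard library's html.parser (one of the parsers bs4 can be built on). The markup, the example.com URL, and the HrefCollector class are made up for illustration. It suggests the difference is a decoding step rather than a modification: the parser hands over attribute values with character references already decoded, and escaping the value again when writing it back out as HTML restores the &amp; form.

```python
from html import escape
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper for illustration)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # The parser passes attribute values with character references
        # such as &amp; already decoded.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


# Made-up markup: the ampersand is written as &amp;, as it appears in HTML source.
markup = '<a href="https://example.com/settings?source=email&amp;type=social">Unsubscribe</a>'

collector = HrefCollector()
collector.feed(markup)
collector.close()

decoded = collector.hrefs[0]
print(decoded)          # decoded attribute value, with a bare &
print(escape(decoded))  # escaped again for HTML output, with &amp;
```

Under that reading, the value from link.get("href") is the one to pass to an HTTP client or URL parser, while the &amp; form only belongs inside HTML markup.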
αԋɱҽԃ αмєяιcαη
link.get('href') returns modified values.