I am using BeautifulSoup to parse an email and extract all the URLs in it. When I extract the <a> tags, the href value returned by link.get("href") differs from the value present in the tag.

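Sample code:

import bs4

# html_code holds the HTML body of the email
soup = bs4.BeautifulSoup(html_code, "html.parser")
html_urls = []  # collected links
for link in soup.find_all("a"):
    print(link)
    url = link.get("href")
    print(url)
    # keep only values that look like http(s) URLs
    if url and "http" in url:
        html_urls.append(url)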

link

<a class="email-link email-textGray email-underline" href="https://medium.com/me/email-settings/276173762aee/75e6c9e76dd0?source=email-276173762aee-1573712256115-digest.reader-------------------------785f89d2_b50d_45eb_8b5e_7392fe13f6cf&amp;type=social" style="color: #8e8e8e; text-decoration: underline;">Unsubscribe</a>

url Click-Here

Type: <class 'bs4.element.Tag'>

Notice that &amp; in the tag has been replaced with & in the extracted url.

Could someone please explain why this happens and what exactly is going on? I traced through the bs4 code as well but couldn't find any leads.

asked Jan 10, 2020 at 7:25
  • I think BeautifulSoup converts an HTML character entity such as ‘&amp;’ to the character it represents, such as ‘&’. If you don’t want this, see stackoverflow.com/questions/23191624/… Commented Jan 10, 2020 at 7:45
  • I don't understand what you want to do. Do you want to escape the characters? check Commented Jan 10, 2020 at 9:03
  • I want the URLs in href to remain intact. But link.get('href') returns modified values. Commented Jan 10, 2020 at 10:43
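
A minimal sketch of the behavior the first comment describes, assuming the standard-library "html.parser" backend and a made-up example.com URL (neither comes from the question). The parser decodes the entity &amp; inside the attribute to a literal &, which is the URL the markup actually encodes, and bs4 escapes it again only when the tag is rendered back to a string. If the escaped form is really what is needed, html.escape can reapply it to the extracted value:

```python
import html

from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the URL is not from the question.
raw = '<a href="https://example.com/page?source=email&amp;type=social">Unsubscribe</a>'

soup = BeautifulSoup(raw, "html.parser")
link = soup.find("a")

# The attribute value is stored decoded: &amp; has become &.
decoded = link.get("href")

# Rendering the tag back to markup escapes the ampersand again.
serialized = str(link)

# Re-escape the extracted value by hand if the entity form is required.
re_escaped = html.escape(decoded, quote=False)

print(decoded)
print(serialized)
print(re_escaped)
```

The decoded value is the one to pass to an HTTP client; the re-escaped value is only appropriate when writing the URL back into HTML.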
