BeautifulSoup HTML parser modifies a tag href

Asked 5 years, 11 months ago

Viewed 63 times

While using BeautifulSoup to parse an email and extract all the URLs in it, I noticed that when <a> tags are extracted, the href value returned differs from the one present in the tag.

Sample code:

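import bs4

html_urls = []
# html_code holds the HTML body of the email
# an explicit parser avoids bs4's "no parser specified" warning
soup = bs4.BeautifulSoup(html_code, "html.parser")
for link in soup.find_all("a"):
    print(link)
    url = link.get("href")
    print(url)
    if url and "http" in url:
        html_urls.append(url)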

link

<a class="email-link email-textGray email-underline" href="https://medium.com/me/email-settings/276173762aee/75e6c9e76dd0?source=email-276173762aee-1573712256115-digest.reader-------------------------785f89d2_b50d_45eb_8b5e_7392fe13f6cf&amp;type=social" style="color: #8e8e8e; text-decoration: underline;">Unsubscribe</a>

url Click-Here

Type: <class 'bs4.element.Tag'>

Notice the replacement of &amp; with & in the extracted url.

Could someone please explain why this happens and what exactly is going on? I traced the code in bs4 as well but couldn't find any leads.

edited Jan 10, 2020 at 9:01 by αԋɱҽԃ αмєяιcαη

asked Jan 10, 2020 at 7:25 by cisnik

I think BeautifulSoup changes an HTML character entity like ‘&amp;’ to the character it actually stands for, like ‘&’. If you don’t want this, see stackoverflow.com/questions/23191624/…

– sjlee
Commented Jan 10, 2020 at 7:45
I don't understand what you want to do. Do you want to escape the characters? Check

– αԋɱҽԃ αмєяιcαη
Commented Jan 10, 2020 at 9:03
I want the URLs in href to remain intact. But link.get('href') returns modified values.

– cisnik
Commented Jan 10, 2020 at 10:43
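
A minimal sketch of what the comments describe, using only the Python standard library (bs4's "html.parser" backend is built on the same `HTMLParser` class). In HTML source, `&amp;` inside an attribute value is just the encoded form of a literal `&`, so a parser hands back the decoded value; the decoded string is the real URL. The `HrefCollector` class and the example.com URL are made up for illustration and are not from the question.

```python
from html import escape
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper for this sketch)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives with character references already decoded by the parser
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


# Illustrative markup; the href is encoded the same way as in the question
markup = '<a href="https://example.com/settings?source=email&amp;type=social">Unsubscribe</a>'

collector = HrefCollector()
collector.feed(markup)
collector.close()

url = collector.hrefs[0]
print(url)                       # decoded value: contains "&", not "&amp;"
print(escape(url, quote=False))  # re-encoded form, as it appears in the HTML source
```

If the goal is to request or store the URL, the decoded value is the one to use. Re-escaping with `html.escape` is only needed when the string is written back into HTML.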
