Post Closed as "Duplicate" by Antti Haapala
deleted 57 characters in body; edited tags
Josh Lee

I'm doing some web scraping, and sites frequently use HTML entities to represent non-ASCII characters. Does Python have a utility that takes a string with HTML entities and returns a unicode type?

For example:

I get back:

&#x01ce;

which represents "ǎ", an "a" with a tone mark. In hexadecimal, this is the 16-bit value 01ce. I want to convert the HTML entity into the value u'\u01ce'.
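
As an illustration of the conversion being asked for (not part of the original post), here is a minimal sketch that assumes Python 3.4 or later, where the standard library provides `html.unescape` and `str` is already a Unicode type. The helper name `decode_entities` is hypothetical.

```python
import html


def decode_entities(text):
    """Hypothetical helper: replace HTML character references in text
    with the Unicode characters they stand for."""
    # html.unescape handles numeric references (hex and decimal)
    # as well as named ones such as &amp;
    return html.unescape(text)


sample = "&#x01ce;"
decoded = decode_entities(sample)

# Compare against the code point mentioned in the question.
print(decoded == "\u01ce")
```

The same call handles the decimal form (`&#462;`) and named entities, so mixed input from scraped pages should not need special-casing under these assumptions.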

Cristian

Convert XML/HTML Entities into Unicode String in Python

I'm doing some web scraping, and sites frequently use HTML entities to represent non-ASCII characters. Does Python have a utility that takes a string with HTML entities and returns a unicode type?

For example:

I get back: & #x01ce; (There is no space; I added it so that Markdown won't interpret the entity), which represents an "a" with a tone mark. In hexadecimal, this is the 16-bit value 01ce. I want to convert the HTML entity into the value u'\u01ce'.
