Python Text Encoding

Question 1

I have this text in a file - Recu&#xE9;rdame (notice it's a French word). When I read this file with a Python script, I get this text as Recu&#xE9;rdame.

I read it as a Unicode string. Do I need to find out what the encoding of the text is and decode it, or is my terminal playing tricks on me?

Question 2

effbot might be able to help you: effbot.org/zone/unicode-objects.htm

Question 3

possible duplicate of Convert XML/HTML Entities into Unicode String in Python

Question 4

Actually, I think this is Spanish (never heard this in French, anyway).

Question 5

Yes, you need to know the encoding of the text file to turn it into a Unicode string (from the bytes that make up the file).

For example, if you know the encoding is UTF-8:

with open('foo.txt', 'rb') as f:
    contents = f.read().decode('utf-8-sig')  # -sig takes care of BOM if present

The text in your file seems not to be encoded Unicode, however; the accented character is apparently stored as an XML entity, which will have to be converted manually (tip of the hat to jleedev for the link).
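A minimal sketch of both steps, assuming Python 3 (where `str` is the Unicode type). The file is a throwaway sample created just for the demonstration, and `html.unescape` from the standard library is used here as one possible way to convert the reference; it is not necessarily what the linked answer does.

```python
import html
import os
import tempfile

# Hypothetical sample file: UTF-8 with a BOM, holding a character reference.
path = os.path.join(tempfile.mkdtemp(), "foo.txt")
with open(path, "wb") as f:
    f.write(b"\xef\xbb\xbfRecu&#xE9;rdame")

# Step 1: bytes -> text, using the known encoding ("-sig" drops the BOM).
with open(path, "rb") as f:
    contents = f.read().decode("utf-8-sig")

# Step 2: the text still contains the literal reference, so unescape it.
decoded = html.unescape(contents)

# contents is "Recu&#xE9;rdame"; decoded is "Recuérdame".
```

Decoding the bytes and unescaping the reference are separate problems: the first depends on the file's encoding, the second only on the markup inside the text.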

Question 6

what's BOM (in context of -sig) ?

Question 7

@MovieYoda: Ah, check out this article. Basically, when it takes multiple bytes together to represent a single character (as can be the case with UTF-8), those bytes could be interpreted in a different order than intended (this order is called endianness). Because of this, a special unambiguous (and optional, in the case of UTF-8) mark is placed at the beginning of the file to indicate the endianness of the file. -sig removes the BOM if it's present so you don't get the marker appearing as part of your Unicode string.
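To make the effect of -sig concrete, here is a small sketch assuming Python 3; the byte strings are made-up sample data. It decodes the same bytes with and without -sig, and uses UTF-16 to show what byte order means.

```python
import codecs

# Sample data: the UTF-8 BOM followed by three ASCII letters.
data = codecs.BOM_UTF8 + "abc".encode("utf-8")  # b'\xef\xbb\xbfabc'

plain = data.decode("utf-8")      # BOM survives as the character U+FEFF
clean = data.decode("utf-8-sig")  # BOM is stripped

print(repr(plain))  # '\ufeffabc'
print(repr(clean))  # 'abc'

# Without a BOM, "utf-8-sig" decodes exactly like "utf-8".
no_bom = b"abc".decode("utf-8-sig")

# Byte order only matters for multi-byte code units such as UTF-16:
little = "a".encode("utf-16-le")  # b'a\x00'
big = "a".encode("utf-16-be")     # b'\x00a'
```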

Question 8

It is not a Unicode string. It's a string in whatever encoding it is encoded in. Hence it's a UTF-8 or a Latin-1 or something else string. In this case, &#xE9; is an HTML/XML entity representing é, specifically. It's an encoding used in HTML and XML to encode non-ASCII data.

To decode that into Unicode, look at Fredrik Lundh's method: http://effbot.org/zone/re-sub.htm#unescape-html

Question 9

Yes and no. It represents a numeric code point. You can’t say it’s an escaped UTF-8 character. It may be a Unicode character, but that’s something different.

Question 10

Sure, all characters that exist in the set of Unicode characters are Unicode characters, of course. But with that definition, anything that can be decoded into Unicode is a Unicode string, including ASCII strings, and then the term "Unicode string" loses all meaning. A Unicode string is a string of Unicode data, and in Python, that's something held in a Unicode object. Anything that is encoded should not be called a Unicode string; it just makes people confused.
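The distinction can be sketched in a few lines, assuming Python 3, where the "Unicode object" is `str` and encoded data is `bytes` (Python 2 spelled these `unicode` and `str`). The variable names are for illustration only.

```python
text = "\u00e9"  # é as a Unicode string: one code point, U+00E9

# The same character as encoded bytes differs per encoding.
as_utf8 = text.encode("utf-8")      # b'\xc3\xa9'
as_latin1 = text.encode("latin-1")  # b'\xe9'

# Decoding with the matching codec gives back the same Unicode string.
from_utf8 = as_utf8.decode("utf-8")
from_latin1 = as_latin1.decode("latin-1")

# The character reference is plain ASCII text that names the code point.
reference = "&#xE9;"
code_point = int(reference[3:-1], 16)  # 233
from_reference = chr(code_point)
```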

Question 11

It is HTML, and this construct is called an "entity". You can use

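import re

def entity_decode(match):
    # Groups: the whole entity, an optional "x" (hex marker), the digits.
    _, is_hex, entity = match.groups()
    base = 16 if is_hex else 10
    return unichr(int(entity, base))  # Python 2; use chr() on Python 3

print re.sub(r"(?i)(&#(x?)([^;]+);)",
             entity_decode,
             "Recu&#xE9;rdame")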

to decode all entities.

Edit: Yes, they are of course not latin1, now it should work with all entities

Question 12

No, there are entities that are not Latin-1, such as Α, a Greek alpha. They are UCS-2, which uses two bytes per character and is quite tricky to combine with your technique.

Question 13

It was a problem with your Latin-1 decoding technique, yes. Now you are using unichr, which works with numeric entities. It still, however, does not work with named entities. And once you add that, your code will be the same as effbot's code, which everyone else links to already. :-)
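For illustration, a sketch of what adding named entities could look like, assuming Python 3 (`chr` instead of `unichr`) and the standard library's `html.entities.name2codepoint` table. The function names are hypothetical, and this is not effbot's code.

```python
import re
from html.entities import name2codepoint

# Matches hex references, decimal references, or named entities.
ENTITY_RE = re.compile(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|\w+);")

def entity_decode(match):
    body = match.group(1)
    if body[:2] in ("#x", "#X"):
        return chr(int(body[2:], 16))     # hexadecimal, e.g. &#xE9;
    if body.startswith("#"):
        return chr(int(body[1:]))         # decimal, e.g. &#233;
    if body in name2codepoint:
        return chr(name2codepoint[body])  # named, e.g. &eacute;
    return match.group(0)                 # unknown name: leave as-is

def unescape(text):
    return ENTITY_RE.sub(entity_decode, text)

numeric = unescape("Recu&#xE9;rdame")
named = unescape("Recu&eacute;rdame")
greek = unescape("&#913; &Alpha;")
unknown = unescape("&nosuch;")
```

On Python 3.4 and later, `html.unescape` in the standard library covers the same cases, so a hand-written version like this is mainly useful for seeing how the pieces fit.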

Question 14

Working with xlrd, I have in a line ...xl_data.find(str(cell_value))... which gives the error: "'ascii' codec can't encode character u'\xdf' in position 3: ordinal not in range(128)". All suggestions in the forums have been useless for my German words. But changing it into ...xl_data.find(cell.value)... gives no error. So, I suppose using strings as arguments in certain commands with xlrd has specific encoding problems.
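The error is consistent with Python 2's `str()` implicitly encoding a unicode value with the ASCII codec. A sketch of the same failure in Python 3 terms, where an explicit `.encode("ascii")` stands in for that implicit step; the cell content and surrounding text are made-up examples, and xlrd itself is not used:

```python
cell_value = "gro\u00dfe"            # hypothetical cell content: "große"
xl_data = "eine gro\u00dfe Tabelle"  # hypothetical text to search in

error = None
try:
    # Stand-in for what str(cell_value) attempted on Python 2.
    cell_value.encode("ascii")
except UnicodeEncodeError as exc:
    error = str(exc)

# Searching with the Unicode value directly needs no encoding step at all.
position = xl_data.find(cell_value)  # 5
```

Keeping both operands as Unicode strings avoids the implicit ASCII conversion, which matches the observation that passing `cell.value` unchanged works.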

Cameron · Accepted Answer · 2010-12-16 06:38:23Z



