0

I have this text in a file: Recuérdame (notice it's a French word). When I read this file with a Python script, I get this text as Recu&#xE9;rdame.

I read it as a unicode string. Do I need to find out what the encoding of the text is and decode it? Or is my terminal playing tricks on me?

asked Dec 16, 2010 at 6:34
3

4 Answers

5

Yes, you need to know the encoding of the text file to turn it into a unicode string (from the bytes that make up the file).

For example, if you know the encoding is UTF-8:

with open('foo.txt', 'rb') as f:
    contents = f.read().decode('utf-8-sig')  # -sig takes care of BOM if present

The text in your file seems not to be encoded Unicode, however; the accented character is apparently stored as an XML entity, which will have to be converted manually (tip of the hat to jleedev for the link).
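As a rough illustration of that conversion, here is a minimal sketch assuming Python 3.4 or later, where the standard library provides html.unescape. This is not the method behind the link mentioned above, just one way to do it; the sample string is the one from the question.

```python
import html

# Sample text as it appears when read from the file (from the question).
raw = "Recu&#xE9;rdame"

# html.unescape handles numeric references (&#233;, &#xE9;)
# as well as named ones (&eacute;).
text = html.unescape(raw)
print(text)
```

On Python 2 this function does not exist, so a hand-written replacement along the lines of the other answers would be needed there.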

answered Dec 16, 2010 at 6:38

2 Comments

what's BOM (in context of -sig) ?
@MovieYoda: Ah, check out this article. Basically, when it takes multiple bytes together to represent a single character (as in UTF-16 or UTF-32), those bytes could be interpreted in a different order than intended (this order is called endianness). Because of this, a special unambiguous mark, the byte order mark (BOM), is placed at the beginning of the file to indicate the endianness of the file. UTF-8 has a fixed byte order, so there the mark is optional and only acts as a signature. -sig removes the BOM if it's present so you don't get the marker appearing as part of your unicode string.
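To make the effect of -sig concrete, here is a small sketch in Python 3. It builds the bytes in memory instead of reading a real file, and the variable names are only illustrative.

```python
import codecs

# Bytes of a UTF-8 file that starts with a BOM (EF BB BF).
data = codecs.BOM_UTF8 + "Recu\u00e9rdame".encode("utf-8")

plain = data.decode("utf-8")      # BOM survives as U+FEFF at index 0
clean = data.decode("utf-8-sig")  # BOM is stripped if present

print(repr(plain[0]), len(plain), len(clean))
```

The utf-8-sig codec also decodes input without a BOM, so it is a safe default when you are unsure whether the file has one.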
1

It is not a Unicode string. It's a string in whatever encoding it is encoded in, so it's a UTF-8 string, a Latin-1 string, or something else. In this case, &#xE9; is an HTML/XML entity (a numeric character reference) representing é specifically. It's an escape used in HTML and XML to encode non-ASCII data.

To decode that into Unicode, look at Fredrik Lundh's method: http://effbot.org/zone/re-sub.htm#unescape-html
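The difference between Unicode text and an encoded string can be seen in a short sketch. It assumes Python 3, where str holds Unicode text and bytes holds encoded data (in Python 2 the corresponding types are unicode and str).

```python
text = "Recu\u00e9rdame"  # Unicode text: 10 code points

utf8_bytes = text.encode("utf-8")      # é becomes two bytes: C3 A9
latin1_bytes = text.encode("latin-1")  # é becomes one byte: E9

# Decoding with the wrong codec raises no error here;
# it silently produces mojibake instead.
wrong = utf8_bytes.decode("latin-1")

print(utf8_bytes, latin1_bytes, wrong)
```

The same text gives different byte sequences depending on the encoding, which is why the encoding has to be known before the bytes can be decoded.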

answered Dec 16, 2010 at 6:46

2 Comments

Yes and no. It represents a numeric code point. You can’t say it’s an escaped UTF-8 character. It may be a Unicode character, but that’s something different.
Sure, all characters that exist in the set of Unicode characters are Unicode characters, of course. But with that definition, anything that can be decoded into Unicode is a Unicode string, including ASCII strings, and then the term "Unicode string" loses all meaning. A Unicode string is a string of Unicode data, and in Python, that's something held in a Unicode object. Anything that is encoded should not be called a Unicode string; it just confuses people.
0

It is HTML, and this construct is called an "entity". You can use

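import re

def entity_decode(match):
    # Groups: the whole reference, the optional hex marker "x", and the digits.
    _, is_hex, entity = match.groups()
    base = 16 if is_hex else 10
    return unichr(int(entity, base))  # Python 2; use chr() on Python 3

print re.sub("(?i)(&#(x?)([^;]+);)",
             entity_decode,
             "Recu&#xE9;rdame")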

to decode all numeric entities.

Edit: Yes, they are of course not Latin-1; now it should work with all numeric entities.

answered Dec 16, 2010 at 6:46

2 Comments

No, there are entities that are not Latin-1, such as Α, a Greek alpha. They are UCS-2, which is two bytes and quite tricky to combine with your technique.
It was a problem with your Latin-1 decoding technique, yes. Now you are using unichr, which works with numeric entities. It still does not work with named entities, however. And once you add that, your code will be the same as effbot's code, which everyone else links to already. :-)
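For illustration, a sketch of what adding named-entity support might look like. It is written for Python 3 (chr instead of unichr) and uses the standard html.entities.name2codepoint table. The function names are made up for this example, and this is not effbot's code.

```python
import re
from html.entities import name2codepoint

def entity_decode(match):
    body = match.group(1)
    if body.startswith("#"):
        # Numeric reference: hexadecimal (&#xE9;) or decimal (&#233;).
        if body[1:2] in ("x", "X"):
            return chr(int(body[2:], 16))
        return chr(int(body[1:]))
    # Named reference (&eacute;): look it up, leave unknown names untouched.
    codepoint = name2codepoint.get(body)
    return chr(codepoint) if codepoint is not None else match.group(0)

def unescape(text):
    # Matches &#xHEX;, &#DEC; and &name; references.
    return re.sub(r"&(#[xX][0-9a-fA-F]+|#[0-9]+|\w+);", entity_decode, text)

print(unescape("Recu&#xE9;rdame &Alpha;"))
```

Unknown names are returned unchanged rather than dropped, so text that only looks like an entity is not damaged.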
0

Working with xlrd, I have a line ...xl_data.find(str(cell_value))... which gives the error: "'ascii' codec can't encode character u'\xdf' in position 3: ordinal not in range(128)". None of the suggestions in the forums worked for my German words. But changing it to ...xl_data.find(cell.value)... gives no error. So I suppose that passing str(...) results as arguments to certain calls when using xlrd runs into specific encoding problems.
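A likely explanation is that in Python 2, calling str() on a unicode object implicitly encodes it with the ASCII codec, which fails for ß (u'\xdf'). The sketch below reproduces the same kind of error in Python 3, where the encoding step has to be written out. The value is a hypothetical cell content, not data from the original spreadsheet.

```python
value = "Stra\u00dfe"  # hypothetical cell value containing ß (U+00DF)

error = None
try:
    # Roughly what Python 2's str(unicode_value) does behind the scenes.
    value.encode("ascii")
except UnicodeEncodeError as exc:
    error = str(exc)

# Encoding explicitly with a codec that covers ß works.
encoded = value.encode("utf-8")

print(error)
print(encoded)
```

Keeping the value as Unicode text, as the corrected call does, avoids the implicit ASCII step altogether.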

answered Dec 23, 2012 at 15:32
