I have this text in a file: Recuérdame (notice it's a French word). When I read this file with a Python script, I get this text as Recu&#233;rdame.
I read it as a unicode string. Do I need to find out what the encoding of the text is and decode it? Or is my terminal playing tricks on me?
- effbot might be able to help you: effbot.org/zone/unicode-objects.htm (William, Dec 16, 2010 at 6:37)
- Possible duplicate of "Convert XML/HTML Entities into Unicode String in Python" (Josh Lee, Dec 16, 2010 at 6:37)
- Actually, I think this is Spanish (never heard this in French, anyway). (Cameron, Dec 16, 2010 at 6:55)
4 Answers
Yes, you need to know the encoding of the text file to turn it into a unicode string (from the bytes that make up the file).
For example, if you know the encoding is UTF-8:
with open('foo.txt', 'rb') as f:
    contents = f.read().decode('utf-8-sig')  # -sig takes care of BOM if present
The text in your file seems not to be encoded Unicode, however; the accented character is apparently stored as an XML entity, which will have to be converted manually (tip of the hat to jleedev for the link).
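As a minimal sketch of both steps together, assuming Python 3 (where the standard library provides `html.unescape`) and assuming the file holds a UTF-8 BOM followed by the text with a numeric entity; the bytes literal below is a hypothetical stand-in for the contents of foo.txt:

```python
import html

# Hypothetical file contents: UTF-8 BOM followed by text containing an entity.
raw = b"\xef\xbb\xbfRecu&#233;rdame"

# Step 1: bytes -> text. 'utf-8-sig' drops the BOM if one is present.
text = raw.decode("utf-8-sig")

# Step 2: replace HTML/XML character references with the characters they name.
decoded = html.unescape(text)

print(decoded)
```

Decoding the bytes and unescaping the entity are separate concerns: the first depends on the file's encoding, the second does not.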
-sig removes the BOM if it's present, so you don't get the marker appearing as part of your unicode string.
It is not a Unicode string. It's a string in whatever encoding it is encoded in; hence it's a UTF-8, a Latin-1, or some other string. In this case, &#233; is an HTML/XML entity representing é, specifically. It's an encoding used in HTML and XML to encode non-ASCII data.
To decode that into Unicode, look at Fredrik Lundh's method: http://effbot.org/zone/re-sub.htm#unescape-html
It is HTML, and this construct is called an "entity". You can use
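import re

def entity_decode(match):
    # Group 2 is "x" for hexadecimal references (&#xE9;), empty for decimal ones (&#233;)
    _, is_hex, entity = match.groups()
    base = 16 if is_hex else 10
    return chr(int(entity, base))  # use unichr() instead of chr() on Python 2

print(re.sub("(?i)(&#(x?)([^;]+);)",
             entity_decode,
             "Recu&#233;rdame"))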
to decode all entities.
Edit: Yes, they are of course not Latin-1; it should now work with all entities.
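The pattern above only matches numeric references such as &#233; or &#xE9;. If named entities such as &eacute; may also appear, one possible extension looks like the sketch below. It assumes Python 3 and uses the standard library table `html.entities.name2codepoint`; the function name `decode_entities` is made up for illustration and is not part of the answer above.

```python
import re
from html.entities import name2codepoint  # maps e.g. "eacute" -> 233


def decode_entities(text):
    """Replace numeric (&#233;, &#xE9;) and named (&eacute;) references."""

    def replace(match):
        body = match.group(1)
        try:
            if body[:2].lower() == "#x":      # hexadecimal reference
                return chr(int(body[2:], 16))
            if body.startswith("#"):          # decimal reference
                return chr(int(body[1:], 10))
            return chr(name2codepoint[body])  # named entity
        except (KeyError, ValueError, OverflowError):
            # Unknown name or invalid number: leave the text untouched.
            return match.group(0)

    return re.sub(r"&(#?\w+);", replace, text)


print(decode_entities("Recu&#233;rdame, Recu&#xE9;rdame, Recu&eacute;rdame"))
```

Leaving unrecognised references unchanged is a deliberate choice here, so that stray ampersands in ordinary text are not mangled.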
Working with xlrd, I have a line containing ...xl_data.find(str(cell_value))... which gives the error: "'ascii' codec can't encode character u'\xdf' in position 3: ordinal not in range(128)". None of the suggestions in the forums worked for my German words. But changing it to ...xl_data.find(cell.value)... gives no error. So I suppose that converting values with str() before passing them as arguments to certain commands with xlrd causes specific encoding problems.
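A likely explanation, offered as an assumption rather than something confirmed in this thread: on Python 2, calling str() on a unicode object implicitly encodes it with the ASCII codec, which fails on ß (u'\xdf') regardless of xlrd. The sketch below reproduces the same kind of error explicitly on Python 3; the value "Gruß" is a hypothetical cell content chosen so that ß sits at position 3.

```python
# Hypothetical cell text "Gruß"; \xdf is the character ß.
cell_value = "Gru\xdf"

try:
    # Roughly what Python 2's str(u"Gru\xdf") does behind the scenes.
    cell_value.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc

print(error)
```

Passing the text value through unchanged, as in xl_data.find(cell.value), avoids the implicit ASCII step altogether.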