Python Encoding Issue

Question 1

I am really lost in all the encoding/decoding issues with Python. Having read quite a few docs about how to handle incoming data properly, I still have issues with a few languages, like Korean. Anyhow, here is what I am doing.

korean_text = korean_text.encode('utf-8', 'ignore')
korean_text = unicode(korean_text, 'utf-8')

I save the above data to database, which goes through fine.

Later, when I need to display the data, I fetch the content from the db and do the following:

korean_text = korean_text.encode( 'utf-8' )
print korean_text

And all I see is '???' echoed in the browser. Can someone please let me know the right way to save and display the above data?

Thanks

Question 2

Should the second 'encode' be a 'decode'?

Question 3

Do you have the necessary fonts installed?

Question 4

Did you declare your output to be encoded with UTF-8?

Question 5

It's difficult to help you with the information you gave us; e.g. we do not know where korean_text comes from, how the database stores it, etc. etc. Maybe you can try to create a self-contained example. (Perhaps you'll find the solution yourself this way...)

Question 6

Your first two lines of code appear to be encoding from unicode to UTF-8 and then decoding it back to unicode -- this is pointless. Where did you get the unicode from in the first place?

Question 7

Even having read some docs, you seem to be confused on how unicode works.

Unicode is not an encoding. Unicode is the absence of encodings.
utf-8 is not unicode. utf-8 is an encoding.
You decode utf-8 bytestrings to get unicode. You encode unicode using an encoding, say, utf-8, to get an encoded bytestring.
Only bytestrings can be saved to disk, database, or sent on a network, or printed on a printer, or screen. Unicode only exists inside your code.
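
A minimal sketch of the round trip described in the list above. The thread is about Python 2, where the two types are unicode and str; this sketch assumes Python 3, where the same roles are played by str (decoded text) and bytes (encoded bytestring). The sample text is only an illustration.

```python
# Assumes Python 3: `str` is decoded text, `bytes` is an encoded bytestring.
text = "한국어"                      # decoded text, lives only inside the program

encoded = text.encode("utf-8")      # encode: text -> bytestring
decoded = encoded.decode("utf-8")   # decode: bytestring -> text

# Each of these Hangul syllables takes three bytes in UTF-8.
print(type(encoded).__name__, len(encoded))
print(decoded == text)
```

The bytestring is what would go to disk, a database, or the network; the decoded text is what the rest of the code should work with.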

The good practice is to decode everything you get as early as possible, work with it decoded, as unicode, in all your code, and then encode it as late as possible, when the text is ready to leave your program, to screen, database or network.

Now for your problem:

If you have a text that came from the browser, say, from a form, then it is encoded. It is a bytestring. It is not unicode.

You must then decode it to get unicode. Decode it using the encoding the browser used to encode it. The correct encoding comes from the browser itself, in the appropriate HTTP request header.
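
As a rough illustration of decoding with the declared encoding, the sketch below reads the charset from a Content-Type header value and decodes the body strictly. It assumes Python 3; the helper names charset_from_content_type and decode_body, the sample header, and the UTF-8 fallback for headers that name no charset are all hypothetical choices for this example, not something from the thread.

```python
from email.message import Message


def charset_from_content_type(header_value, default="utf-8"):
    """Return the charset named in a Content-Type header value, or `default`."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


def decode_body(raw_body, content_type):
    """Decode a request body strictly, using the charset the client declared."""
    return raw_body.decode(charset_from_content_type(content_type))


# Hypothetical request: a body sent as EUC-KR with a matching header.
header = "text/plain; charset=EUC-KR"
body = "한국어".encode("euc-kr")
text = decode_body(body, header)
print(text)
```

A web framework that already hands you decoded text makes this step unnecessary.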

Don't use 'ignore' when decoding. Since the browser said which encoding it is using, you shouldn't get any errors. Using 'ignore' means you will hide a bug if there is one.
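
A small sketch of how 'ignore' can hide a bug, assuming Python 3. The damaged input is simulated here by dropping the last byte of a valid UTF-8 string; with 'ignore' the decode succeeds and quietly loses a character, while the default strict mode raises an error that points at the problem.

```python
good = "한국어".encode("utf-8")
damaged = good[:-1]  # simulate a bug upstream: the final byte was lost

# With 'ignore' there is no error; the incomplete last character is dropped.
silently_wrong = damaged.decode("utf-8", "ignore")

# With the default (strict) error handling the bug surfaces immediately.
try:
    damaged.decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(silently_wrong), strict_failed)
```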

Perhaps your web framework of choice already does that. I know that django, pylons, werkzeug, cherrypy all do that. In that case you already get unicode.

Now that you have a decoded unicode string, you can encode it using whatever encoding you like to store on the database. utf-8 is a good choice, since it can encode all unicode codepoints.

When you retrieve the data from the database, decode it using the same encoding you used to store it. Then encode it using the encoding you want to use on the page: the one declared in the html meta header <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>. If that encoding is the same as the one used in the previous step, you can skip the decode/re-encode, since the data is already encoded in utf-8.

If you see ??? then the data is being lost at one of the steps above. To know exactly which one, more information is needed.
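
One way literal question marks can appear (a possible cause offered as an assumption, not a diagnosis of this particular case) is when text is encoded into a charset that cannot represent it and the unencodable characters are replaced. A minimal sketch, assuming Python 3:

```python
text = "한국어"

# ASCII cannot represent Hangul; 'replace' substitutes '?' for each character.
lossy = text.encode("ascii", "replace")

# UTF-8 can represent every code point, so nothing is lost.
safe = text.encode("utf-8")

print(lossy)
print(safe.decode("utf-8") == text)
```

Once the text has been turned into question marks the original characters cannot be recovered, so the lossy step has to be found and fixed rather than undone later.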

Question 8

nosklo, thanks for your response. Here is what I am doing: I am fetching the RSS feed from the following URL, rixk.com, using feedparser. After going through your great explanation above, I checked back with the feedparser docs, and they say that each element value is returned as a Python Unicode string (with some exceptions - feedparser.org/docs/introduction.html). Now since the data is already Unicode, as per your explanation above, I should not really be encoding it early, and should instead work with the unicode string all along, until just before committing to the db. Is that right?

Question 9

"Now since the data is already Unicode, as per your explanation above, I should not really be encoding it early" -> Typo: you should have said "I should not really be decoding it early", since it is already decoded (unicode). And it is being decoded as early as possible, by feedparser itself! You should just encode it before sending it to the db.

Question 10

@Nosklo: no typo, his first line of code is ENcoding "it", which is what (as he says) he should not be doing. His "it" refers to the unicode objects he gets from feedparser. Your "it" slides from the unicode back to the str objects that feedparser gets off the wire, with resultant confusion. Your answer is brilliant; all you needed to say in response to his question "Is that right?" was "Yes" :-)

Question 11

@John Machin, you are right on that :) I will try the above suggestions and update.

Question 12

Okay. To add more info, I am using the Google AppEngine datastore to save the data. The incoming data goes into a field with datatype db.Text, which can take a Unicode value - code.google.com/appengine/docs/python/datastore/…. So I don't really need the encode-then-unicode step, and can instead just save the data directly. This works. However, when I get the data back from the datastore, do data = data.encode( 'utf-8' ), and print data, the output doesn't look Korean to me. Any idea what might be going on here?

Question 13

Read through this post about handling Unicode in Python.

You basically want to be doing these things:

.encode() text to a particular encoding (such as utf-8) before sending it to the database.
.decode() text back to unicode (from your encoding) when reading it from the database
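
The two steps above can be sketched as follows, assuming Python 3 and using an in-memory sqlite3 database purely as a stand-in for whatever database is actually in use. The table and column names are hypothetical.

```python
import sqlite3

# In-memory database used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (body BLOB)")

text = "한국어"

# Encode just before the text leaves the program for the database.
conn.execute("INSERT INTO posts (body) VALUES (?)", (text.encode("utf-8"),))

# Decode with the same encoding right after reading it back.
(stored,) = conn.execute("SELECT body FROM posts").fetchone()
restored = stored.decode("utf-8")
conn.close()

print(restored == text)
```

If the database layer accepts and returns unicode text directly, as the asker reports for db.Text, the explicit encode and decode calls are handled by that layer instead.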

Question 14

The problem is most certainly (especially if other non-ASCII characters appear to work fine) that your browser or OS doesn't have the right fonts to display Korean text, or that the default font used by your browser doesn't support Korean. Try to choose another font until it works.

Accepted Answer (the answer shown under Question 7 above) · nosklo · 2010-01-05 13:45:18Z

