Python utf8 encoding problem

Question 1

I'm working on a Python application and having some problems handling strings.

There is this string "She’s Out of My League" (without quotes). I stored it in a variable and tried to insert it into an sqlite3 database, but I get this error:

sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.

So, I tried to convert the string to unicode. I tried both of these:

new_str = unicode(old_str)
new_str = old_str.encode("utf8")

But this gives me another error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 49: unexpected code byte

I'm stuck here. What am I doing wrong?

Question 2

Try .decode instead of .encode.

Question 3

You want old_str.decode(encoding). You don't need to (in fact, you can't) encode it back to a bytestring for use with sqlite; sqlite requires unicode.
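A minimal sketch of that decode-then-insert flow. It is written for Python 3, where the bytestring is a bytes object and unicode is str, whereas the question's code is Python 2. The cp1252 encoding, the in-memory database, and the movies table are assumptions made for illustration only.

```python
import sqlite3

# Raw bytes as they might arrive from a Windows-1252 source
# (0x92 is the curly apostrophe in that encoding).
old_str = b"She\x92s Out of My League"

# Decode once, at the boundary. cp1252 is an assumption about the source.
new_str = old_str.decode("cp1252")

# Hypothetical in-memory table, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT)")
conn.execute("INSERT INTO movies (title) VALUES (?)", (new_str,))
conn.commit()

# Read the value back to confirm it round-trips as text.
stored_title = conn.execute("SELECT title FROM movies").fetchone()[0]
conn.close()
print(stored_title == new_str)
```

The decoded text is passed as a query parameter and is never encoded back to bytes by hand.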

Question 4

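Simple. You're assuming that it's UTF-8. Byte 0x92 is not valid UTF-8 there; in cp1252 (Windows-1252) it is the right single quotation mark (’).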

>>> print 'She\x92s Out of My League'.decode('cp1252')
She’s Out of My League
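One way to avoid hard-coding that guess is to try a short list of candidate encodings in order. The sketch below is in Python 3 syntax, and the helper name decode_with_fallback and its candidate list are hypothetical. A successful decode does not prove the guess is right, because cp1252 accepts almost any byte sequence.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes raw."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            # This candidate cannot represent the bytes; try the next one.
            continue
    raise ValueError("none of the candidate encodings could decode the input")


raw = b"She\x92s Out of My League"
text, used = decode_with_fallback(raw)
print(used)
```

UTF-8 goes first because it is strict: bytes that are not UTF-8 usually fail to decode, which lets the looser cp1252 act as the fallback.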

Question 5

So, will cp1252 work with all? I'm dealing with filenames here. Filenames both on Windows and Unix.

Question 6

Yeah, I get that. I want something that works with all the characters allowed in a filename. Which one do I choose?

Question 7

There isn't any single encoding you can use, unless you enforce the encoding of the input to your software. Have fun!
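Enforcing one encoding at the input boundary could look roughly like this. It is a sketch in Python 3 syntax; the function name read_title and the choice of UTF-8 as the required encoding are assumptions for illustration.

```python
def read_title(raw, encoding="utf-8"):
    """Decode raw bytes with the one agreed encoding, rejecting anything else."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        # Fail loudly instead of guessing at some other encoding.
        raise ValueError("input is not valid %s: %r" % (encoding, raw)) from exc


title = read_title("She\u2019s Out of My League".encode("utf-8"))
print(len(title))
```

Input in any other encoding, such as the cp1252 bytes from the question, is rejected with ValueError instead of being stored as mojibake.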

Question 8

sys.getfilesystemencoding() returns a guess about the filesystem encoding of the current system, and all path functions (e.g. os.path.join, os.listdir) return unicode (decoded using this guessed encoding) if you give them unicode arguments. Also, if you're using cp1252 on a Unix system, consider switching to utf8 to avoid bigger issues.
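The argument-type rule can be checked with a short sketch. It uses Python 3, where the two types are str and bytes rather than unicode and str; the temporary directory and the file name example.txt are hypothetical.

```python
import os
import sys
import tempfile

# The encoding Python uses to convert between text and byte file names.
print(sys.getfilesystemencoding())

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create one empty file so there is something to list.
    with open(os.path.join(tmp_dir, "example.txt"), "w"):
        pass

    text_names = os.listdir(tmp_dir)               # text path in, text names out
    byte_names = os.listdir(os.fsencode(tmp_dir))  # bytes path in, bytes names out

print(text_names, byte_names)
```

Passing text paths throughout lets Python apply the filesystem encoding for you, so the names come back already decoded.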

Question 9

Always use Unicode strings for filenames (and probably for everything else except raw byte arrays without textual interpretation). Then Unicode file names will be handled correctly for both Windows and Unix-like systems.

Ignacio Vazquez-Abrams · Accepted Answer · 2011-05-24 18:58:01Z

Comments on the accepted answer, in the order quoted above:

Shrihari · Comment · 2011-05-24 19:01:41Z
Shrihari · Comment · 2011-05-24 19:03:58Z
Ignacio Vazquez-Abrams · Comment · 2011-05-24 19:06:04Z
Rosh Oxymoron · Comment · 2011-05-24 20:09:52Z
Philipp · Comment · 2011-05-28 06:36:15Z
