DjangoUnicodeDecodeError: [Bad Unicode data]

Question 1

The Model:

class ItemType(models.Model):
    name = models.CharField(max_length=100)

    def __unicode__(self):
        logger.debug("1. Item Type %s created" % self.name)
        return self.name

The code:

 (...)
 name = re.search(r"Type:(.*?)", text)
 itemtype = ItemType.objects.create(name = name.group(1), defaults={'name':name.group(1)})
 logger.debug("2. Item Type %s created" % name.group(1))
 logger.debug("4. Item Type %s created" % itemtype.name)
 logger.debug("3. Item Type %s created" % itemtype)

And the result is unexpected (to me of course):

The first logger.debug prints "Item Type ąęńłśóć created" as expected, but the second raises an error:

DjangoUnicodeDecodeError: 'ascii' codec can't decode byte in position : 
ordinal not in range(128). 
You passed in <ItemType: [Bad Unicode data]> (<class 'aaa.models.ItemType'>)

Why is there an error, and how can I fix it?

(text is html response with utf-8 encoding)

UPDATE

I added a debug call to the model, and the debug output is:

2014-10-06 09:38:53,342 DEBUG views 2. Item Type ąęćńółśż created
2014-10-06 09:38:53,342 DEBUG views 4. Item Type ąęćńółśż created
2014-10-06 09:38:53,344 DEBUG models 1. Item Type ąęćńółśż created
2014-10-06 09:38:53,358 DEBUG models 1. Item Type ąęćńółśż created

So why can't debug 3 print it?

UPDATE 2 The problem is here:

 itemtype = ItemType.objects.create(name = name.group(1), defaults={'name':name.group(1)})

If I change it to

 itemtype = ItemType.objects.create(name = name.group(1), defaults={'name':u'ĄĆĘŃŁÓŚ'})

everything is OK.

So how do I convert the regex result to unicode? unicode(name.group(1)) doesn't work.

Question 2

Which database are you using? Oracle? Also could you try changing to logger.debug("1. Item Type %s created", self.name). In loggers avoid using '%'.

Question 3

I changed it to logger.debug(itemtype) and got the same error.

Question 4

This error occurs when you pass a string containing non-English characters (code points beyond 128) to something that expects an ASCII bytestring. The default encoding for a Python bytestring is ASCII, "which handles exactly 128 (English) characters". This is why trying to convert characters beyond 128 produces the error. See saltycrane.com/blog/2008/11/…

Question 5

Is your Postgres configured to accept Unicode?

Question 6

Yes, Postgres is configured properly. But I can't agree that create expects ASCII: if I pass u"żółć" instead of the regex result, the model's debug line prints the expected value. I think I have to convert the regex result, but how? Neither encode nor unicode works.

Question 7

After two days of fighting with my own shadow I found a solution. It isn't a workaround for this one case but a complete change of thinking, and I have to refactor the whole code.

My assumption is that EVERY STRING is UNICODE. If it isn't, fix it.
Do not use "%s" or "something"; ALWAYS use u"%s" and u"cośtam".
In every model that has a models.CharField() or another text-oriented field, I override the save() method.

For example:

class ItemType(models.Model):
    name = models.CharField(max_length=100)

    def save(self, *args, **kwargs):
        # Python 2: if a byte string (str) slipped in, decode it to unicode first.
        if isinstance(self.name, str):
            self.name = self.name.decode("utf-8")
        super(ItemType, self).save(*args, **kwargs)

Explanation: if the name somehow gets filled with str instead of unicode, CHANGE it into unicode.

How I found this:

I was wondering what type the text in a models.CharField has, and found that the field keeps whatever you assign: if you fill it with unicode, it is unicode; if you fill it with str, it is str. So if you fill it by hand with unicode in one place and a regex fills it with str in another, the result is unexpected.
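A minimal sketch of where the str can come from, assuming the regex runs on the still-encoded response body. It is written in Python 3 terms, where bytes and str play the roles of Python 2's str and unicode. The names raw_html and text_html and the pattern are made-up stand-ins, not the code from the question.

```python
import re

# Hypothetical stand-in for the raw HTTP body: "Type: żółć" encoded as UTF-8.
raw_html = u"Type: \u017c\xf3\u0142\u0107".encode("utf-8")

# Searching a byte string yields byte-string groups.
byte_name = re.search(br"Type:\s*(.+)", raw_html).group(1)

# Decoding once, before the regex, yields text groups instead.
text_html = raw_html.decode("utf-8")
text_name = re.search(r"Type:\s*(.+)", text_html).group(1)

print(type(byte_name).__name__)  # the byte-string type
print(type(text_name).__name__)  # the text type
```

Decoding the response once at the boundary means every value extracted from it afterwards is already text, so nothing downstream has to guess.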

The biggest problem with unicode and str is that diacritics appear to work fine with both:

>>> text_str = "żółć"
>>> text_uni = u"żółć"
>>> print text_str
żółć
>>> print text_uni
żółć

so you can't see the difference.

But if you ask for the repr instead:

>>> text_str
'\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87'
>>> text_uni
u'\u017c\xf3\u0142\u0107'

The difference is glaring.

If there were some setting to change the behaviour of print (and similar functions) to this:

>>> print text_str
'\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87'
>>> print text_uni
żółć

everything would be much easier to debug: if you can see the diacritics it's OK, if not, it's bad.

Using decode('utf-8') led me to the solution:
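Formatting with %r (the repr) instead of %s while debugging gets close to that without any setting. This is a sketch in Python 3 terms, and the variable names are only illustrative. On Python 2 both values come out escaped, but the u prefix still tells them apart.

```python
# UTF-8 bytes for "żółć" and the decoded text.
raw = b"\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87"
text = raw.decode("utf-8")

# %r formats the repr, which exposes the type instead of hiding it.
raw_repr = "%r" % (raw,)
text_repr = "%r" % (text,)

print(raw_repr)   # escaped bytes with a b prefix
print(text_repr)  # readable text in quotes
```

The same idea applies to logging calls: a %r placeholder in the message shows whether the value is bytes or text.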

>>> text_str
'\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87'
>>> text_str.decode('utf-8')
u'\u017c\xf3\u0142\u0107'
>>> text_uni
u'\u017c\xf3\u0142\u0107'

VOILA!
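The same fix can be pulled into one small helper, applied wherever external data enters the program. This is a sketch under assumptions: to_text is a hypothetical name, UTF-8 is assumed as the encoding, and the check uses bytes because that name means str on Python 2.7 and bytes on Python 3. It also shows why unicode(name.group(1)) failed: without an encoding argument, Python 2 falls back to ASCII.

```python
def to_text(value, encoding="utf-8"):
    """Return text, decoding byte strings explicitly (hypothetical helper)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value  # already text, leave it alone


raw = b"\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87"  # UTF-8 bytes for "żółć"
decoded = to_text(raw)
already_text = to_text(decoded)  # a second call is a no-op

# Decoding the same bytes as ASCII fails, like the implicit conversion did.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(decoded == already_text)
print(ascii_ok)
```

Calling the helper on the regex result before passing it to the model would make the save() override a safety net rather than the only line of defence.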

Tomasz Brzezina · Accepted Answer · 2014-10-06 18:37:25Z
