The Model:
class ItemType(models.Model):
    name = models.CharField(max_length=100)

    def __unicode__(self):
        logger.debug("1. Item Type %s created" % self.name)
        return self.name
The code:
(...)
type = re.search(r"Type:(.*?)",text)
itemtype = ItemType.objects.create(name = name.group(1), defaults={'name':name.group(1)})
logger.debug("2. Item Type %s created" % name.group(1))
logger.debug("4. Item Type %s created" % itemtype.name)
logger.debug("3. Item Type %s created" % itemtype)
And the result is unexpected (to me, of course):
The first logger.debug prints Item Type ąęńłśóć created as expected, but the second raises an error:
DjangoUnicodeDecodeError: 'ascii' codec can't decode byte in position :
ordinal not in range(128).
You passed in <ItemType: [Bad Unicode data]> (<class 'aaa.models.ItemType'>)
Why is there an error, and how can I fix it?
(text is an HTML response with UTF-8 encoding.)
Update
I added debug logging to the model, and the output is:
2014-10-06 09:38:53,342 DEBUG views 2. Item Type ąęćńółśż created
2014-10-06 09:38:53,342 DEBUG views 4. Item Type ąęćńółśż created
2014-10-06 09:38:53,344 DEBUG models 1. Item Type ąęćńółśż created
2014-10-06 09:38:53,358 DEBUG models 1. Item Type ąęćńółśż created
So why can't debug 3 print it?
Update 2: the problem is here:
itemtype = ItemType.objects.create(name = name.group(1), defaults={'name':name.group(1)})
If I change it to
itemtype = ItemType.objects.create(name = name.group(1), defaults={'name':u'ĄĆĘŃŁÓŚ'})
everything works.
So how do I convert it to unicode? unicode(name.group(1)) doesn't work.
-
Which database are you using? Oracle? Also, could you try changing to logger.debug("1. Item Type %s created", self.name)? In loggers, avoid using '%'. – fragles, Oct 6, 2014 at 8:28
-
Changed to logger.debug(itemtype); same error. – Tomasz Brzezina, Oct 6, 2014 at 8:54
-
This error occurs when you pass a Unicode string containing non-English characters (Unicode characters beyond 128) to something that expects an ASCII bytestring. The default encoding for a Python bytestring is ASCII, "which handles exactly 128 (English) characters". This is why trying to convert Unicode characters beyond 128 produces the error. See saltycrane.com/blog/2008/11/… – fragles, Oct 6, 2014 at 9:05
-
Is your Postgres configured to accept Unicode? – fragles, Oct 6, 2014 at 9:29
-
Yes, Postgres is configured properly. But I can't agree that create expects ASCII: if I pass u"żółć" instead of the regex result, the model debug prints the expected value. I think I have to encode the result, but how? encode and unicode don't work. – Tomasz Brzezina, Oct 6, 2014 at 10:32
1 Answer
After two days of fighting with my own shadow I found a solution. It isn't a workaround for this one case but a wholesale change of thinking, and I have to refactor the whole codebase.
My assumption is that EVERY STRING is UNICODE. If it isn't, fix it.
- Do not use "%s" or "something"; ALWAYS use u"%s" and u"cośtam".
- In every model that has a models.CharField() or another text-oriented field, I override the save() method.
For example:
class ItemType(models.Model):
    name = models.CharField(max_length=100)

    def save(self, *args, **kwargs):
        # Python 2: a byte string (str) is decoded to unicode before saving.
        if isinstance(self.name, str):
            self.name = self.name.decode("utf-8")
        super(ItemType, self).save(*args, **kwargs)
Explanation: if the name somehow ends up as a str rather than unicode, CONVERT it to unicode.
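The same check could be factored into one small helper instead of being repeated in every save(). This is only a sketch: the name to_text is hypothetical (it is not a Django function), and it assumes that any byte string it receives is UTF-8 encoded.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a text (unicode) string, decoding byte strings.

    Hypothetical helper; assumes byte strings are encoded as `encoding`.
    """
    # `bytes` is an alias of `str` on Python 2.6+, so this check works on 2 and 3.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


raw = b"\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87"      # UTF-8 bytes of the sample word
decoded = to_text(raw)                          # byte string gets decoded
unchanged = to_text(u"\u017c\xf3\u0142\u0107")  # text passes through untouched
```

Inside save() the body would then reduce to self.name = to_text(self.name).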
How I found this:
I was wondering what type the text in a models.CharField has, and found that it keeps whatever you give it: fill it with unicode and it is unicode, fill it with str and it is str. So if you fill it "by hand" with unicode in one place and a regex fills it with str in another, the result is unexpected.
The biggest problem with unicode and str is that diacritics appear to work with both:
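As far as I understand it, the 'ascii' in the error message comes from Python 2 itself: whenever a str has to be turned into unicode without an explicit encoding, Python 2 decodes it with the ASCII codec. The sketch below spells out that implicit step so it behaves the same on Python 2 and 3; the variable names are illustrative only.

```python
# UTF-8 bytes, like a regex group taken from an undecoded HTML response.
raw = b"\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87"

# What Python 2 attempts implicitly when a str has to become unicode:
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True  # the first byte, 0xc5, is not in range(128)

# Decoding with the encoding the data really uses succeeds:
text = raw.decode("utf-8")
```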
>>> text_str = "żółć"
>>> text_uni = u"żółć"
>>> print text_str
żółć
>>> print text_uni
żółć
so you can't see the difference.
But if you evaluate the names directly in the interpreter instead of printing them:
>>> text_str
'\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87'
>>> text_uni
u'\u017c\xf3\u0142\u0107'
The difference is glaring.
If there were some setting to change the behaviour of print (and similar functions) to this:
>>> print text_str
'\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87'
>>> print text_uni
żółć
everything would be much easier to debug: if you can see the diacritics, it's OK; if not, it's bad.
Using decode('utf-8') led me to the solution:
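Short of such a setting, logging the repr() of a value gives a similar effect, because the escapes and the type make it obvious whether the value is bytes or text. A small sketch; describe is a hypothetical helper name, and the exact output differs between Python 2 and 3.

```python
def describe(value):
    """Return the type name and repr of value, e.g. for a debug log line."""
    return "%s %r" % (type(value).__name__, value)


bytes_line = describe(b"\xc5\xbc")  # raw bytes show up as \x.. escapes
text_line = describe(u"\u017c")     # a text string is reported with its own type
```

Something like logger.debug(describe(itemtype.name)) would then show at a glance which kind of string ended up in the field.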
>>> text_str
'\xc5\xbc\xc3\xb3\xc5\x82\xc4\x87'
>>> text_str.decode('utf-8')
u'\u017c\xf3\u0142\u0107'
>>> text_uni
u'\u017c\xf3\u0142\u0107'
VOILA!
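Applied to the original problem, one option is to decode the response once, right where it enters the program, so that every regex match is already unicode. This is a sketch under assumptions: the sample HTML, the pattern and the variable names are made up for illustration and are not the code from the question.

```python
import re

# Hypothetical UTF-8 encoded response body (bytes), standing in for `text`.
raw_html = b"<p>Type: \xc5\xbc\xc3\xb3\xc5\x82\xc4\x87</p>"

# Decode once at the boundary; everything after this line is unicode.
html = raw_html.decode("utf-8")

# A unicode pattern applied to unicode text yields unicode groups.
match = re.search(u"Type:\\s*([^<]+)", html)
item_name = match.group(1) if match else None
```

With the input decoded up front, the value passed to ItemType.objects.create() is already unicode and needs no further conversion.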