[Python-Dev] Type-converting functions, esp. unicode() vs. unistr()

18 Jan 2001 02:14:19 -0800 (PST)

I hope you don't mind that i'm taking this over to python-dev,
because it led me to discover a more general issue (see below).
For the others on python-dev, here's the background: MAL was
about to check in the unistr() function, described as follows:
> This patch adds a utility function unistr() which works just like
> the standard builtin str() -- only that the return value will
> always be a Unicode object.
>
> The patch also adds a new object level C API PyObject_Unicode()
> which complements PyObject_Str().

I responded:
> Why are unistr() and unicode() two separate functions?
>
> str() performs one task: convert to string. It can convert anything,
> including strings or Unicode strings, numbers, instances, etc.
>
> The other type-named functions e.g. int(), long(), float(), list(),
> tuple() are similar in intent.
>
> Why have unicode() just for converting strings to Unicode strings,
> and unistr() for converting everything else to a Unicode string?
> What does unistr(x) do differently from unicode(x) if x is a string?

MAL responded:
> unistr() is meant to complement str() very closely. unicode()
> works as constructor for Unicode objects which can also take
> care of decoding encoded data. str() and unistr() don't provide
> this capability but instead always assume the default encoding.
>
> There's also a subtle difference in that str() and unistr()
> try the tp_str slot which unicode() doesn't. unicode()
> supports any character buffer which str() and unistr() don't.

Okay, given this explanation, i still feel fairly confident
that unicode() should subsume unistr(). Many of the other
type-named functions try various slots:
 int() looks for __int__
 float() looks for __float__
 long() looks for __long__
 str() looks for __str__
In testing this i also discovered the following:
 >>> class Foo:
 ...     def __int__(self):
 ...         return 3
 ...
 >>> f = Foo()
 >>> int(f)
 3
 >>> long(f)
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 AttributeError: Foo instance has no attribute '__long__'
 >>> float(f)
 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 AttributeError: Foo instance has no attribute '__float__'
This is kind of surprising: long() and float() don't fall back
to __int__. How about:
 int() looks for __int__
 float() looks for __float__, then tries __int__
 long() looks for __long__, then tries __int__
 str() looks for __str__
 unicode() looks for __unicode__, then tries __str__
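
The lookup order proposed above might be sketched like this. It is only
an illustration written as plain helper functions, not a patch to the
built-ins: the names to_float, to_unicode and _first_method are made up
here, long() is left out because it would follow the same pattern as
float(), and the code assumes a current Python 3 interpreter, where str
stands in for unicode.

```python
def _first_method(obj, names):
    """Return a caller for the first special method in `names` that
    obj's type defines, or None if it defines none of them."""
    for name in names:
        method = getattr(type(obj), name, None)
        if method is not None:
            return lambda: method(obj)
    return None


def to_float(obj):
    """Hypothetical float(): look for __float__, then try __int__."""
    call = _first_method(obj, ("__float__", "__int__"))
    if call is None:
        raise TypeError("object can't be converted to float")
    return float(call())


def to_unicode(obj):
    """Hypothetical unicode(): look for __unicode__, then try __str__."""
    # Every object has __str__, so a method is always found here.
    call = _first_method(obj, ("__unicode__", "__str__"))
    return str(call())


class Foo:
    def __int__(self):
        return 3


print(to_float(Foo()))    # 3.0
```

With this order, an object that only defines __int__ still converts
through the wider numeric functions instead of raising an error.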
The extra parameter to unicode() is very similar to the extra
parameter to int(), so i think there is a natural parallel here.
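
To make the parallel concrete, here is a small sketch. It assumes a
current Python 3 interpreter, where str(bytes, encoding) plays the role
that unicode(s, encoding) has in this message; the variable names are
only for illustration.

```python
# int() takes an optional base that says how to read the digits.
number = int("ff", 16)
print(number)             # 255

# The extra argument here says how to decode the encoded bytes.
# (Python 3's str stands in for unicode(); this is an assumption
# of the sketch, not the function discussed in the message.)
encoded = b"caf\xc3\xa9"  # five bytes of UTF-8
text = str(encoded, "utf-8")
print(len(encoded), len(text))   # 5 4
```

In both cases the first argument is raw data and the second tells the
constructor how to interpret it.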
Hmm... what about the other types?
Wow!! __complex__ can produce a segfault!
 >>> complex
 <built-in function complex>
 >>> class Foo:
 ...     def __complex__(self): return 3
 ... 
 >>> Foo()
 <__main__.Foo instance at 0x81e8684>
 >>> f = _
 >>> complex(f)
 Segmentation fault (core dumped)
This happens because builtin_complex first retrieves and saves
the PyNumberMethods of the argument (in this case, from the
instance), then tries to call __complex__ (in this case, returning 3),
and THEN coerces the result using nbr->nb_float if the result is
not complex! (This calls the instance's nb_float method on the
integer object 3!!)
I think complex() should probably look for __complex__, then
__float__, then __int__.
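
A rough sketch of that order, written as a plain Python helper rather
than as a fix to builtin_complex itself. The name to_complex is
hypothetical, and the code assumes a current Python 3 interpreter. The
point it tries to show is that a non-complex result gets coerced through
the result's own conversion, not through methods saved from the original
argument.

```python
def to_complex(obj):
    """Hypothetical complex(): try __complex__, then __float__, then __int__."""
    for name in ("__complex__", "__float__", "__int__"):
        method = getattr(type(obj), name, None)
        if method is None:
            continue
        result = method(obj)
        if isinstance(result, complex):
            return result
        # Coerce using the *result's* own conversion, never the
        # number methods looked up on the original argument.
        return complex(float(result))
    raise TypeError("object can't be converted to complex")


class Foo:
    def __complex__(self):
        return 3   # not a complex, as in the crashing session above


print(to_complex(Foo()))   # (3+0j)
```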
One could argue for __list__, __tuple__, or __dict__, but that
seems much weaker; the Pythonic way has always been to implement
__getitem__ instead. There is no built-in dict(); if it existed
i suppose it would do the opposite of x.items(); again a weak
argument, though i might have found such a function useful once
or twice.
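
For what it's worth, the "opposite of x.items()" idea could look roughly
like this. The function name dict_from_items is made up for the sketch
and is not a proposal for the actual spelling.

```python
def dict_from_items(items):
    """Hypothetical dict(): build a dictionary from (key, value) pairs."""
    result = {}
    for key, value in items:
        result[key] = value
    return result


d = {"a": 1, "b": 2}
# Round trip: items() takes the dictionary apart, the helper rebuilds it.
print(dict_from_items(d.items()) == d)   # True
```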
And that about covers the built-in types for data.
-- ?!ng