[Python-Dev] PEP: Generalised String Coercion

Sat Aug 6 12:23:42 CEST 2005

The title is perhaps a little too grandiose but it's the best I
could think of. The change is really not large. Personally, I
would be happy enough if only %s was changed and the built-in was
not added. Please comment.
 Neil
PEP: 349
Title: Generalised String Coercion
Version: $Revision: 1.2 $
Last-Modified: $Date: 2005/08/06 04:05:48 $
Author: Neil Schemenauer <nas at arctrix.com>
Status: Draft
Type: Standards Track
Content-Type: text/plain
Created: 02-Aug-2005
Post-History: 06-Aug-2005
Python-Version: 2.5
Abstract
 This PEP proposes the introduction of a new built-in function,
 text(), that provides a way of generating a string representation
 of an object without forcing the result to be a particular string
 type. In addition, the behavior of the %s format specifier would be
 changed to call text() on the argument. These two changes would
 make it easier to write library code that can be used by
 applications that use only the str type and by others that also
 use the unicode type.
Rationale
 Python has had a Unicode string type for some time now but use of
 it is not yet widespread. There is a large amount of Python code
 that assumes that string data is represented as str instances.
 The long term plan for Python is to phase out the str type and use
 unicode for all string data. Clearly, a smooth migration path
 must be provided.
 Existing libraries, written for str instances, need to be upgraded
 so that they can operate in an all-unicode string world. We can't
 change to an all-unicode world until all essential libraries are
 capable of operating in it. Upgrading the libraries in one
 shot does not seem feasible. A more realistic strategy is to
 individually make the libraries capable of operating on unicode
 strings while preserving their current all-str environment
 behaviour.
 First, we need to be able to write code that can accept unicode
 instances without attempting to coerce them to str instances. Let
 us label such code as Unicode-safe. Unicode-safe libraries can be
 used in an all-unicode world.
 Second, we need to be able to write code that, when provided only
 str instances, will not create unicode results. Let us label such
 code as str-stable. Libraries that are str-stable can be used by
 libraries and applications that are not yet Unicode-safe.

 Sometimes it is simple to write code that is both str-stable and
 Unicode-safe. For example, the following function just works:
     def appendx(s):
         return s + 'x'
 That's not too surprising since the unicode type is designed to
 make the task easier. The principle is that when str and unicode
 instances meet, the result is a unicode instance. One notable
 difficulty arises when code requires a string representation of an
 object; an operation traditionally accomplished by using the str()
 built-in function.
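
 As a rough illustration of the two properties, the sketch below
 checks the result type of appendx() for each kind of argument. The
 result_type() helper and the compatibility shim are assumptions made
 for this example only; they are not part of the proposal.

```python
# Compatibility shim (an assumption of this sketch, not part of the
# PEP): Python 3 has no separate unicode type, so alias the name.
try:
    unicode
except NameError:
    unicode = str


def appendx(s):
    return s + 'x'


def result_type(func, arg):
    """Return the type of func(arg) so both properties can be checked."""
    return type(func(arg))


# str-stable: a str argument yields a str result.
str_result = result_type(appendx, 'abc')
# Unicode-safe: a unicode argument is accepted and yields unicode.
unicode_result = result_type(appendx, unicode('abc'))
```

 A function that called str(s) internally would fail the second
 check for non-ASCII input, and one that called unicode(s) would fail
 the first.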

 Using str() makes the code not Unicode-safe. Replacing a str()
 call with a unicode() call makes the code not str-stable. Using a
 string format almost accomplishes the goal but not quite.
 Consider the following code:
     def text(obj):
         return '%s' % obj
 It behaves as desired except if 'obj' is not a basestring instance
 and needs to return a Unicode representation of itself. In that
 case, the string format will attempt to coerce the result of
 __str__ to a str instance. Defining a __unicode__ method does not
 help since it will only be called if the format string (the
 left-hand operand) is a unicode instance. Using a unicode instance
 for the format string does not work because the function is no
 longer str-stable (i.e. it will coerce everything to unicode).
Specification
 A Python implementation of the text() built-in follows:
     def text(s):
         """Return a nice string representation of the object. The
         return value is a basestring instance.
         """
         if isinstance(s, basestring):
             return s
         r = s.__str__()
         if not isinstance(r, basestring):
             raise TypeError('__str__ returned non-string')
         return r

 Note that it is currently possible, although not very useful, to
 write __str__ methods that return unicode instances.
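
 A short usage sketch may make the intended behaviour more concrete.
 The Label and Broken classes are hypothetical, and the shim at the
 top is only an assumption that lets the example run on
 interpreters without basestring and unicode; text() itself is the
 function given above.

```python
# Shim for interpreters without basestring/unicode (an assumption of
# this sketch, not part of the proposal).
try:
    basestring, unicode
except NameError:
    basestring = unicode = str


def text(s):
    """Return a nice string representation of the object."""
    if isinstance(s, basestring):
        return s
    r = s.__str__()
    if not isinstance(r, basestring):
        raise TypeError('__str__ returned non-string')
    return r


class Label(object):
    """Hypothetical class whose __str__ returns a unicode instance."""
    def __init__(self, name):
        self.name = name

    def __str__(self):
        return u'Label: ' + self.name


class Broken(object):
    """Hypothetical class whose __str__ returns a non-string."""
    def __str__(self):
        return 42


same = 'abc'
# The unicode result of __str__ is passed through without coercion.
label_text = text(Label(u'caf\xe9'))

# A non-string result from __str__ is rejected.
try:
    text(Broken())
except TypeError as exc:
    error_message = str(exc)
```

 String arguments come back unchanged (text(same) is same), so the
 function never converts between str and unicode on its own.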
 The %s format specifier for str objects would be changed to call
 text() on the argument. Currently it calls str() unless the
 argument is a unicode instance (in which case the object is
 substituted as is and the % operation returns a unicode instance).
 The following function would be added to the C API and would be the
 equivalent of the text() function:
     PyObject *PyObject_Text(PyObject *o);
 A reference implementation is available on SourceForge [1] as a
 patch.

Backwards Compatibility
 The change to the %s format specifier would result in some %
 operations returning a unicode instance rather than raising a
 UnicodeDecodeError exception. It seems unlikely that the change
 would break currently working code.
Alternative Solutions
 Rather than adding the text() built-in, if PEP 246 were
 implemented then adapt(s, basestring) could be equivalent to
 text(s). The advantage would be one less built-in function. The
 problem is that PEP 246 is not implemented.
 Fredrik Lundh has suggested [2] that perhaps a new slot should be
 added (e.g. __text__), that could return any kind of string that's
 compatible with Python's text model. That seems like an
 attractive idea but many details would still need to be worked
 out.
 Instead of providing the text() built-in, the %s format specifier
 could be changed and a string format could be used instead of
 calling text(). However, it seems like the operation is important
 enough to justify a built-in.
 Instead of providing the text() built-in, the basestring type
 could be changed to provide the same functionality. That would
 possibly be confusing behaviour for an abstract base type.
 Some people have suggested [3] that an easier migration path would
 be to change the default encoding to be UTF-8. Code that is not
 Unicode-safe would then encode Unicode strings as UTF-8 and
 operate on them as str instances, rather than raising a
 UnicodeDecodeError exception. Other code would assume that str
 instances were encoded using UTF-8 and decode them if necessary.
 While that solution may work for some applications, it seems
 unsuitable as a general solution. For example, some applications
 get string data from many different sources and assuming that all
 str instances were encoded using UTF-8 could easily introduce
 subtle bugs.
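
 The kind of problem meant here can be sketched with two byte
 strings that represent the same text in different encodings. The
 variable names are illustrative only. The first case fails loudly
 when Latin-1 data is assumed to be UTF-8; the second shows that a
 mismatched assumption in the other direction can corrupt the text
 silently.

```python
# The same four-character text, encoded two different ways.
latin1_bytes = u'caf\xe9'.encode('latin-1')
utf8_bytes = u'caf\xe9'.encode('utf-8')

# Assuming UTF-8 for data that is really Latin-1 raises an error.
try:
    latin1_bytes.decode('utf-8')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

# Assuming Latin-1 for data that is really UTF-8 raises nothing but
# produces the wrong text (five characters instead of four).
mojibake = utf8_bytes.decode('latin-1')
```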
References
 [1] http://www.python.org/sf/1159501
 [2] http://mail.python.org/pipermail/python-dev/2004-September/048755.html
 [3] http://blog.ianbicking.org/illusive-setdefaultencoding.html
Copyright
 This document has been placed in the public domain.

Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
End: