[Python-checkins] peps: Added unicode literals PEP

Sat Feb 25 19:53:04 CET 2012

http://hg.python.org/peps/rev/506fb28ce80f
changeset: 4082:506fb28ce80f
user: Armin Ronacher <armin.ronacher at active-4.com>
date: Sat Feb 25 18:52:45 2012 +0000
summary:
 Added unicode literals PEP
files:
 pep-0414.txt | 256 +++++++++++++++++++++++++++++++++++++++
 1 files changed, 256 insertions(+), 0 deletions(-)

diff --git a/pep-0414.txt b/pep-0414.txt
new file mode 100644
--- /dev/null
+++ b/pep-0414.txt
@@ -0,0 +1,256 @@
+PEP: 414
+Title: Explicit Unicode Literal for Python 3.3
+Version: $Revision$
+Last-Modified: $Date$
+Author: Armin Ronacher <armin.ronacher at active-4.com>
+Status: Draft
+Type: Standards Track
+Content-Type: text/x-rst
+Created: 15-Feb-2012
+
+
+Abstract
+========
+
+This document proposes the reintegration of an explicit unicode literal
+from Python 2.x to the Python 3.x language specification, in order to
+enable side-by-side support of libraries for both Python 2 and Python 3
+without the need for an explicit 2to3 run.
+
+
+Rationale and Goals
+===================
+
+Python 3 is a major new revision of the language, and it was decided very
+early on that breaking backwards compatibility was part of the design. The
+migration from a Python 2.x to a Python 3 codebase is to be accomplished
+with the aid of a separate translation tool, 2to3, that converts Python
+2.x source code to Python 3 syntax. With more and more libraries supporting
+Python 3, however, it has become clear that 2to3 as a tool is
+insufficient, and people are now attempting to find ways to make the same
+source work in both Python 2.x and Python 3.x, with varying levels of
+success.
+
+Python 2.6 and Python 2.7 support syntax features from Python 3 which for
+the most part make a unified code base possible. Many thought that the
+``unicode_literals`` future import might make a common source possible,
+but it turns out that it's doing more harm than good.
+
+With the design of the updated WSGI specification a few new terms for
+strings were loosely defined: unicode strings, byte strings and native
+strings. In Python 3 the native string type is unicode; in Python 2 the
+native string type is a bytestring. These native string types are used in
+a couple of places. The native string type can be interned and is
+preferably used for identifier names, filenames, source code and a few
+other low level interpreter operations such as the return value of a
+``__repr__`` or exception messages.
+
+In Python 2.7 these string types can be defined explicitly. Without any
+future imports ``b'foo'`` means bytestring, ``u'foo'`` declares a unicode
+string and ``'foo'`` a native string which in Python 2.x means bytes.
+With the ``unicode_literals`` import the native string type is no longer
+available and has to be incorrectly labeled as bytestring. If such a
+codebase is then used in Python 3, the interpreter will start using byte
+objects in places where they are no longer accepted (such as identifiers).
+This can be solved by a module that detects 2.x and 3.x and provides
+wrapper functions that transcode literals at runtime. Unfortunately, this
+has the side effect of slowing down the runtime performance of Python and
+making for less beautiful code. Considering that Python 2 and Python 3
+support for most libraries will have to continue side by side for several
+more years to come, this means that such modules lose one of Python's key
+properties: easily readable and understandable code.
+
+Additionally, the vast majority of people who maintain Python 2.x
+codebases are more familiar with Python 2.x semantics, and a per-file
+difference in literal meanings will be very annoying for them in the long
+run. A quick poll on Twitter about the use of the division future import
+supported my suspicions that people opt out of behaviour-changing future
+imports because they are a maintenance burden. Every time you review code
+you have to check the top of the file to see if the behaviour was changed.
+Obviously that was an unscientific informal poll, but it might be
+something worth considering.
+
+Proposed Solution
+=================
+
+The idea is to support (with Python 3.3) an explicit ``u`` and ``U``
+prefix for native strings in addition to the prefix-less variants. The
+prefixed forms would stick around for the entirety of the Python 3
+lifetime but might at some point yield deprecation warnings if deemed
+appropriate. Such warnings could be something for pyflakes or other
+similar tools to support.
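+
+As a rough illustration of the intended semantics, the sketch below
+assumes an interpreter that implements this proposal (Python 3.3 or
+later). The variable names are purely illustrative.

```python
# Assumption: the interpreter accepts the u/U prefix (Python 3.3+).
plain = 'caf\xe9'
lower = u'caf\xe9'
upper = U'caf\xe9'

# All three spellings yield the same native (unicode) string type and value.
same_type = type(plain) is type(lower) is type(upper) is str
same_value = plain == lower == upper

# Byte strings remain a distinct type with their own prefix.
raw = b'caf\xc3\xa9'
round_trip = raw.decode('utf-8') == lower
```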
+
+Python 3.2 and earlier
+======================
+
+An argument against this proposal was made on the Python-Dev mailing list,
+mentioning that Ubuntu LTS will ship Python 3.2 and 2.7 for only 5 years.
+The counterargument is that Python 2.7 is currently the Python version of
+choice for users who want LTS support. As it stands, Python 3 is
+currently a bad choice for long-term investments, since the ecosystem is
+not yet properly developed, and libraries are still fighting with their
+API decisions for Python 3.
+
+A valid point is that this would encourage people to become dependent on
+Python 3.3 for their ports. Fortunately that is not a big problem since
+that could be fixed at installation time similar to how many projects are
+currently invoking 2to3 as part of their installation process.
+
+For Python 3.1 and Python 3.2 (even 3.0 if necessary) a simple
+on-installation hook could be provided that tokenizes all source files and
+strips away the otherwise unnecessary ``u`` prefix at installation time.
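+
+A minimal sketch of what such a hook could do, using the standard
+``tokenize`` module. ``strip_u_prefix`` is a hypothetical helper name, not
+part of any existing installer. The sketch also assumes it runs on an
+interpreter whose tokenizer already recognizes the prefix (Python 3.3 or
+later); an older tokenizer would report ``u`` as a separate name token
+and would need different detection logic.

```python
import io
import tokenize


def strip_u_prefix(source):
    """Return `source` with the u/U prefix removed from string literals.

    Hypothetical helper; assumes a tokenizer that treats u'...' as a
    single STRING token (Python 3.3+).
    """
    lines = io.StringIO(source).readlines()
    # Collect the (row, col) start of every string literal with a u/U prefix.
    positions = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.STRING and tok.string[:1] in ('u', 'U'):
            positions.append(tok.start)
    # Delete from the end so that earlier column offsets stay valid.
    for row, col in sorted(positions, reverse=True):
        line = lines[row - 1]
        lines[row - 1] = line[:col] + line[col + 1:]
    return ''.join(lines)


converted = strip_u_prefix("name = u'foo'\n")
```

+Because only ``STRING`` tokens are touched, an identifier that happens to
+be called ``u`` is left alone.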
+
+Who Benefits?
+=============
+
+There are a couple of places where decisions have to be made for or
+against unicode support almost arbitrarily. This is mostly the case for
+protocols that do not support unicode all the way down, or hide it behind
+transport encodings that might or might not be unicode themselves. HTTP,
+Email and WSGI are good examples of that. For certain ambiguous cases it
+would be possible to apply the unicode logic of Python 3 to the Python 2
+versions of a library as well. If those details are exposed to the users
+of the API, however, doing so would break compatibility for existing
+users of the Python 2 API, which is a no-go in many situations. The
+automatic upgrading of binary strings to unicode strings that would be
+enabled by this proposal would make it much easier to port such
+libraries over.
+
+Not only the libraries but also the users of these APIs would benefit from
+that. For instance, the urllib module in Python 2 is using byte strings,
+and the one in Python 3 is using unicode strings. By leveraging a native
+string, users can avoid having to adjust for that.
+
+Problems with 2to3
+==================
+
+In practice 2to3 currently suffers from a few problems which make it
+unnecessarily difficult and/or unpleasant to use:
+
+- Bad overall performance. In many cases 2to3 runs one or two orders of
+  magnitude slower than the test suite for the library or application
+  it's testing.
+- Slightly different behaviour in 2to3 between different versions of
+  Python causes different outcomes when paired with custom fixers.
+- Line numbers from error messages do not match up with the real source
+  lines due to added/rewritten imports.
+- Extending 2to3 with custom fixers is nontrivial without using
+  distribute. By default 2to3 works acceptably well for upgrading
+  byte-based APIs to unicode based APIs but it fails to upgrade APIs
+  which already support unicode to Python 3::
+
+    --- test.py (original)
+    +++ test.py (refactored)
+    @@ -1,5 +1,5 @@
+     class Foo(object):
+         def __unicode__(self):
+    -        return u'test'
+    +        return 'test'
+         def __str__(self):
+    -        return unicode(self).encode('utf-8')
+    +        return str(self).encode('utf-8')
+
+
+APIs and Concepts Using Native Strings
+======================================
+
+The following is an incomplete list of APIs and general concepts that use
+native strings and need implicit upgrading to unicode in Python 3, and
+which would directly benefit from this support:
+
+- Python identifiers (dict keys, class names, module names, import
+  paths)
+- URLs for the most part as well as HTTP headers in urllib/http servers
+- WSGI environment keys and CGI-inherited values
+- Python source code for dynamic compilation and AST hacks
+- Exception messages
+- ``__repr__`` return value
+- preferred filesystem paths
+- preferred OS environment
+
+
+Modernizing Code
+================
+
+The 2to3 tool can be easily adjusted to generate code that runs on both
+Python 2 and Python 3. An experimental extension to 2to3 which only
+modernizes Python code to the extent that it runs on Python 2.7 or later
+with support for the ``six`` library is available as python-modernize
+[1]_. For most cases the runtime impact of ``six`` can be neglected (like
+a function that calls ``iteritems()`` on a passed dictionary under 2.x or
+``items()`` under 3.x), but making strings cheap on both 2.x and 3.x is
+nearly impossible. The way it currently works is by abusing the
+``unicode-escape`` codec on Python 2.x native strings, which decodes the
+literal again on every call. This is especially ugly if such a string
+literal is used in a tight loop.
+
+This proposal would fix this. The modernize module could easily be
+adjusted to simply not translate unicode strings, and the runtime overhead
+would disappear.
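+
+For comparison, the cheap kind of compatibility helper mentioned earlier
+in this section might look like the sketch below. It is a simplified
+stand-in written for this example, not the actual ``six`` source.

```python
import sys

if sys.version_info[0] >= 3:
    def iteritems(d):
        # Python 3: items() already returns a lazy view.
        return iter(d.items())
else:
    def iteritems(d):
        # Python 2: use the dedicated iterator method.
        return d.iteritems()

# The only per-call cost is one extra function call.
total = sum(value for _key, value in iteritems({'a': 1, 'b': 2}))
```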
+
+Possible Downsides
+==================
+
+The obvious downside of this is that potential Python 3 users would have
+to be aware of the fact that ``u`` is an optional prefix for strings.
+This is something that Python 3 in general tried to avoid: the second
+inequality comparison operator (``<>``) was removed, as was the ``L``
+prefix for long integers. This PEP proposes a slight reversal of that
+practice by reintroducing redundant syntax. On the other hand, Python
+already has multiple literals for strings with mostly the same behavior
+(single quoted, double quoted, single triple quoted, double triple
+quoted).
+
+Runtime Overhead of Wrappers
+============================
+
+I did some basic timings on the performance of a ``u()`` wrapper function
+as used by the ``six`` library. The implementation of ``u()`` is as
+follows::
+
+    if sys.version_info >= (3, 0):
+        def u(value):
+            return value
+    else:
+        def u(value):
+            return unicode(value, 'unicode-escape')
+
+The intention is that ``u'foo'`` can be turned into ``u('foo')`` and that
+on Python 2.x an implicit decoding happens. In this case the wrapper has
+a decoding overhead on Python 2.x. The timings below [2]_ show how bad the
+performance loss is; each example measures the execution time over 10000
+iterations::
+
+    u'\N{SNOWMAN}barbaz'      1000 loops, best of 3: 295 usec per loop
+    u('\N{SNOWMAN}barbaz')    10 loops, best of 3: 18.5 msec per loop
+    u'foobarbaz_%d' % x       100 loops, best of 3: 8.32 msec per loop
+    u('foobarbaz_%d') % x     10 loops, best of 3: 25.6 msec per loop
+    u'fööbarbaz'              1000 loops, best of 3: 289 usec per loop
+    u('fööbarbaz')            100 loops, best of 3: 15.1 msec per loop
+    u'foobarbaz'              1000 loops, best of 3: 294 usec per loop
+    u('foobarbaz')            100 loops, best of 3: 14.3 msec per loop
+
+The overhead of the wrapper function in Python 3 is the price of a
+function call since the function only has to return the argument
+unchanged.
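+
+The Python 3 side of this comparison can be reproduced roughly as
+follows. This is a sketch rather than the benchmark script referenced
+above: ``best_of`` is a hypothetical helper, and the ``globals`` argument
+of ``timeit.Timer`` requires Python 3.5 or later. Absolute numbers will
+differ from machine to machine.

```python
import timeit


def u(value):
    # Python 3 branch of the wrapper: return the argument unchanged.
    return value


def best_of(statement, repeat=3, number=10000):
    """Return the best total time in seconds for `number` executions."""
    timer = timeit.Timer(statement, globals={'u': u})
    return min(timer.repeat(repeat=repeat, number=number))


literal_time = best_of("u'foobarbaz'")    # plain prefixed literal
wrapped_time = best_of("u('foobarbaz')")  # literal passed through u()
```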
+
+
+References
+==========
+
+.. [1] Python-Modernize
+   (http://github.com/mitsuhiko/python-modernize)
+.. [2] Benchmark
+   (https://github.com/mitsuhiko/unicode-literals-pep/blob/master/timing.py)
+
+
+Copyright
+=========
+
+This document has been placed in the public domain.
+
+
+..
+ Local Variables:
+ mode: indented-text
+ indent-tabs-mode: nil
+ sentence-end-double-space: t
+ fill-column: 70
+ End:
-- 
Repository URL: http://hg.python.org/peps