
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: unicode format does not really work in Python 2.x
Type: behavior
Stage: resolved
Components: Interpreter Core, Unicode
Versions: Python 2.7
process
Status: closed
Resolution: out of date
Dependencies: 15952
Superseder:
Assigned To:
Nosy List: Arfrever, Ariel.Ben-Yehuda, chris.jerdonek, eric.smith, ezio.melotti, loewis, petr.dlouhy@email.cz, serhiy.storchaka, vstinner
Priority: normal
Keywords:

Created on 2012-07-07 13:40 by Ariel.Ben-Yehuda, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Messages (19)
msg164844 - Author: Ariel Ben-Yehuda (Ariel.Ben-Yehuda) Date: 2012-07-07 13:40
Unicode formats (u'{:n}'.format) in Python 2.x assume that the thousands separator is ASCII, so this fails:
>>> import locale
>>> locale.setlocale(locale.LC_NUMERIC, 'fra') # or fr_FR on UNIX
>>> u'{:n}'.format(10000)
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    u'{:n}'.format(10000)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2: ordinal not in range(128)
However, it works correctly in Python 3, properly returning '10\xa0000' (the \xa0 is a non-breaking space).
msg164847 - Author: Chris Jerdonek (chris.jerdonek) * (Python committer) Date: 2012-07-07 13:54
Cf. the related issue 7300: "Unicode arguments in str.format()".
msg164892 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-07-07 15:57
Ariel: would you like to provide a patch?
msg164902 - Author: Ariel Ben-Yehuda (Ariel.Ben-Yehuda) Date: 2012-07-07 16:11
I don't work on CPython
msg164986 - Author: Berker Peksag (berker.peksag) * (Python committer) Date: 2012-07-08 10:02
I can't reproduce this with Python 2.7.3.
berker@wakefield ~[master*]$ python
Python 2.7.3 (default, Apr 20 2012, 22:39:59) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_NUMERIC, 'fr_FR')
'fr_FR'
>>> u'{:n}'.format(10000)
u'10 000'
msg165006 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2012-07-08 11:44
I confirm the bug on 2.7.
$ ./python 
Python 2.7.3+ (2.7:ab9d6c4907e7+, Apr 25 2012, 20:02:36) 
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_NUMERIC, 'uk_UA.UTF-8')
'uk_UA.UTF-8'
>>> u'{:n}'.format(10000)
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 2: ordinal not in range(128)
>>> '{:n}'.format(10000)
'10\xc2\xa0000'
msg170570 - Author: Chris Jerdonek (chris.jerdonek) * (Python committer) Date: 2012-09-16 19:06
I can't yet reproduce on my system, but after looking at the code, I believe the following are closer to the cause:
>>> format(10000, u'n')
>>> int.__format__(10000, u'n')
Incidentally, on my system, the following note in the docs is wrong:
"Note: format(value, format_spec) merely calls value.__format__(format_spec)."
(from http://docs.python.org/library/functions.html?#format )
>>> format(10000, u'n')
u'10000'
>>> 10000.__format__(u'n')
 File "<stdin>", line 1
 10000.__format__(u'n')
 ^
SyntaxError: invalid syntax
>>> int.__format__(10000, u'n')
'10000'
Observe also that format() and int.__format__() return different types.
msg170572 - Author: Eric V. Smith (eric.smith) * (Python committer) Date: 2012-09-16 19:19
The case with 10000.__format__ is confusing the parser. It sees:
<floating point number 10000.> __format__
which is indeed a syntax error.
Try:
>>> 10000 .__format__(u'n')
'10000'
or:
>>> (10000).__format__(u'n')
'10000'
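The parsing difference can be checked without running the call, by asking compile() whether each spelling is valid syntax. This is only a small sketch; the helper name parses is made up for illustration.

```python
def parses(source):
    """Return True if `source` is a syntactically valid expression."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


# "10000." is read as a float literal, so the name that follows is unexpected.
print(parses("10000.__format__('n')"))

# A space or parentheses ends the integer literal before the attribute access.
print(parses("10000 .__format__('n')"))
print(parses("(10000).__format__('n')"))
```

The first call reports invalid syntax, while the other two spellings compile.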
msg170573 - Author: Chris Jerdonek (chris.jerdonek) * (Python committer) Date: 2012-09-16 19:26
> The case with 10000.__format__ is confusing the parser.
Interesting, good catch! That error did seem unusual. The two modified forms do give the same result as int.__format__() (though the type still differs).
msg170581 - Author: Chris Jerdonek (chris.jerdonek) * (Python committer) Date: 2012-09-17 02:02
I did some analysis of this issue.
For starters, I could not reproduce this on Mac OS X 10.7.4. I iterated through all available locales, and the separator was ASCII in all cases.
Instead, I was able to fake the issue by changing "," to "\xa0" in the following line--
http://hg.python.org/cpython/file/820032281f49/Objects/stringlib/formatter.h#l651
and then reproduce with:
>>> u'{:,}'.format(10000)
 ..
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2: ordinal not in range(128)
>>> format(10000, u',')
 ..
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa0 in position 2: ordinal not in range(128)
However, note this difference (see also issue 15952)--
>>> (10000).__format__(u',')
'10\xa0000'
The issue seems to be that PyObject_Format() in Objects/abstract.c (which, unlike int__format__() in Objects/intobject.c, does respect whether the format string is unicode or not) calls int__format__() to get the formatted string as a byte string. It then passes this to PyObject_Unicode() to convert to unicode. This in turn calls PyUnicode_FromEncodedObject() with a NULL encoding, which causes that code to use PyUnicode_GetDefaultEncoding() for the encoding (i.e. sys.getdefaultencoding()).
The right way to fix this seems to be to make int__format__() return unicode as appropriate, which may mean modifying formatter.h's format_int_or_long_internal() to return unicode -- as well as taking into account the locale encoding when accessing the locale's thousands separator.
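A minimal sketch of the failure path described above, assuming the formatted result is the byte string b'10\xa0000' from the faked reproduction. The helper implicit_decode is hypothetical: it only stands in for the step that decodes with the default encoding, and it is not CPython code.

```python
def implicit_decode(raw, default_encoding="ascii"):
    """Stand-in for decoding a formatted byte string with the default encoding."""
    return raw.decode(default_encoding)


# Byte string as returned by the integer formatter in the faked reproduction.
formatted = b"10\xa0000"

try:
    implicit_decode(formatted)
    error = None
except UnicodeDecodeError as exc:
    error = exc  # the separator byte 0xa0 is not ASCII

# An ASCII separator survives the same implicit decoding.
plain = implicit_decode(b"10,000")

# Decoding with an encoding that actually matches the bytes succeeds.
# latin-1 is an assumption here; the real choice depends on the locale.
explicit = formatted.decode("latin-1")
```

The sketch suggests why the problem only shows up for locales whose separator is non-ASCII: the conversion itself is the same in both cases, only the bytes differ.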
msg170586 - Author: Chris Jerdonek (chris.jerdonek) * (Python committer) Date: 2012-09-17 06:22
Eric, it looks like you wrote this comment:
/* don't define FORMAT_LONG, FORMAT_FLOAT, and FORMAT_COMPLEX, since
 we can live with only the string versions of those. The builtin
 format() will convert them to unicode. */
in http://hg.python.org/cpython/file/19601d451d4c/Python/formatter_unicode.c
It seems like the current issue may be a valid reason for introducing a unicode FORMAT_INT (i.e. not just for type-purity and PEP 3101 compliance, but to avoid an exception). What do you think?
msg170719 - Author: Martin v. Löwis (loewis) * (Python committer) Date: 2012-09-19 09:45
> What do you think?
[Even though I wasn't asked]
I think we may need to close the issue as "won't fix". Depending on the
exact change proposed, it may be that the return type for existing
operations might change, which shouldn't be done in a bug fix release.
People running into this issue should port to Python 3 (IMO).
msg170778 - Author: Chris Jerdonek (chris.jerdonek) * (Python committer) Date: 2012-09-20 00:23
If we don't fix this (I'm leaning that way myself), I think we should somehow document the limitation. There are ways to acknowledge the limitation without getting into the specifics of this particular issue.
msg170801 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-09-20 11:45
I fixed a similar bug in Python 3.3: issue #13706.
changeset: 75231:f89e2f4cda88
user: Victor Stinner <victor.stinner@haypocalc.com>
date: Fri Feb 24 00:37:51 2012 +0100
files: Include/unicodeobject.h Lib/test/test_format.py Objects/stringlib/asciilib.h Objects/stringlib/localeutil.h Objects/stringlib/stringdefs.h Objects/stringlib/ucs1lib.h 
description:
Issue #13706: Fix format(int, "n") for locale with non-ASCII thousands separator
 * Decode thousands separator and decimal point using PyUnicode_DecodeLocale()
   (from the locale encoding), instead of decoding them implicitly from latin1
 * Remove _PyUnicode_InsertThousandsGroupingLocale(), it was not used
 * Change _PyUnicode_InsertThousandsGrouping() API to return the maximum
   character if unicode is NULL
 * Replace MIN/MAX macros by Py_MIN/Py_MAX
 * stringlib/undef.h undefines STRINGLIB_IS_UNICODE
 * stringlib/localeutil.h only supports Unicode
msg170802 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-09-20 12:03
> I can't reproduce this with Python 2.7.3.
> >>> locale.setlocale(locale.LC_NUMERIC, 'fr_FR')
> 'fr_FR'
> >>> u'{:n}'.format(10000)
> u'10 000'
I don't understand why, but not all French locales are the same. Some French locales use the standard ASCII space (U+0020) as the thousands separator, others use the non-breaking space (U+00A0). I suppose that some systems prefer to avoid non-ASCII characters to avoid "Unicode issues".
On Ubuntu 12.04, locale.localeconv()['thousands_sep'] is chr(32) for the locale fr_FR.utf8.
You may need to install other locales to test this issue. For example, the ps_AF locale uses U+066b as the decimal point and the thousands separator.
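To see which separator a given locale uses on a particular system, something like the following could be used. It is a sketch only: the helper name thousands_sep_for is made up, and the result depends entirely on which locales are installed.

```python
import locale


def thousands_sep_for(locale_name):
    """Return the thousands separator of `locale_name`, or None if unavailable."""
    saved = locale.setlocale(locale.LC_NUMERIC)  # query the current setting
    try:
        locale.setlocale(locale.LC_NUMERIC, locale_name)
    except locale.Error:
        return None  # the locale is not installed on this system
    try:
        return locale.localeconv()['thousands_sep']
    finally:
        # Always restore the previous numeric locale.
        locale.setlocale(locale.LC_NUMERIC, saved)


for name in ('C', 'fr_FR.utf8', 'ps_AF'):
    print(name, repr(thousands_sep_for(name)))
```

Printing the repr() makes it visible whether the separator is an ASCII space or something else.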
I chose not to fix the issue in Python 3.2 because it would require changing too much code (I don't want to introduce a regression, and the 3.2 code is very different from 3.3). You should upgrade to Python 3.3, or reimplement the Unicode format() function for numbers using locale.localeconv() ('thousands_sep', 'decimal_point' and 'grouping') :-/
Or find a more motivated developer. Or I can do the job if you pay me ;-)
(Read also the issue #13706 for more information.)
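A rough sketch of the "reimplement it with locale.localeconv()" suggestion for integers, under stated assumptions: the names group_digits, to_text and format_int_locale are invented for this example, decimal points and floats are not handled, and the locale's byte strings are assumed to be encoded in locale.getpreferredencoding().

```python
import locale


def group_digits(digits, thousands_sep, grouping):
    """Insert `thousands_sep` into a digit string following localeconv() grouping."""
    if not thousands_sep or not grouping:
        return digits
    groups = []
    rest = digits
    size = None
    for interval in grouping:
        if interval == locale.CHAR_MAX:
            break                    # no further grouping is performed
        repeat = (interval == 0)     # 0 means: reuse the previous size indefinitely
        if not repeat:
            size = interval
        if size is None:
            break                    # malformed list that starts with 0
        while len(rest) > size:
            groups.append(rest[-size:])
            rest = rest[:-size]
            if not repeat:
                break                # a non-zero entry applies to one group only
        if repeat:
            break
    groups.append(rest)
    return thousands_sep.join(reversed(groups))


def to_text(value, encoding):
    """Decode byte strings (as returned on Python 2); pass text through unchanged."""
    return value.decode(encoding) if isinstance(value, bytes) else value


def format_int_locale(number):
    """Format an integer using the current LC_NUMERIC separator and grouping."""
    conv = locale.localeconv()
    sep = to_text(conv['thousands_sep'], locale.getpreferredencoding())
    sign = '-' if number < 0 else ''
    return sign + group_digits(str(abs(number)), sep, conv['grouping'])
```

With a non-breaking space as separator and a grouping of [3, 3, 0], group_digits('10000', u'\xa0', [3, 3, 0]) gives u'10\xa0000', which is the value Python 3 was reported to return.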
msg171011 - Author: Chris Jerdonek (chris.jerdonek) * (Python committer) Date: 2012-09-22 18:32
I have a brief documentation patch in mind for this, but it relies on documentation issue 15952 being addressed first (e.g. to say that format(value) returns Unicode when format_spec is Unicode and that value.__format__() can return a string of type str). So I'm marking issue 15952 as a dependency.
msg174846 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2012-11-04 23:45
"If we don't fix this (I'm leaning that way myself), I think we should somehow document the limitation. There are ways to acknowledge the limitation without getting into the specifics of this particular issue."
I agree to document the limitation and close this issue as "wontfix".
A workaround is to format as a bytes string, and then decode the result from the locale encoding. It looks like locale.getpreferredencoding(True) should be used, not locale.getpreferredencoding(False).
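The workaround could look roughly like this. It is a sketch, not a tested fix: decode_formatted is a made-up name, and the byte literals are the values reported earlier in this issue rather than output produced here.

```python
import locale


def decode_formatted(raw, encoding=None):
    """Decode a locale-formatted byte string into text.

    `encoding` defaults to the locale's preferred encoding; pass it
    explicitly when it is known.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding(True)
    return raw.decode(encoding)


# On Python 2 the input would come from the byte-string format call, e.g.
#     raw = '{:n}'.format(10000)
# UTF-8 locale (uk_UA.UTF-8 in the report above):
print(repr(decode_formatted(b'10\xc2\xa0000', 'utf-8')))
# Single-byte locale encoding (latin-1 is assumed here for the 0xa0 case):
print(repr(decode_formatted(b'10\xa0000', 'latin-1')))
```

Unlike a hard-coded codec, passing no encoding lets the locale decide, which matters when the separator byte differs between locales.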
msg216689 - Author: Petr Dlouhý (petr.dlouhy@email.cz) Date: 2014-04-17 14:14
For anyone stuck on Python 2.x, here is a workaround (maybe it could find its way into the documentation as well):
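 def fix_grouping(bytestring):
     # Python 2 only: try the implicit (default-encoding) conversion first.
     try:
         return unicode(bytestring)
     except UnicodeDecodeError:
         # Fall back to UTF-8; this assumes a UTF-8 locale encoding.
         return bytestring.decode("utf-8")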
msg370432 - Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2020-05-31 12:52
Python 2.7 is no longer supported.
History
Date                 User                  Action  Args
2022-04-11 14:57:32  admin                 set     github: 59481
2020-05-31 12:52:02  serhiy.storchaka      set     messages: + msg370432
2020-05-31 12:11:35  serhiy.storchaka      set     status: open -> closed; resolution: out of date; stage: resolved
2014-04-17 14:14:08  petr.dlouhy@email.cz  set     nosy: + petr.dlouhy@email.cz; messages: + msg216689
2012-11-04 23:45:38  vstinner              set     messages: + msg174846
2012-11-04 17:52:11  berker.peksag         set     nosy: - berker.peksag
2012-09-22 18:32:04  chris.jerdonek        set     dependencies: + format(value) and value.__format__() behave differently with unicode format; messages: + msg171011
2012-09-20 12:03:48  vstinner              set     messages: + msg170802
2012-09-20 11:45:20  vstinner              set     messages: + msg170801
2012-09-20 11:40:16  vstinner              set     nosy: + vstinner
2012-09-20 00:23:05  chris.jerdonek        set     messages: + msg170778
2012-09-19 09:45:42  loewis                set     messages: + msg170719
2012-09-17 06:22:07  chris.jerdonek        set     messages: + msg170586
2012-09-17 02:02:22  chris.jerdonek        set     messages: + msg170581
2012-09-16 19:33:02  Arfrever              set     nosy: + Arfrever
2012-09-16 19:26:36  chris.jerdonek        set     messages: + msg170573
2012-09-16 19:19:36  eric.smith            set     messages: + msg170572
2012-09-16 19:06:10  chris.jerdonek        set     messages: + msg170570
2012-07-08 11:44:29  serhiy.storchaka      set     nosy: + ezio.melotti, serhiy.storchaka; messages: + msg165006; components: + Interpreter Core, Unicode; type: behavior
2012-07-08 10:02:21  berker.peksag         set     nosy: + berker.peksag; messages: + msg164986
2012-07-07 16:12:02  pitrou                set     nosy: + eric.smith
2012-07-07 16:11:08  Ariel.Ben-Yehuda      set     messages: + msg164902
2012-07-07 15:57:34  loewis                set     nosy: + loewis; messages: + msg164892
2012-07-07 13:54:31  chris.jerdonek        set     nosy: + chris.jerdonek; messages: + msg164847
2012-07-07 13:40:47  Ariel.Ben-Yehuda      create
