Created on 2012-10-17 02:19 by trent, last changed 2022-04-11 14:57 by admin.
| Messages (17) | |||
|---|---|---|---|
| msg173124 | Author: Trent Nelson (trent) * (Python committer) | Date: 2012-10-17 02:19 | |
======================================================================
ERROR: test_strxfrm (test.test_locale.TestEnUSCollation)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/cpython/buildslave/3.x.snakebite-solaris10-u10ga2-sparc64-1/build/Lib/test/test_locale.py", line 346, in test_strxfrm
    self.assertLess(locale.strxfrm('a'), locale.strxfrm('b'))
ValueError: character U+101010e is not in range [U+0000; U+10ffff]
======================================================================
ERROR: test_strxfrm_with_diacritic (test.test_locale.TestEnUSCollation)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/cpython/buildslave/3.x.snakebite-solaris10-u10ga2-sparc64-1/build/Lib/test/test_locale.py", line 367, in test_strxfrm_with_diacritic
    self.assertLess(locale.strxfrm('à'), locale.strxfrm('b'))
ValueError: character U+101010e is not in range [U+0000; U+10ffff]
----------------------------------------------------------------------
Haven't investigated yet.
|
|||
| msg173164 | Author: Trent Nelson (trent) * (Python committer) | Date: 2012-10-17 12:56 | |
With the caveat that I know absolutely nothing about locales, here's what I've been able to reduce the problem down to:
zinc (alias s11, Solaris 11 x64):
>>> locale.setlocale(locale.LC_ALL, 'C')
'C'
>>> locale.strxfrm('a')
'a'
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.strxfrm('a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: character U+10105a3 is not in range [U+0000; U+10ffff]
>>>
nitrogen (alias s10, Solaris 10 SPARC):
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.strxfrm('a')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: character U+101010e is not in range [U+0000; U+10ffff]
Not sure how relevant it is, but on both of those Solaris boxes locale.LC_ALL is 6, whereas on BSD and OS X it always seems to be 0.
|
|||
| msg173166 | Author: Jesús Cea Avión (jcea) * (Python committer) | Date: 2012-10-17 13:02 | |
I can reproduce this on my x86 Solaris 10 update 10. |
|||
| msg173167 | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2012-10-17 13:03 | |
With the system Python on s10:
Python 2.6.8 (unknown, Apr 13 2012, 17:08:12) [C] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.strxfrm('a')
'a'
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.strxfrm('a')
'\x01\x01\x01\x0e\x01\x01\x01\x01\x01\x01\x01\x02\x01\x01\x0fi\x01\x01\x01\x01'
>>> locale.strxfrm('a').decode('utf-8')
u'\x01\x01\x01\x0e\x01\x01\x01\x01\x01\x01\x01\x02\x01\x01\x0fi\x01\x01\x01\x01'
The difference between Python 2 and Python 3 is that Python 3 uses wcsxfrm, not strxfrm. Apparently Solaris' wcsxfrm is some broken thing that returns the same thing as strxfrm, cast to a wchar_t *, hence the character U+101010e (corresponding to the '\x01\x01\x01\x0e' bytestring above).
|
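The arithmetic behind the U+101010e value can be checked with a minimal sketch. It assumes that the first four bytes of the Python 2.6 strxfrm() output are reinterpreted as one big-endian 32-bit wchar_t (matching the SPARC box); the names `prefix` and `code` are illustrative only and do not come from CPython.

```python
# First four bytes of the Python 2.6 strxfrm('a') result quoted in the thread.
prefix = b"\x01\x01\x01\x0e"

# Assumption: the bytes are read as a single big-endian 32-bit wchar_t.
code = int.from_bytes(prefix, "big")
print(hex(code))        # 0x101010e

# The value lies beyond the last valid code point, so building a str fails.
print(code > 0x10FFFF)  # True
try:
    chr(code)
except ValueError as exc:
    print("rejected:", exc)
```

On a little-endian machine the same four bytes would produce a different integer, so this only illustrates the SPARC case.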
|||
| msg173168 | Author: Jesús Cea Avión (jcea) * (Python committer) | Date: 2012-10-17 13:05 | |
BTW, this works in Python 3.2:
x86, 32 bit python, Solaris 10 update 10:
"""
Python 3.2.3 (default, Apr 12 2012, 13:29:13)
[GCC 4.7.0] on sunos5
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_ALL, 'en_US.UTF-8')
'en_US.UTF-8'
>>> locale.strxfrm('a')
'���\U00010f69�'
"""
|
|||
| msg173171 | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2012-10-17 13:34 | |
It only works on Python 3.2 because PyUnicode_FromWideChar is more permissive, it seems. The first character in the wchar_t string returned by Solaris is still 0x101010e. |
|||
| msg173172 | Author: Antoine Pitrou (pitrou) * (Python committer) | Date: 2012-10-17 13:44 | |
(by the way, I also tried a memset() before calling wcsxfrm(): no change) |
|||
| msg173199 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2012-10-17 19:28 | |
Python 3.2 rejects characters outside the range U+0000-U+10ffff in some operations, but not everywhere. I fixed Python 3.3 to be more strict and always reject characters outside this range.
I noticed the Solaris issue with mbstowcs() on locale encodings other than UTF-8: #13441.
I asked on python-dev whether it's more important to be strict on Unicode, or whether we need to handle the wcsxfrm() issue:
http://mail.python.org/pipermail/python-dev/2011-December/114759.html
Stefan Krah answered: "Yes, if the cause is a broken mbstowcs() that sounds good."
http://mail.python.org/pipermail/python-dev/2011-December/114781.html
I asked for help on the OpenIndiana IRC channel, but nobody had a locale encoding other than UTF-8. I didn't have access to a Solaris box, so I chose to skip the failing tests on Solaris.
My commit 2a2d0872d993 (and 7ffe3d304487) skips many locales to work around this issue in test__locale.
|
|||
| msg289382 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2017-03-10 15:51 | |
Maybe issue15954 is related to this issue. Is this issue still reproducible? |
|||
| msg296414 | Author: Peter (petriborg) | Date: 2017-06-20 12:16 | |
I'm getting the same 2 errors in Python 3.4.6 on Solaris 11. They come up when you run 'gmake test' or:
./python -W default -bb -E -W error::BytesWarning -m test -r -w -j 0 -v test_locale.py
|
|||
| msg296415 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2017-06-20 12:23 | |
A solution for that would be to return the raw byte string, or a list of integers, rather than a Unicode string. I don't think the locale.strxfrm() result is supposed to be displayed in a terminal; it should only be used to compare two strings for sorting, or as a key function for list.sort(), for example. |
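The sort-key usage mentioned here can be sketched as follows. This is an illustration only: it assumes the portable "C" locale so that it runs on any platform, whereas the failures in this issue occur under 'en_US.UTF-8' on Solaris; the names `words` and `ordered` are made up for the example.

```python
import locale

# Assumption: the portable "C" locale, chosen only so the sketch runs anywhere.
# A real application would select a locale such as 'en_US.UTF-8'.
locale.setlocale(locale.LC_COLLATE, "C")

words = ["pear", "apple", "fig"]

# The transformed strings are only compared with each other, never displayed.
ordered = sorted(words, key=locale.strxfrm)
print(ordered)
```

Because only the relative order of the keys matters here, the concrete type returned by strxfrm() would not affect code written this way.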
|||
| msg296416 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2017-06-20 12:26 | |
Currently, the function is documented to return a string:
https://docs.python.org/dev/library/locale.html#locale.strxfrm
"Transforms a string to one that can be used in locale-aware comparisons."
The problem is that we don't have enough developers who care about Solaris/Illumos to fix these issues (propose patches). test_locale is just *one* example. The curses module has been broken for years on Solaris, if I recall correctly...
|
|||
| msg296418 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2017-06-20 12:47 | |
It is possible to use the special "encoding" for transformed strings on platforms with broken wcsxfrm(). All codes < 0x10000 are not changed. Codes >= 0x10000 are encoded as a pair: 0x10000 + (code >> 16), code & 0xffff. |
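The proposed pair encoding can be sketched in Python as below. This is not a patch from the issue: the function name `encode_wide_codes` is hypothetical, and the sketch assumes each input code is an unsigned 32-bit value as returned by a wcsxfrm() with a 32-bit wchar_t.

```python
def encode_wide_codes(codes):
    """Map 32-bit codes to a str whose ordering matches the numeric ordering.

    Hypothetical helper. Assumes 0 <= code <= 0xFFFFFFFF for every code.
    """
    out = []
    for code in codes:
        if code < 0x10000:
            # Codes below 0x10000 are kept unchanged.
            out.append(chr(code))
        else:
            # Larger codes become a pair: high half offset by 0x10000, low half.
            out.append(chr(0x10000 + (code >> 16)))
            out.append(chr(code & 0xFFFF))
    return "".join(out)


# The two out-of-range values reported in this issue.
key_sparc = encode_wide_codes([0x101010E])
key_x64 = encode_wide_codes([0x10105A3])
print([hex(ord(c)) for c in key_sparc])  # ['0x10101', '0x10e']
print(key_sparc < key_x64)               # True
```

The first element of a pair is at most 0x1FFFF, which is within the valid code point range, and it is always greater than any unchanged code, so comparing the resulting strings gives the same order as comparing the original code sequences.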
|||
| msg296435 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2017-06-20 14:20 | |
> It is possible to use the special "encoding" for transformed strings on platforms with broken wcsxfrm().
I wouldn't say that the function is wrong. wchar_t is 32 bits long, so the function is free to use numbers > 0x10ffff. It's more a Python limitation, no?
|
|||
| msg296440 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2017-06-20 14:36 | |
Agree, it's more a Python limitation. |
|||
| msg296441 | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2017-06-20 14:38 | |
> Agree, it's more a Python limitation.
What do you think of changing locale.strxfrm() from str to bytes or tuple? I prefer a tuple. But again, I'm not super motivated by this change. IMHO there are more severe issues that should be fixed in Solaris.
|
|||
| msg296445 | Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) | Date: 2017-06-20 14:54 | |
This will change the documented behavior. Even if we allow this change in a new feature release, it can't be made in maintained releases. A tuple of integers uses excessive memory and is slow. A bytes object is more compact (but may be less compact than a string) and faster. But on a little-endian platform every wchar_t would have to be converted to big-endian for comparison of bytes objects to work. |
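The byte-order point can be illustrated with a short sketch. It is an assumption-laden example, not code from CPython: `codes_to_key` is a hypothetical helper that packs each code as an unsigned 32-bit integer with the standard `struct` module.

```python
import struct


def codes_to_key(codes, byteorder=">"):
    """Pack 32-bit codes into bytes; '>' is big-endian, '<' is little-endian."""
    return b"".join(struct.pack(byteorder + "I", c) for c in codes)


# 0x00FF sorts before 0x0100 numerically.
low, high = [0x00FF], [0x0100]

big_ok = codes_to_key(low, ">") < codes_to_key(high, ">")
little_ok = codes_to_key(low, "<") < codes_to_key(high, "<")
print(big_ok, little_ok)  # True False
```

Bytes objects compare byte by byte from the left, so only the big-endian layout, which puts the most significant byte first, keeps the numeric order.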
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022-04-11 14:57:37 | admin | set | github: 60462 |
| 2017-06-20 14:54:41 | serhiy.storchaka | set | messages: + msg296445 |
| 2017-06-20 14:38:53 | vstinner | set | messages: + msg296441 |
| 2017-06-20 14:36:13 | serhiy.storchaka | set | messages: + msg296440 |
| 2017-06-20 14:20:32 | vstinner | set | messages: + msg296435 |
| 2017-06-20 12:48:36 | serhiy.storchaka | set | components: + Extension Modules, - Interpreter Core |
| 2017-06-20 12:47:47 | serhiy.storchaka | set | type: behavior; messages: + msg296418; components: + Interpreter Core; versions: + Python 3.5, Python 3.6, Python 3.7, - Python 3.3, Python 3.4 |
| 2017-06-20 12:26:30 | pitrou | set | nosy: - pitrou |
| 2017-06-20 12:26:12 | vstinner | set | messages: + msg296416 |
| 2017-06-20 12:23:29 | vstinner | set | messages: + msg296415 |
| 2017-06-20 12:16:35 | petriborg | set | nosy: + petriborg; messages: + msg296414 |
| 2017-03-10 15:51:28 | serhiy.storchaka | set | nosy: + serhiy.storchaka; messages: + msg289382 |
| 2012-10-17 19:28:28 | vstinner | set | messages: + msg173199 |
| 2012-10-17 14:36:26 | jcea | set | nosy: + vstinner |
| 2012-10-17 14:35:41 | jcea | link | issue13441 superseder |
| 2012-10-17 13:44:36 | pitrou | set | messages: + msg173172 |
| 2012-10-17 13:34:00 | pitrou | set | messages: + msg173171 |
| 2012-10-17 13:05:36 | jcea | set | keywords: + 3.3regression; messages: + msg173168 |
| 2012-10-17 13:03:20 | pitrou | set | nosy: + loewis, pitrou; messages: + msg173167 |
| 2012-10-17 13:02:59 | jcea | set | messages: + msg173166 |
| 2012-10-17 12:56:34 | trent | set | messages: + msg173164 |
| 2012-10-17 03:08:51 | jcea | set | nosy: + jcea |
| 2012-10-17 02:19:55 | trent | create | |