This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Sorting with locale (strxfrm) does not work properly with Python3 on BSD or OS X
Type: behavior Stage:
Components: Unicode Versions: Python 3.4
process
Status: open Resolution:
Dependencies: Superseder:
Assigned To: Nosy List: ezio.melotti, lemburg, ned.deily, pnugues, r.david.murray, vstinner
Priority: normal Keywords:

Created on 2015-01-08 20:30 by pnugues, last changed 2022-04-11 14:58 by admin.

Messages (4)
msg233685 - (view) Author: Pierre Nugues (pnugues) Date: 2015-01-08 20:30
Sorting with sorted() and locale.strxfrm does not work properly on Mac OS X.
Here is a script to reproduce the issue:
import locale
locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
a = ["A", "E", "Z", "a", "e", "é", "z"]
sorted(a)
sorted(a, key=locale.strxfrm)
Running it on Mac OS X produces:
pierre:Flaubert pierre$ sw_vers -productVersion
10.10.1
pierre:Flaubert pierre$ python3
Python 3.4.2 (v3.4.2:ab2c023a9432, Oct 5 2014, 20:42:22) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
'fr_FR.UTF-8'
>>> a = ["A", "E", "Z", "a", "e", "é", "z"]
>>> sorted(a)
['A', 'E', 'Z', 'a', 'e', 'z', 'é']
>>> sorted(a, key=locale.strxfrm)
['A', 'E', 'Z', 'a', 'e', 'z', 'é']
while it produces this on your interactive shell (python.org):
In [10]: import locale
In [11]: locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
Out[11]: 'fr_FR.UTF-8'
In [12]: a = ["A", "E", "Z", "a", "e", "é", "z"]
In [13]: sorted(a)
Out[13]: ['A', 'E', 'Z', 'a', 'e', 'z', 'é']
In [14]: sorted(a, key=locale.strxfrm)
Out[14]: ['a', 'A', 'e', 'E', 'é', 'z', 'Z']
which is correct.
msg233687 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-01-08 21:27
locale.strxfrm() has a different implementation in Python 2 and in Python 3:
- Python 2 uses strxfrm(), so it works on byte strings
- Python 3 uses wcsxfrm(), so it works on wide-character strings ("unicode" strings)
It looks like Python 2 and 3 have the same behaviour on Mac OS X: the list is not sorted as expected. Tested on Mac OS X 10.9.2.
Imac-Photo:~ haypo$ cat collate2.py 
#coding:utf8
import locale, random
locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
print("LC_COLLATE = %s" % locale.setlocale(locale.LC_COLLATE, None))
a = ["A", "E", "Z", "\xc9", "a", "e", "\xe9", "z"]
random.shuffle(a)
print(sorted(a))
print(sorted(a, key=locale.strxfrm))
Imac-Photo:~ haypo$ cat collate3.py 
#coding:utf8
import locale, random
locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
print("LC_COLLATE = %s" % locale.setlocale(locale.LC_COLLATE, None))
a = ["A", "E", "Z", "\xc9", "a", "e", "\xe9", "z"]
random.shuffle(a)
print(ascii(sorted(a)))
print(ascii(sorted(a, key=locale.strxfrm)))
Imac-Photo:~ haypo$ LC_ALL=fr_FR.utf8 python collate2.py 
LC_COLLATE = fr_FR.UTF-8
['A', 'E', 'Z', 'a', 'e', 'z', '\xc9', '\xe9']
['A', 'E', 'Z', 'a', 'e', 'z', '\xc9', '\xe9']
Imac-Photo:~ haypo$ LC_ALL=fr_FR.utf8 ~/prog/python/default/python.exe ~/collate3.py 
LC_COLLATE = fr_FR.UTF-8
['A', 'E', 'Z', 'a', 'e', 'z', '\xc9', '\xe9']
['A', 'E', 'Z', 'a', 'e', 'z', '\xc9', '\xe9']
On Linux, I get the expected order with Python 3:
LC_COLLATE = fr_FR.UTF-8
['A', 'E', 'Z', 'a', 'e', 'z', '\xc9', '\xe9']
['a', 'A', 'e', 'E', '\xe9', '\xc9', 'z', 'Z']
On Linux, Python 2 gives me a strange order. It may be an issue in my program:
haypo@selma$ python x.py 
LC_COLLATE = fr_FR.UTF-8
['A', 'E', 'Z', 'a', 'e', 'z', '\xc9', '\xe9']
['\xe9', '\xc9', 'a', 'A', 'e', 'E', 'z', 'Z']
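The check performed by the scripts above can be wrapped in a small diagnostic. This is only a sketch for Python 3: the helper name collation_report is made up for illustration, and falling back to the currently active collation when the requested locale is not installed is an assumption of the sketch, not something the scripts above do.

```python
import locale


def collation_report(words, locale_name="fr_FR.UTF-8"):
    """Sort words by code point and by locale.strxfrm; report whether they differ."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query only, no change
    try:
        try:
            locale.setlocale(locale.LC_COLLATE, locale_name)
        except locale.Error:
            # Assumption of this sketch: if the locale is not installed,
            # keep whatever collation is currently active.
            pass
        active = locale.setlocale(locale.LC_COLLATE)
        by_code_point = sorted(words)
        by_strxfrm = sorted(words, key=locale.strxfrm)
    finally:
        # The locale is process-wide, so restore the previous setting.
        locale.setlocale(locale.LC_COLLATE, previous)
    return {
        "LC_COLLATE": active,
        "by_code_point": by_code_point,
        "by_strxfrm": by_strxfrm,
        "locale_aware": by_code_point != by_strxfrm,
    }


if __name__ == "__main__":
    report = collation_report(["A", "E", "Z", "\xc9", "a", "e", "\xe9", "z"])
    for name, value in report.items():
        print("%s = %s" % (name, ascii(value)))
```

Going by the outputs above, locale_aware would be expected to come out False on Mac OS X and True on Linux for this input, provided fr_FR.UTF-8 is installed.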
msg233690 - (view) Author: Ned Deily (ned.deily) * (Python committer) Date: 2015-01-08 22:26
The initial difference appears to be a long-standing BSD (including OS X) versus GNU/Linux platform difference. See, for example:
http://www.postgresql.org/message-id/18C8A481-33A6-4483-8C24-B8CE70DB7F27@eggerapps.at
Why there is no difference between en and fr UTF-8 is obvious when you look under the covers at the system locale definitions. This is on FreeBSD 10; OS X 10.10 is the same:
$ cd /usr/share/locale/fr_FR.UTF-8/
$ ls -l
total 8
lrwxr-xr-x 1 root wheel 28 Jan 16 2014 LC_COLLATE -> ../la_LN.US-ASCII/LC_COLLATE
lrwxr-xr-x 1 root wheel 17 Jan 16 2014 LC_CTYPE -> ../UTF-8/LC_CTYPE
lrwxr-xr-x 1 root wheel 30 Jan 16 2014 LC_MESSAGES -> ../fr_FR.ISO8859-1/LC_MESSAGES
-r--r--r-- 1 root wheel 36 Jan 16 2014 LC_MONETARY
lrwxr-xr-x 1 root wheel 29 Jan 16 2014 LC_NUMERIC -> ../fr_FR.ISO8859-1/LC_NUMERIC
-r--r--r-- 1 root wheel 364 Jan 16 2014 LC_TIME
For some reason US-ASCII is used for UTF-8 collation; this is also true for en_US.UTF-8 and de_DE.UTF-8, the only other ones I checked.
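The same inspection can be done from Python. The following is a minimal sketch, assuming the locale data lives in a directory laid out like the listing above; the function name describe_locale_dir is hypothetical and the path is passed in by the caller.

```python
import os


def describe_locale_dir(path):
    """Map each LC_* entry in path to its symlink target (None for a regular file)."""
    result = {}
    for name in sorted(os.listdir(path)):
        if not name.startswith("LC_"):
            continue  # ignore anything that is not a locale category
        full = os.path.join(path, name)
        # os.readlink returns the link text as stored, e.g. a relative path.
        result[name] = os.readlink(full) if os.path.islink(full) else None
    return result
```

Calling describe_locale_dir("/usr/share/locale/fr_FR.UTF-8") on a system with the layout shown above should report the LC_COLLATE target that points into the US-ASCII directory. On systems that store locale data differently, the directory may not exist and the call raises an OSError.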
The PostgreSQL discussion and some earlier Python issues suggest using ICU to properly implement Unicode functions like collation across all platforms. But that has never been implemented in Python. Adding Marc-Andre to the nosy list.
msg233691 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015-01-08 22:37
> The PostgreSQL discussion and some earlier Python issues suggest using ICU to properly implement Unicode functions like collation across all platforms.
In my experience, the locale module is error-prone and not reliable, especially if you want portability. It just uses functions provided by the OS. And the locales (LC_CTYPE, LC_MESSAGES, etc.) are process-wide, which becomes a major issue if you want to serve different clients using different locales... Windows supports a different locale per thread, if I remember correctly.
It would be more reliable to use a good library like ICU. You may try:
https://pypi.python.org/pypi/PyICU
Link showing how to use PyICU to sort a Python sequence:
https://stackoverflow.com/questions/11121636/sorting-list-of-string-with-specific-locale-in-python
=> strings.sort(key=lambda x: collator[loc].getCollationKey(x).getByteArray())
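Expanded into a self-contained sketch, the one-liner above might look like the following. It assumes the third-party PyICU package is installed and importable as icu; the helper name make_sort_key and the fallback to locale.strxfrm when PyICU is missing are illustrative choices, not part of the linked answer.

```python
import locale


def make_sort_key(locale_name="fr_FR"):
    """Return a key function for sorted(): ICU-based if PyICU is importable."""
    try:
        import icu  # third-party PyICU package; may not be installed
    except ImportError:
        # Fallback (an assumption of this sketch): use the OS collation,
        # with the platform caveats discussed in this issue.
        return locale.strxfrm
    collator = icu.Collator.createInstance(icu.Locale(locale_name))
    # Same call chain as the one-liner quoted above.
    return lambda s: collator.getCollationKey(s).getByteArray()


if __name__ == "__main__":
    words = ["A", "E", "Z", "a", "e", "\xe9", "z"]
    print(ascii(sorted(words, key=make_sort_key())))
```

Because the collator is created once and captured by the key function, it does not touch the process-wide locale settings, which is the portability concern raised above.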
History
Date                 User            Action  Args
2022-04-11 14:58:11  admin           set     github: 67384
2015-01-08 22:37:48  vstinner        set     messages: + msg233691
2015-01-08 22:27:21  ned.deily       set     title: Sorting with locale (strxfrm) does not work properly with Python3 on Macos -> Sorting with locale (strxfrm) does not work properly with Python3 on BSD or OS X
2015-01-08 22:26:41  ned.deily       set     nosy: + lemburg; messages: + msg233690
2015-01-08 21:48:54  r.david.murray  set     nosy: + r.david.murray
2015-01-08 21:48:27  r.david.murray  link    issue23196 superseder
2015-01-08 21:46:17  r.david.murray  set     title: Sorting with locale does not work properly with Python3 on Macos -> Sorting with locale (strxfrm) does not work properly with Python3 on Macos
2015-01-08 21:27:27  vstinner        set     messages: + msg233687
2015-01-08 20:33:56  ned.deily       set     nosy: + ned.deily
2015-01-08 20:30:56  pnugues         create
