This issue tracker has been migrated to GitHub ,
and is currently read-only.
For more information,
see the GitHub FAQs in the Python's Developer Guide.
Created on 2010年05月09日 06:38 by stutzbach, last changed 2022年04月11日 14:57 by admin. This issue is now closed.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | |
| aswidechar_nonbmp-4.patch | vstinner, 2010年10月01日 22:56 | |||
| ctypes_nonbmp.patch | vstinner, 2010年10月01日 23:02 | |||
| Messages (20) | |||
|---|---|---|---|
| msg105374 - (view) | Author: Daniel Stutzbach (stutzbach) (Python committer) | Date: 2010年05月09日 06:38 | |
Using a UCS2 Python on a platform with a 32-bit wchar_t, the following code throws an exception (but should not):
>>> ctypes.c_wchar('\u10000')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: one character unicode string expected
The trouble is in the u_set() function in Modules/_ctypes/cfield.c. The corresponding u_get() function looks correct.
On a UCS4 Python running on a system with a 16-bit wchar_t, u_set() will corrupt the data by silently truncating the character to 16-bits.
For reference, Linux and Mac OS use a 32-bit wchar_t while Windows uses a 16-bit wchar_t.
|
|||
| msg105656 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010年05月13日 21:10 | |
Support of characters outside the Unicode BMP (code > 0xffff) is not complete in narrow build (sizeof(Py_UNICODE) == 2) for Python2:
$ ./python
Python 2.7b2+ (trunk:81139M, May 13 2010, 18:45:37)
>>> x=u'\U00010000'
>>> x[0], x[1]
(u'\ud800', u'\udc00')
>>> len(x)
2
>>> ord(x)
Traceback (most recent call last):
...
TypeError: ord() expected a character, but string of length 2 found
>>> unichr(0x10000)
Traceback (most recent call last):
...
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
It looks better in Python3:
$ ./python
Python 3.2a0 (py3k:81137:81138, May 13 2010, 18:50:51)
>>> x='\U00010000'
>>> x[0], x[1]
('\ud800', '\udc00')
>>> len(x)
2
>>> ord(x)
65536
>>> chr(0x10000)
'\U00010000'
About the issue, the problem is in function u_set(). This function should use PyUnicode_AsWideChar() but PyUnicode_AsWideChar() doesn't support surrogates... whereas PyUnicode_FromWideChar() does support surrogates.
|
|||
| msg105670 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010年05月13日 23:56 | |
Patch for Python3:
- Fix PyUnicode_AsWideChar() to support surrogates (Py_UNICODE: 2 bytes, wchar_t: 4 bytes)
- u_set() of _ctypes uses PyUnicode_AsWideChar()
- add a test (skipped if sizeof(wchar_t) is smaller than 4 bytes)
It's too late to fix Python2: 2.7 beta 2 was released (it doesn't support non BMP characters for chr()/ord()).
TODO:
- I'm not sure that my patch on PyUnicode_AsWideChar() works if sizeof(wchar_t)==2
- Test the patch on Windows
- Check that it doesn't break other functions calling PyUnicode_AsWideChar()
ctypes, _locale.strcoll() and time.strftime() use PyUnicode_AsWideChar(). time has interesting comments:
/* This assumes that PyUnicode_AsWideChar doesn't do any UTF-16
expansion. */
if (PyUnicode_AsWideChar((PyUnicodeObject*)format,
(wchar_t*)PyBytes_AS_STRING(tmpfmt),
PyUnicode_GetSize(format)+1) == (size_t)-1)
/* This shouldn't fail. */
Py_FatalError("PyUnicode_AsWideChar failed");
|
|||
| msg117567 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010年09月29日 00:21 | |
#9979 proposes to create a new PyUnicode_AsWideCharString() function. |
|||
| msg117568 - (view) | Author: Daniel Stutzbach (stutzbach) (Python committer) | Date: 2010年09月29日 00:27 | |
I know enough about Unicode to have reported this bug, but I don't feel knowledgeable enough about Python's Unicode implementation to comment on your suggested solution. I'm adding the other people listed in Misc/maintainers.rst as interested in Unicode to the nosy list on this issue and the one you just linked to. |
|||
| msg117787 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010年10月01日 11:53 | |
Update the patch for the new PyUnicode_AsWideCharString() function: - use Py_UNICODE_SIZE and SIZEOF_WCHAR_T in the preprocessor tests - faster loop: don't use a counter + pointer, but only use pointers (for the stop condition) The patch is not finished: I have to implement "#elif Py_UNICODE_SIZE == 4 && SIZEOF_WCHAR_T == 2" case. |
|||
| msg117789 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010年10月01日 12:24 | |
Patch version 3: - fix unicode_aswidechar if Py_UNICODE_SIZE == SIZEOF_WCHAR_T and w == NULL (return the number of characters, don't write into w!) - improve unicode_aswidechar() comment |
|||
| msg117790 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010年10月01日 12:28 | |
I don't know how to test "if Py_UNICODE_SIZE == 4 && SIZEOF_WCHAR_T == 2". On Windows, sizeof(wchar_t) is 2, but it looks like Python is not prepared to have Py_UNICODE != wchar_t for is Windows implementation. wchar_t is 32 bits long on Linux and Mac OS X. So how can I test it? Or should we just drop support of "Py_UNICODE_SIZE == 4 && SIZEOF_WCHAR_T == 2"? |
|||
| msg117791 - (view) | Author: Daniel Stutzbach (stutzbach) (Python committer) | Date: 2010年10月01日 12:44 | |
I, too, can't think of any platforms where Py_UNICODE_SIZE == 4 && SIZEOF_WCHAR_T == 2 and I'm not sure what the previous policy has been. Have you noticed any other code that would set a precedent? If no one else chimes in, perhaps ask on IRC or python-dev? |
|||
| msg117792 - (view) | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2010年10月01日 12:46 | |
STINNER Victor wrote: > > STINNER Victor <victor.stinner@haypocalc.com> added the comment: > > I don't know how to test "if Py_UNICODE_SIZE == 4 && SIZEOF_WCHAR_T == 2". On Windows, sizeof(wchar_t) is 2, but it looks like Python is not prepared to have Py_UNICODE != wchar_t for is Windows implementation. > > wchar_t is 32 bits long on Linux and Mac OS X. So how can I test it? Or should we just drop support of "Py_UNICODE_SIZE == 4 && SIZEOF_WCHAR_T == 2"? You can tweak the Windows pyconfig.h to use UCS4, AFAIK, if you want to test drive this case. But it's probably easier to configure with "gcc -fshort-wchar" on Linux :-) Dropping support for this is not a good idea. |
|||
| msg117795 - (view) | Author: Daniel Stutzbach (stutzbach) (Python committer) | Date: 2010年10月01日 13:01 | |
> You can tweak the Windows pyconfig.h to use UCS4, AFAIK, if you want to > test drive this case. I seem to recall seeing some other code that assumed Windows implied UCS2. Proceed with caution. ;-) > But it's probably easier to configure with "gcc -fshort-wchar" on > Linux :-) libc will still be using sizeof(wchar_t) == 4, though. Won't that cause Bad Things to happen when calling libc wide-character functions? |
|||
| msg117796 - (view) | Author: Marc-Andre Lemburg (lemburg) * (Python committer) | Date: 2010年10月01日 13:06 | |
Daniel Stutzbach wrote: > > Daniel Stutzbach <daniel@stutzbachenterprises.com> added the comment: > >> You can tweak the Windows pyconfig.h to use UCS4, AFAIK, if you want to >> test drive this case. > > I seem to recall seeing some other code that assumed Windows implied UCS2. Proceed with caution. ;-) Probably, yes. I've never tried it myself. >> But it's probably easier to configure with "gcc -fshort-wchar" on >> Linux :-) > > libc will still be using sizeof(wchar_t) == 4, though. Won't that cause Bad Things to happen when calling libc wide-character functions? Sure, but this is just about testing an interface, not running applications :-) Here's what the GCC man-page has to say: -fshort-wchar Override the underlying type for wchar_t to be short unsigned int instead of the default for the target. This option is useful for building programs to run under WINE. Warning: the -fshort-wchar switch causes GCC to generate code that is not binary compatible with code generated without that switch. Use it to conform to a non-default application binary interface. |
|||
| msg117845 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010年10月01日 22:56 | |
Patch version 4: - implement unicode_aswidechar() for 16 bits wchar_t and 32 bits Py_UNICODE - PyUnicode_AsWideWcharString() returns the number of wide characters excluding the nul character as does PyUnicode_AsWideChar() For 16 bits wchar_t and 32 bits Py_UNICODE, I extracted the "as wide char" unicode functions into a small C file and compiled it with -fshort-wchar on Linux. |
|||
| msg117848 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010年10月01日 23:02 | |
Ooops, I lost my patch to fix the initial (ctypes) issue. Here is an updated patch: ctypes_nonbmp.patch (which needs aswidechar_nonbmp-4.patch). |
|||
| msg117866 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010年10月02日 11:17 | |
r85172 changes PyUnicode_AsWideCharString() (don't count the trailing nul character in the output size) and add unit tests. r85173 patches unicode_aswidechar() to supports non-BMP characters for all known wchar_t/Py_UNICODE size combinaisons (2/2, 2/4 and 4/2). I noticed that PyUnicode_AsWideChar() and PyUnicode_AsWideCharString() accept embeded nul characters. I don't know if it is a bug or an expected behaviour. Anyway, there is now a test for this case. |
|||
| msg117868 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010年10月02日 11:55 | |
r85174+r85177: ctypes.c_wchar supports non-BMP characters with 32 bits wchar_t => fix this issue (I commited also an unwanted change on _testcapi to fix r85172 in r85174: r85175 reverts this change, and r85176 fixes the _testcapi bug again) |
|||
| msg117869 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010年10月02日 11:59 | |
> r85173 patches unicode_aswidechar() to supports non-BMP characters > for all known wchar_t/Py_UNICODE size combinaisons (2/2, 2/4 and 4/2). Oh, and 4/4 ;-) |
|||
| msg117872 - (view) | Author: Daniel Stutzbach (stutzbach) (Python committer) | Date: 2010年10月02日 13:15 | |
Thanks for working on this! Since this was a bugfix, it should be merged back into 2.7, yes? |
|||
| msg117882 - (view) | Author: STINNER Victor (vstinner) * (Python committer) | Date: 2010年10月02日 15:29 | |
> Since this was a bugfix, it should be merged back into 2.7, yes? Mmmh, the fix requires to change PyUnicode_AsWideChar() function (support non-BMP characters and surrogate pairs) (and maybe also to create PyUnicode_AsWideCharString()). I don't really want to change such important function in a stable branch (Python2). Is it really important to support non-BMP characters for ctypes.c_wchar in Python2? I would like to say: if you want better unicode support, use Python 3. And Python 3.2 if it's possible :-) |
|||
| msg117883 - (view) | Author: Daniel Stutzbach (stutzbach) (Python committer) | Date: 2010年10月02日 15:33 | |
Since I noticed the bug through source code inspection and no one has reported it occurring in practice, that sounds reasonable to me. |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2022年04月11日 14:57:00 | admin | set | github: 52916 |
| 2010年10月02日 15:33:48 | stutzbach | set | messages:
+ msg117883 versions: - Python 2.7 |
| 2010年10月02日 15:29:09 | vstinner | set | messages: + msg117882 |
| 2010年10月02日 13:15:31 | stutzbach | set | messages:
+ msg117872 stage: test needed -> resolved |
| 2010年10月02日 11:59:14 | vstinner | set | messages: + msg117869 |
| 2010年10月02日 11:55:58 | vstinner | set | status: open -> closed resolution: fixed messages: + msg117868 |
| 2010年10月02日 11:17:32 | vstinner | set | messages: + msg117866 |
| 2010年10月01日 23:02:37 | vstinner | set | files:
+ ctypes_nonbmp.patch messages: + msg117848 |
| 2010年10月01日 22:56:28 | vstinner | set | files: - aswidechar_nonbmp-3.patch |
| 2010年10月01日 22:56:25 | vstinner | set | files: - aswidechar_nonbmp-2.patch |
| 2010年10月01日 22:56:16 | vstinner | set | files:
+ aswidechar_nonbmp-4.patch messages: + msg117845 |
| 2010年10月01日 13:06:02 | lemburg | set | messages: + msg117796 |
| 2010年10月01日 13:01:40 | stutzbach | set | messages: + msg117795 |
| 2010年10月01日 12:46:53 | lemburg | set | messages:
+ msg117792 title: c_types.c_wchar should not assume that sizeof(wchar_t) == sizeof(Py_UNICODE) -> c_types.c_wchar should not assume that sizeof(wchar_t) == sizeof(Py_UNICODE) |
| 2010年10月01日 12:44:59 | stutzbach | set | messages: + msg117791 |
| 2010年10月01日 12:28:35 | vstinner | set | messages: + msg117790 |
| 2010年10月01日 12:24:56 | vstinner | set | files:
+ aswidechar_nonbmp-3.patch messages: + msg117789 |
| 2010年10月01日 11:53:40 | vstinner | set | files: - pyunicode_aswidechar_surrogates-py3k.patch |
| 2010年10月01日 11:53:25 | vstinner | set | files:
+ aswidechar_nonbmp-2.patch messages: + msg117787 |
| 2010年09月29日 00:27:50 | stutzbach | set | nosy:
+ lemburg, ezio.melotti messages: + msg117568 |
| 2010年09月29日 00:21:32 | vstinner | set | messages: + msg117567 |
| 2010年05月13日 23:56:52 | vstinner | set | files:
+ pyunicode_aswidechar_surrogates-py3k.patch keywords: + patch messages: + msg105670 |
| 2010年05月13日 21:10:20 | vstinner | set | messages: + msg105656 |
| 2010年05月13日 01:00:48 | vstinner | set | nosy:
theller, vstinner, stutzbach components: + Unicode |
| 2010年05月13日 01:00:42 | vstinner | set | nosy:
+ vstinner |
| 2010年05月09日 06:38:33 | stutzbach | create | |