homepage

This issue tracker has been migrated to GitHub , and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: PyUnicode_FromWideChar incorrect for characters outside the BMP (unix only)
Type: behavior Stage:
Components: Interpreter Core Versions: Python 2.7
process
Status: closed Resolution: fixed
Dependencies: Superseder:
Assigned To: Nosy List: lemburg, mark.dickinson, rpetrov, vstinner
Priority: normal Keywords: patch

Created on 2008年11月30日 18:54 by mark.dickinson, last changed 2022年04月11日 14:56 by admin. This issue is now closed.

Files
File name Uploaded Description Edit
unicode_fromwidechar_surrogate-7.patch mark.dickinson, 2009年02月28日 16:33
Messages (33)
msg76651 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008年11月30日 18:54
On systems (Linux, OS X) where sizeof(wchar_t) is 4 and wchar_t arrays are 
usually encoded as UTF-32, it looks as though PyUnicode_FromWideChar 
simply truncates the 32-bit characters to 16-bits, thus giving incorrect 
results for characters outside the BMP. I expected it to convert the UTF-
32 encoding to UTF-16.
Note that PyUnicode_FromWideChar is used to process command-line 
arguments, so strange things can happen when passing filenames with non-
BMP characters to a Python script.
Here's an OS X 10.5 Terminal session (current directory is the root of the 
py3k tree).
dickinsm$ cat test𐅭.py
from sys import argv
print("My arguments are: ",argv)
dickinsm$ ./python.exe test𐅭.py
My arguments are: ['testŭ.py']
dickinsm$ ./python.exe Lib/tabnanny.py test𐅭.py
'testŭ.py': I/O Error: [Errno 2] No such file or directory: 'testŭ.py'
(In case the character after 'test' and before '.py' isn't showing up 
correctly, it's chr(65901), 'GREEK ACROPHONIC TROEZENIAN FIVE HUNDRED'.)
msg76653 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008年11月30日 19:00
Comments from MvL in issue 4388:
> I think you are right. However, conversion to/from wchar_t is/was
> rarely used, and so are non-BMP characters; it's very likely that
> the problem hasn't occurred in practice (and I doubt it would occur
> in 3.0 if not fixed - there are more severe problems around).
msg76665 - (view) Author: Roumen Petrov (rpetrov) * Date: 2008年11月30日 21:27
it is fine on linux (tested with UTF-8 codeset for locale):
$ ./python test𐅭.py
('My arguments are: ', ['test\xf0\x90\x85\xad.py'])
\xf0\x90\x85\xad (UTF-8) = 0001016d (USC-4) = 65901
msg76666 - (view) Author: Roumen Petrov (rpetrov) * Date: 2008年11月30日 21:29
s/USC-4/UCS-4/g
msg76670 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008年11月30日 22:22
> it is fine on linux
Interesting. Which version of Python is that? And is PyUNICODE 2 bytes 
or 4 bytes for that build of Python?
msg76684 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2008年12月01日 10:53
Just to be clear, the defect in PyUnicode_FromWideChar is present both in 
Python 2.x and Python 3.x.
The problem with command-line arguments only occurs in Python 3.x, since 
2.x doesn't use PyUnicode_FromWideChar in converting arguments.
I can reproduce the 'No such file or directory' error on both OS X and 
Linux, for Python 3.0.
msg76686 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2008年12月01日 11:07
This is due to the function downcasting the wchar_t values to
Py_UNICODE, which is a 2-byte value if you build Python as UCS2 version
on Unix.
Most Unixes ship with UCS4 builds, so you don't see the problem there.
Mac OS X ships with a UCS2 build, which is why you run into the problem
on that platform.
UCS2 builds are also the default build on Unix, so if you compile Python
yourself, it will result in a UCS2 build, unless you explicitly specify
the UCS4 build configure option or configure happens to find a Tcl/Tk
version installed that uses UCS4 internally.
msg76708 - (view) Author: Roumen Petrov (rpetrov) * Date: 2008年12月01日 22:04
Marc-Andre explain all. For the protocol my version is from trunk,
python is build with default options. Since system tcl limit UTF-8 to 3
bytes, python is build for UCS-2.
In the report output from python is with character 010d(UCS-2).
May be issue is not for versions before 3.0.
msg80001 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009年01月17日 03:58
Patch fixing PyUnicode_FromWideChar() for UCS-2 build: create 
surrogates for character > U+FFFF like PyUnicode_FromOrdinal() does.
msg80002 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009年01月17日 04:00
Note: I wrote my patch against py3k r68646.
msg80008 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009年01月17日 09:41
Thanks for the patch, Victor!
Looks pretty good at first glance, except that it seems that the UTF-32 to 
UTF-16 translation is skipped if HAVE_USABLE_WCHAR_T is defined. Is that 
deliberate?
A test would be good, too.
msg80012 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009年01月17日 13:00
> Looks pretty good at first glance, except that it seems that the UTF-32 to
> UTF-16 translation is skipped if HAVE_USABLE_WCHAR_T is defined. Is that
> deliberate?
#ifdef HAVE_USABLE_WCHAR_T
 memcpy(unicode->str, w, size * sizeof(wchar_t));
#else
 ...
#endif
I understand this code as: sizeof(wchar_t) == sizeof(Py_UNICODE). If I 
misunderstood the code, it's a a heap overflow :-) So there is no not 
conversion from UTF-32 to UTF-16 using memcpy if HAVE_USABLE_WCHAR_T is 
defined, right?
> A test would be good, too.
PyUnicode_FromWideChar() is not a public API. Should I write a function in 
_testcapi?
msg80013 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009年01月17日 13:18
> I understand this code as: sizeof(wchar_t) == sizeof(Py_UNICODE). If I 
> misunderstood the code, it's a a heap overflow :-)
Yep, sorry. You're right.
>> A test would be good, too.
>
>PyUnicode_FromWideChar() is not a public API. Should I write a function 
>in _testcapi?
I was actually thinking of a test for the "No such file or directory" 
bug that's mentioned at the start of the discussion.
msg80017 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009年01月17日 15:12
On 2009年01月17日 14:00, STINNER Victor wrote:
>> A test would be good, too.
> 
> PyUnicode_FromWideChar() is not a public API. Should I write a function in 
> _testcapi?
It is a public C API. Regardless of this aspect, we should always
add tests for bugs in order to make sure they don't pop up again.
msg80018 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009年01月17日 15:19
On 2009年01月17日 14:00, STINNER Victor wrote:
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
>> Looks pretty good at first glance, except that it seems that the UTF-32 to
>> UTF-16 translation is skipped if HAVE_USABLE_WCHAR_T is defined. Is that
>> deliberate?
> 
> #ifdef HAVE_USABLE_WCHAR_T
> memcpy(unicode->str, w, size * sizeof(wchar_t));
> #else
> ...
> #endif
> 
> I understand this code as: sizeof(wchar_t) == sizeof(Py_UNICODE). If I 
> misunderstood the code, it's a a heap overflow :-) So there is no not 
> conversion from UTF-32 to UTF-16 using memcpy if HAVE_USABLE_WCHAR_T is 
> defined, right?
If HAVE_USABLE_WCHAR_T is defined, Py_UNICODE is defined as wchar_t,
so a memcpy can be used. Note that this does not provide any information
about sizeof(wchar_t), e.g. with GLIBC, wchar_t is 4 bytes. MS C lib defines
it as 2 bytes.
That said, if Py_UNICODE is the same as wchar_t, no conversion is
necessary and that's why the function simply copies over the data.
msg80021 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009年01月17日 16:53
Updated patch including a test in _testcapi module: create two 
PyUnicode objects from wide string (PyUnicode_FromWideChar) and 
another from utf-8 (PyUnicode_FromString) and compare the value. Patch 
is still for py3k branch and can be ported easily on python trunk.
msg80023 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009年01月17日 17:07
I run my test on py3k on Linux with 32 bits wchar_t:
 - 16 bits Py_UNICODE: test is failing without 
PyUnicode_FromWideChar() patch
 - 32 bits Py_UNICODE: test pass without the patch, so the issue only 
impact 16 bits Py_UNICODE
Can someone test with 16 bits wchar_t (eg. Windows)? I think that the 
bug only impact 16 bits Py_UNICODE with 32 bits wchar_t.
msg80024 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009年01月17日 17:09
(with the full patch, all tests pass with 16 or 32 bits Py_UNICODE)
msg80132 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009年01月18日 21:59
Looks good to me.
I'm not in a position to test with 16-bit wchar_t, but I can't see why 
anything would go wrong. I think we can take our chances: check this in 
and watch the buildbots for signs of trouble.
Some minor whitespace issues in the unicodeobject.c part of the patch 
(mixing of tabs and spaces, one brace indented oddly), but those can 
easily be taken care of before committing; not worth regenerating the 
patch for.
Marc-André, is it okay with you to check this in?
msg80151 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009年01月19日 09:52
On 2009年01月18日 22:59, Mark Dickinson wrote:
> Mark Dickinson <dickinsm@gmail.com> added the comment:
> 
> Looks good to me.
> 
> I'm not in a position to test with 16-bit wchar_t, but I can't see why 
> anything would go wrong. I think we can take our chances: check this in 
> and watch the buildbots for signs of trouble.
> 
> Some minor whitespace issues in the unicodeobject.c part of the patch 
> (mixing of tabs and spaces, one brace indented oddly), but those can 
> easily be taken care of before committing; not worth regenerating the 
> patch for.
> 
> Marc-André, is it okay with you to check this in?
I'd structure the patch differently, ie. put the whole support code
into a single #ifndef Py_UNICODE_WIDE section as part of the
#ifdef HAVE_USABLE_WCHAR_T pre-processor statement.
Also note that on platforms with 16-bit wchar_t, the comparison
(0xffff < *w) will always be false, so an additional check for
(Py_UNICODE_SIZE > 2) is needed.
BTW: Please always use upper-case hex literals, or at leat don't
mix the case within the same function.
Thanks,
-- 
Marc-Andre Lemburg
eGenix.com
________________________________________________________________________
::: Try our new mxODBC.Connect Python Database Interface for free ! ::::
 eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
 Registered at Amtsgericht Duesseldorf: HRB 46611
 http://www.egenix.com/company/contact/ 
msg80311 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009年01月21日 02:05
> Also note that on platforms with 16-bit wchar_t, the comparison
> (0xffff < *w) will always be false, so an additional check for
> (Py_UNICODE_SIZE > 2) is needed.
Yes, but the right test is (SIZEOF_WCHAR_T > 2). I wrote a new test:
#if (Py_UNICODE_SIZE == 2) && (SIZEOF_WCHAR_T > 2)
#define USE_WCHAR_SURROGATE
 const wchar_t *orig_w;
#endif
> BTW: Please always use upper-case hex literals, or at leat don't
> mix the case within the same function.
I try, but it would be easier if the rule was already respected: they 
are many tabs and many lower-case hex literals. I used copy/paste from 
existing code of unicodeobject.c...
Patch version 3:
 - disable the UTF-16 surrogate for 16 bits wchar_t: so my patch is 
only used for 16 bits Py_UNICODE and 32 bits wchar_t... which is the 
default case for python 2.6 and 3.0 on Linux
 - replace tabulation by spaces (in existing code)
 - use upper case literals
msg80578 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009年01月26日 16:56
@marketdickinson, @lemburg: ping! I updated the patch, does it look 
better?
msg80636 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009年01月27日 08:49
On 2009年01月26日 17:56, STINNER Victor wrote:
> STINNER Victor <victor.stinner@haypocalc.com> added the comment:
> 
> @marketdickinson, @lemburg: ping! I updated the patch, does it look 
> better?
Yes, but there are a few things that still need fixing:
 * SIZEOF_WCHAR_T is not defined for Windows builds, so needs
 to be added to PC/pyconfig.h (OTOH wchar_t is 2 bytes on Windows)
 * USE_WCHAR_SURROGATE should be #defined just before the
 function and #undef'ed right after it; I'd also use a more
 accurate name
 * please use pre-processor indents, e.g.
 #ifdef ...
 # define ...
 #endif
I'd write
#if (Py_UNICODE_SIZE == 2) && defined((SIZEOF_WCHAR_T) && (SIZEOF_WCHAR_T > 2)
# define CONVERT_WCHAR_TO_SURROGATES
#endif
...
#undef CONVERT_WCHAR_TO_SURROGATES
Thanks.
msg80795 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009年01月29日 23:36
For lemburg, updated patch:
 - Move USE_WCHAR_SURROGATE define outside PyUnicode_FromWideChar() 
(and indent the defines, sorry)
 - Add "#define SIZEOF_WCHAR_T 2" to PC/pyconfig.h 
msg82674 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009年02月24日 19:38
Updated Victor's patch:
 - applies cleanly against newly whitespace-normalized unicodeobject.c
 - renamed USE_WCHAR_SURROGATE to CONVERT_WCHAR_TO_SURROGATES
 - add defined(SIZEOF_WCHAR_T) check
I find the patched version of PyUnicode_FromWideChar quite hard to follow 
with all the #ifdefery. I wonder whether it might be better to define a 
_PyUnicode16_FromWideChar32 helper function instead.
msg82676 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009年02月24日 19:52
On 2009年02月24日 20:39, Mark Dickinson wrote:
> Mark Dickinson <dickinsm@gmail.com> added the comment:
> 
> Updated Victor's patch:
> 
> - applies cleanly against newly whitespace-normalized unicodeobject.c
> - renamed USE_WCHAR_SURROGATE to CONVERT_WCHAR_TO_SURROGATES
> - add defined(SIZEOF_WCHAR_T) check
> 
> I find the patched version of PyUnicode_FromWideChar quite hard to follow 
> with all the #ifdefery. I wonder whether it might be better to define a 
> _PyUnicode16_FromWideChar32 helper function instead.
Same here. It would be better to have a single #ifdef #else #endif
block with one branch for the CONVERT_WCHAR_TO_SURROGATES case and the
other for the normal operation.
No need for a new helper function.
msg82677 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009年02月24日 19:56
> It would be better to have a single #ifdef #else #endif
Yes, of course it would. :)
msg82678 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009年02月24日 20:49
New patch, with two separate versions of PyUnicode_FromWideChar.
msg82679 - (view) Author: Marc-Andre Lemburg (lemburg) * (Python committer) Date: 2009年02月24日 20:58
On 2009年02月24日 21:50, Mark Dickinson wrote:
> Mark Dickinson <dickinsm@gmail.com> added the comment:
> 
> New patch, with two separate versions of PyUnicode_FromWideChar.
Thanks, much better :-)
msg82789 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2009年02月26日 23:40
> add defined(SIZEOF_WCHAR_T) check
I don't understand why SIZEOF_WCHAR_T could be unset, but the patch 
version 6 only checks defined(SIZEOF_WCHAR_T) in unicodeobject.c, not 
in _testcapimodule.c (#if SIZEOF_WCHAR_T == 4).
msg82919 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009年02月28日 16:33
Good catch! Added defined(SIZEOF_WCHAR) to the testcapi code as well,
and removed the change to PC/pyconfig.h, since we don't need it any 
more...
msg83751 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009年03月18日 14:49
Committed to py3k, r70452.
Since this is partway between a bugfix and a new feature, I suggest that 
it's not worth merging it to 3.0 (or 2.6). It should be backported to 
2.7, however; I'll do this after verifying that the py3k buildbots are 
happy.
msg83754 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2009年03月18日 16:10
Backported to the trunk in r70454. Thanks, all!
History
Date User Action Args
2022年04月11日 14:56:41adminsetgithub: 48724
2009年03月18日 16:10:55mark.dickinsonsetstatus: open -> closed
resolution: fixed
messages: + msg83754
2009年03月18日 14:49:45mark.dickinsonsetmessages: + msg83751
versions: - Python 2.6, Python 3.0, Python 3.1
2009年03月01日 13:04:20vstinnersetfiles: - unicode_fromwidechar_surrogate-6.patch
2009年02月28日 16:33:39mark.dickinsonsetfiles: + unicode_fromwidechar_surrogate-7.patch
messages: + msg82919
2009年02月26日 23:40:03vstinnersetmessages: + msg82789
2009年02月26日 23:37:00vstinnersetfiles: - unicode_fromwidechar_surrogate-4.patch
2009年02月26日 23:36:57vstinnersetfiles: - unicode_fromwidechar_surrogate-5.patch
2009年02月24日 20:58:13lemburgsetmessages: + msg82679
2009年02月24日 20:49:55mark.dickinsonsetfiles: + unicode_fromwidechar_surrogate-6.patch
messages: + msg82678
2009年02月24日 19:56:29mark.dickinsonsetmessages: + msg82677
2009年02月24日 19:52:25lemburgsetmessages: + msg82676
2009年02月24日 19:38:50mark.dickinsonsetfiles: + unicode_fromwidechar_surrogate-5.patch
messages: + msg82674
2009年01月29日 23:36:25vstinnersetfiles: - unicode_fromwidechar_surrogate-3.patch
2009年01月29日 23:36:19vstinnersetfiles: + unicode_fromwidechar_surrogate-4.patch
messages: + msg80795
2009年01月27日 08:49:10lemburgsetmessages: + msg80636
2009年01月26日 16:56:58vstinnersetmessages: + msg80578
2009年01月21日 02:05:17vstinnersetfiles: - unicode_fromwidechar_surrogate-2.patch
2009年01月21日 02:05:10vstinnersetfiles: + unicode_fromwidechar_surrogate-3.patch
messages: + msg80311
2009年01月19日 09:52:24lemburgsetmessages: + msg80151
2009年01月18日 21:59:13mark.dickinsonsetmessages: + msg80132
2009年01月17日 17:09:16vstinnersetmessages: + msg80024
2009年01月17日 17:07:16vstinnersetfiles: - unicode_fromwidechar_surrogate.patch
2009年01月17日 17:07:11vstinnersetmessages: + msg80023
2009年01月17日 16:53:33vstinnersetfiles: + unicode_fromwidechar_surrogate-2.patch
messages: + msg80021
2009年01月17日 15:19:15lemburgsetmessages: + msg80018
2009年01月17日 15:12:53lemburgsetmessages: + msg80017
2009年01月17日 13:18:35mark.dickinsonsetmessages: + msg80013
2009年01月17日 13:00:02vstinnersetmessages: + msg80012
2009年01月17日 09:41:43mark.dickinsonsetmessages: + msg80008
2009年01月17日 04:00:33vstinnersetmessages: + msg80002
2009年01月17日 03:58:33vstinnersetfiles: + unicode_fromwidechar_surrogate.patch
keywords: + patch
messages: + msg80001
2008年12月01日 23:07:55vstinnersetnosy: + vstinner
2008年12月01日 22:04:28rpetrovsetmessages: + msg76708
2008年12月01日 11:07:48lemburgsetnosy: + lemburg
messages: + msg76686
2008年12月01日 10:53:23mark.dickinsonsetmessages: + msg76684
2008年11月30日 22:22:04mark.dickinsonsetmessages: + msg76670
2008年11月30日 21:29:45rpetrovsetmessages: + msg76666
2008年11月30日 21:27:55rpetrovsetnosy: + rpetrov
messages: + msg76665
2008年11月30日 19:35:19loewissetversions: + Python 2.6, Python 3.0, Python 2.7
2008年11月30日 19:00:26mark.dickinsonsetmessages: + msg76653
2008年11月30日 18:54:07mark.dickinsoncreate

AltStyle によって変換されたページ (->オリジナル) /