Issue 13072: Getting a buffer from a Unicode array uses invalid format


This issue has been migrated to GitHub: https://github.com/python/cpython/issues/57281

classification

Title:	Getting a buffer from a Unicode array uses invalid format
Type:	behavior	Stage:	resolved
Components:	Library (Lib)	Versions:	Python 3.3

process

Dependencies:	Superseder:
Status:	closed	Resolution:	fixed
Assigned To:	vstinner	Nosy List:	Arfrever, georg.brandl, loewis, mark.dickinson, meador.inge, ncoghlan, pitrou, python-dev, skrah, vstinner
Priority:	release blocker	Keywords:	patch

Created on 2011-09-30 00:09 by vstinner, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
File name	Uploaded	Description
array_revert_pep393.patch	vstinner, 2012-08-01 10:19	review
array_revert_pep393-2.patch	vstinner, 2012-08-01 12:45	review
array_unicode_format.patch	vstinner, 2012-08-05 23:05	review
array_deprecate_u.diff	skrah, 2012-08-19 11:26	review

Messages (45)
msg144658	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2011-09-30 00:09
In Python 3.2, when you get a buffer from array.array('u'), "u" is used as the buffer format. The format is supposed to be a format from the struct module, and "u" is an invalid struct format. "w" is used in wide builds.

I just upgraded the array module to use the new Unicode API (PEP 393). The array now uses a Py_UCS4 buffer. I used the "I" or "L" format depending on the size of the int and long C types. It would be better to use a format for a Py_UCS4 string, but struct doesn't support such a type.

For Python 2.7 and 3.2, I don't know if it is really a bug or not.
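Editorial sketch, not part of the original message: the claim that "u" and "w" are not struct formats can be checked with the struct module alone. The helper name is_valid_struct_format is hypothetical; struct.calcsize and struct.error are the real calls it relies on.

```python
import struct


def is_valid_struct_format(fmt):
    """Return True if the struct module accepts fmt (hypothetical helper)."""
    try:
        struct.calcsize(fmt)
    except struct.error:
        return False
    return True


# 'u' and 'w' are the codes array exported for Unicode arrays;
# 'I' and 'L' are the unsigned integer codes mentioned as replacements.
results = {code: is_valid_struct_format(code) for code in ("u", "w", "I", "L")}
for code, ok in results.items():
    print(code, "valid" if ok else "invalid")

# Native sizes of the integer candidates on this platform (these vary by ABI).
sizes = {code: struct.calcsize(code) for code in ("I", "L")}
print(sizes)
```

The sizes of "I" and "L" depend on the platform, which is why the message picks between them at build time.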
msg144812	Author: Stefan Krah (skrah) * (Python committer)	Date: 2011-10-03 10:34
The automatic conversion of 'u' to 'I' or 'L' causes test_buffer (PEP-3118 repo) to fail:

    # Not implemented formats. Ugly, but inevitable. This is the same as
    # issue #2531: equality is also used for membership testing and must
    # return a result.
    a = array.array('u', 'xyz')
    v = memoryview(a)
    self.assertNotEqual(v, a)
    self.assertNotEqual(a, v)

I don't have a better idea though what to do about 'u' except officially implementing it for struct and memoryview as well.
msg144814	Author: Stefan Krah (skrah) * (Python committer)	Date: 2011-10-03 10:52
> It would be better to use a format for a Py_UCS4 string, but struct doesn't support such type.

PEP-3118 suggests for the extended struct syntax:

    'c' -> ucs-1 (latin-1) encoding
    'u' -> ucs-2
    'w' -> ucs-4
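Editorial sketch: the item widths implied by the three proposed codes can be illustrated by encoding a short string. The PROPOSED table is hypothetical, and the codecs are stand-ins chosen for illustration: UTF-16 and UTF-32 coincide with UCS-2 and UCS-4 only for the characters used here (all in the Basic Multilingual Plane).

```python
# Hypothetical table: proposed format code -> (stand-in codec, bytes per item).
PROPOSED = {
    "c": ("latin-1", 1),    # ucs-1
    "u": ("utf-16-le", 2),  # ucs-2 (equal to UTF-16 only within the BMP)
    "w": ("utf-32-le", 4),  # ucs-4
}

text = "xyz"
widths = {}
for code, (codec, expected_width) in PROPOSED.items():
    raw = text.encode(codec)
    # Every character in "xyz" occupies exactly one item in each encoding.
    widths[code] = len(raw) // len(text)
    print(code, codec, raw.hex(), "expected width:", expected_width)
```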
msg144817	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2011-10-03 13:34
> The automatic conversion of 'u' to 'I' or 'L' causes test_buffer
> (PEP-3118 repo) to fail:
>
> # Not implemented formats. Ugly, but inevitable. This is the same as
> # issue #2531: equality is also used for membership testing and must
> # return a result.
> a = array.array('u', 'xyz')
> v = memoryview(a)
> self.assertNotEqual(v, a)
> self.assertNotEqual(a, v)

I don't understand: a buffer format is a format for the struct module, or for the array module?
msg144818	Author: Stefan Krah (skrah) * (Python committer)	Date: 2011-10-03 14:00
STINNER Victor <report@bugs.python.org> wrote:
> > # Not implemented formats. Ugly, but inevitable. This is the same as
> > # issue #2531: equality is also used for membership testing and must
> > # return a result.
> > a = array.array('u', 'xyz')
> > v = memoryview(a)
> > self.assertNotEqual(v, a)
> > self.assertNotEqual(a, v)
>
> I don't understand: a buffer format is a format for the struct module,
> or for the array module?

It's like this: memoryview follows the current struct syntax, which doesn't have 'u'. memory_richcompare() does not understand 'u', but is required to return something for __eq__ and __ne__, so it returns 'not equal'.

This isn't so important, since I discovered (see my later post) that 'u' and 'w' were scheduled for inclusion in the struct module anyway.

So I think we should focus on whether the proposed 'c', 'u' and 'w' format specifiers still make sense after the PEP-393 changes.
msg158381	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2012-04-16 00:19
@Stefan: What is the status of this issue?
msg158892	Author: Stefan Krah (skrah) * (Python committer)	Date: 2012-04-20 21:14
I'm not sure what to do. Martin's opinion was that the change should be reverted: http://mail.python.org/pipermail/python-dev/2012-March/117390.html
msg167091	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2012-08-01 06:59
Should we do something before Python 3.3 final?
msg167109	Author: Stefan Krah (skrah) * (Python committer)	Date: 2012-08-01 10:07
Is it possible without too much effort to keep the old behavior ('u' -> Py_UNICODE)? Then I'd say that should go into 3.3. The problem with the current behavior is that it's neither backwards compatible nor PEP-3118 compliant. If it is too much work to restore the status quo, we could leave this change with the excuse that 'u' is probably not used very often.
msg167112	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2012-08-01 10:19
Here is a patch reverting the PEP 393 changes, as suggested by Martin von Loewis. With the patch, array uses the Py_UNICODE* type for the 'u' format. So array.array('u', '\U0010ffff')[0] should return '\uDBFF' on Windows.
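Editorial sketch: where Py_UNICODE is 16 bits wide (as on Windows), U+10FFFF has to be stored as a UTF-16 surrogate pair, and index 0 holds the high surrogate. The function name to_surrogate_pair is hypothetical; the arithmetic is the standard UTF-16 encoding, cross-checked against Python's own codec.

```python
import struct


def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


high, low = to_surrogate_pair(0x10FFFF)
print(hex(high), hex(low))

# Cross-check against the UTF-16 encoder: two 16-bit code units.
units = struct.unpack("<2H", "\U0010ffff".encode("utf-16-le"))
print([hex(u) for u in units])
```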
msg167119	Author: Stefan Krah (skrah) * (Python committer)	Date: 2012-08-01 12:16
The diff between b9558df8cc58 and default with array_revert_pep393.patch applied is small, but I noticed that in some places you switched back to Py_UNICODE typecode and in others not. For instance, in struct arraydescr typecode is still char. I'm not sure why typecode was originally Py_UNICODE though.
msg167122	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2012-08-01 12:45
> The diff between b9558df8cc58 and default with array_revert_pep393.patch
> applied is small, but I noticed that in some places you switched back to
> Py_UNICODE typecode and in others not.

I just copied code from Python 3.2, I forgot to update typecode type (Py_UNICODE => char).

I attach a new patch which changes also the documentation of the "u" format.
msg167165	Author: Stefan Krah (skrah) * (Python committer)	Date: 2012-08-01 19:29
array_revert_pep393-2.patch looks good (checked against 7042a83f37e and all following commits that should be kept).
msg167173	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2012-08-01 22:15
@Georg: are you ok with this change? It reverts to the behaviour of Python 3.2 and avoids having to maintain an API that nobody wants to use ('u' format using Py_UCS4, 32 bits unsigned).
msg167520	Author: Roundup Robot (python-dev) (Python triager)	Date: 2012-08-05 22:54
New changeset 95da47ddebe0 by Victor Stinner in branch 'default':
Close #13072: Restore code before the PEP 393 for the array module
http://hg.python.org/cpython/rev/95da47ddebe0
msg167521	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2012-08-05 23:05
Oops, the initial issue is not solved. The attached patch fixes the array == memoryview issue by using a valid format for the buffer.
msg167522	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2012-08-05 23:07
Hum, this issue is a regression from Python 3.2. I would like to see it fixed in Python 3.3. Example:

Python 3.2.3+ (3.2:243ad1a6f638+, Aug 4 2012, 01:36:41)
[GCC 4.6.3 20120306 (Red Hat 4.6.3-2)] on linux2
>>> import array
>>> a=array.array('u', 'xyz')
>>> b=memoryview(a)
>>> a == b
True
>>> b == a
True
msg167540	Author: Georg Brandl (georg.brandl) * (Python committer)	Date: 2012-08-06 05:47
Victor: the revert commit brought back "Python's Unicode character type" into the docs. This needs to be fixed to say "legacy" somewhere, as the characters in a normal Unicode string are not of that type anymore.
msg167545	Author: Stefan Krah (skrah) * (Python committer)	Date: 2012-08-06 08:47
STINNER Victor <report@bugs.python.org> wrote:
> Hum, this issue is a regression from Python 3.2.
>
> Python 3.2.3+ (3.2:243ad1a6f638+, Aug 4 2012, 01:36:41)
> [GCC 4.6.3 20120306 (Red Hat 4.6.3-2)] on linux2
> >>> import array
> >>> a=array.array('u', 'xyz')
> >>> b=memoryview(a)
> >>> a == b
> True
> >>> b == a
> True

[3.3 returns False]

That's actually deliberate. The new memoryview does not consider arrays equal if the format codes do not match, to avoid situations like (32-bit Python):

Python 3.2a0 (py3k:76143M, Nov 7 2009, 17:05:38)
[GCC 4.2.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import array
>>> a = array.array('f', [0])
>>> b = array.array('i', [0])
>>> x = memoryview(a)
>>> y = memoryview(b)
>>>
>>> a == b
True
>>> x == y
True
>>>

I think that (for buffers at least) an array of float should not compare equal to an array of int, especially since the 3.2 memoryview uses memcmp() in richcompare().

See also the comment in the documentation for memoryview.format:

http://docs.python.org/dev/library/stdtypes.html#memoryview-type

memoryview is not aware of the 'u' format code, since it's not part of the struct syntax and the PEP-3118 proposition 'u' -> UCS2, 'w' -> UCS4 wasn't considered useful.

Now in your example I see that array's getbufferproc actually already uses 'w' for UCS4. It would still be an option to make memoryview aware of 'u' and 'w' (as suggested by the PEP).
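Editorial sketch: the objection to a pure memcmp() comparison can be reproduced with struct alone. Two buffers can hold byte-for-byte identical data and still mean different values once their format codes are taken into account. The variable names are illustrative; the "=" prefix selects native byte order with standard sizes, so both items are 4 bytes.

```python
import struct

# 1.0 as an IEEE 754 single-precision float has the bit pattern 0x3F800000.
float_bytes = struct.pack("=f", 1.0)
int_bytes = struct.pack("=i", 0x3F800000)

# A memcmp()-style comparison sees identical memory...
same_bytes = float_bytes == int_bytes

# ...but interpreting each buffer with its own format gives different values.
float_value = struct.unpack("=f", float_bytes)[0]
int_value = struct.unpack("=i", int_bytes)[0]

print(same_bytes, float_value, int_value)
```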
msg167546	Author: Stefan Krah (skrah) * (Python committer)	Date: 2012-08-06 09:07
Also, it was suggested that 'u' should be deprecated:

http://mail.python.org/pipermail/python-dev/2012-March/117392.html

Personally, I don't have an opinion on that; I don't use the 'u' format code.

Nick, could you have a look at msg167545 and see if any action should be taken?
msg167547	Author: Stefan Krah (skrah) * (Python committer)	Date: 2012-08-06 09:26
Of course, if two formats are the same, it is possible to use memcmp(). I'll work on a patch.
msg167549	Author: Alyssa Coghlan (ncoghlan) * (Python committer)	Date: 2012-08-06 09:52
Perhaps if memoryview doesn't understand the format code, it can fall back on memcmp() if strcmp() indicates the format codes are the same? Otherwise we're at risk of breaking backwards compatibility with more than just array('u').

Also, if it isn't already, the change to take format codes into account in memoryview comparisons should be mentioned in the What's New porting section.
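Editorial sketch of the fallback proposed in this message, written in Python rather than C and not taken from any patch on this issue. The function buffers_equal and its signature are hypothetical. It treats mismatched format strings as unequal (the rule described in msg167545), compares unpacked values when struct understands the format, and otherwise falls back to a raw byte comparison, the analogue of memcmp().

```python
import struct


def buffers_equal(fmt_a, raw_a, fmt_b, raw_b):
    """Hypothetical format-aware comparison with a raw-bytes fallback."""
    if fmt_a != fmt_b:
        # strcmp() on the format strings says they differ: not equal.
        return False
    try:
        struct.calcsize(fmt_a)
    except struct.error:
        # Unknown format code (e.g. 'u' or 'w'): compare memory directly.
        return raw_a == raw_b
    # Known format: compare item by item using struct semantics.
    items_a = list(struct.iter_unpack(fmt_a, raw_a))
    items_b = list(struct.iter_unpack(fmt_b, raw_b))
    return items_a == items_b


raw = struct.pack("3I", ord("x"), ord("y"), ord("z"))
print(buffers_equal("w", raw, "w", raw))  # unknown code, same bytes
print(buffers_equal("w", raw, "I", raw))  # format strings differ
print(buffers_equal("I", raw, "I", raw))  # known code, same values
```

With this rule an array('u') and a memoryview of it compare equal again, without memoryview having to understand what 'u' or 'w' mean.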
msg167551	Author: STINNER Victor (vstinner) * (Python committer)	Date: 2012-08-06 10:19
> memoryview is not aware of the 'u' format code, since it's not part of
> the struct syntax and the PEP-3118 proposition 'u' -> UCS2, 'w' -> UCS4
> wasn't considered useful.

Did you see attached patch array_unicode_format.patch? It uses struct format "H" or "I" depending on the size of wchar_t.
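Editorial sketch of the selection rule described here, not the patch itself (which is C code in the array module). The function unicode_export_format is hypothetical, and ctypes is used only as a convenient way to read the size of wchar_t from Python; it may be unavailable on some installations.

```python
import ctypes
import struct


def unicode_export_format():
    """Pick a valid struct code matching the platform's wchar_t width."""
    wchar_size = ctypes.sizeof(ctypes.c_wchar)
    if wchar_size == 2:
        return "H"  # 16-bit wchar_t, e.g. Windows
    if wchar_size == 4:
        return "I"  # 32-bit wchar_t, e.g. most Linux builds
    raise NotImplementedError("unexpected wchar_t size: %d" % wchar_size)


fmt = unicode_export_format()
print(fmt, struct.calcsize(fmt))
```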
msg167561	Author: Stefan Krah (skrah) * (Python committer)	Date: 2012-08-06 13:19
> Did you see attached patch array_unicode_format.patch? It uses struct
> format "H" or "I" depending on the size of wchar_t.

I totally overlooked that. Given that memoryview can be fixed to compare buffers with unknown formats, I don't have a strong opinion on whether array's getbufferproc should alter the format codes of 'u' and 'w' or not.

The only advantage for memoryview would be that tolist() etc. would work. However, tolist() previously only worked for bytes, so in this case raising an exception for 'u' and 'w' is not a regression but an improvement. :)

If we're deprecating 'u' and 'w' anyway, the getbufferproc should probably continue to return 'u' and 'w' until the removal of these format codes.
msg167566	Author: Alyssa Coghlan (ncoghlan) * (Python committer)	Date: 2012-08-06 14:02
I think Victor's patch is a good solution to killing the 'u' and 'w' exports in 3.4, but we need to restore some tolerance for unknown format codes to memoryview in 3.3 regardless.
msg167571	Author: Stefan Krah (skrah) * (Python committer)	Date: 2012-08-06 19:48
I have a patch already for the unknown format codes in memoryview. Currently fighting (as usual) with the case explosions in the tests. I think I can have a full patch by next weekend.
msg167673	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2012-08-08 07:40
Someone broke the Windows buildbots.
msg167702	Author: Roundup Robot (python-dev) (Python triager)	Date: 2012-08-08 18:13
New changeset e0f3406c43e4 by Victor Stinner in branch 'default':
Issue #13072: Fix test_array for Windows with 16-bit wchar_t
http://hg.python.org/cpython/rev/e0f3406c43e4
msg167703	Author: Roundup Robot (python-dev) (Python triager)	Date: 2012-08-08 18:23
New changeset 67a994d5657d by Victor Stinner in branch 'default':
Issue #13072: Ooops, now fix test_array for Linux with 32-bit wchar_t...
http://hg.python.org/cpython/rev/67a994d5657d
msg167708	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2012-08-08 20:05
And the test fails on machines without ctypes :)
msg167732	Author: Roundup Robot (python-dev) (Python triager)	Date: 2012-08-08 22:47
New changeset 4ee4cceda047 by Victor Stinner in branch 'default':
Issue #13072: Fix test_array for installation without the ctypes module
http://hg.python.org/cpython/rev/4ee4cceda047
msg167936	Author: Georg Brandl (georg.brandl) * (Python committer)	Date: 2012-08-11 06:33
Deferring.
msg167947	Author: Martin v. Löwis (loewis) * (Python committer)	Date: 2012-08-11 09:43
Is there anything that still needs to be done on this issue? ISTM that the code is correct as it stands (i.e. getting a buffer now only uses valid format codes).
msg167997	Author: Alyssa Coghlan (ncoghlan) * (Python committer)	Date: 2012-08-11 19:16
There's still work to be done. The current status in 3.3 trunk is that:

Wide build:
>>> memoryview(array("u")).format
'w'

Narrow build:
>>> memoryview(array("u")).format
'u'

Neither of these are valid struct formats, thus they don't play nicely with the assumptions of memoryview (or any other PEP 3118 consumer). Stefan's memoryview changes are needed because there are valid struct formats that memoryview doesn't understand (yet), but it's only coincidental that they will reduce the severity of this problem.

Victor's latest patch switches the 'w' and 'u' for the appropriate integer sizes 'I' and 'H' which I think is an excellent approach.

There are also the post-reversion documentation changes Georg requested to bring the docs back into line with PEP 393.
msg168005	Author: Martin v. Löwis (loewis) * (Python committer)	Date: 2012-08-11 20:07
> Wide build:
> >>> memoryview(array("u")).format
> 'w'
>
> Narrow build:
> >>> memoryview(array("u")).format
> 'u'
>
> Neither of these are valid struct formats, thus they don't play
> nicely with the assumptions of memoryview (or any other PEP 3118
> consumer).

Why do you say that? They have been added by PEP 3118 (and are just not implemented in the struct module yet). If you think that their mentioning in PEP 3118 is a mistake, and they should not get implemented in struct, we should
a) get consensus on that interpretation of the PEP, and
b) actually remove them from the PEP, since otherwise it is very confusing that they keep being mentioned.

I believe that the addition of these codes was fully intended by the PEP author, and also part of its acceptance.

If these codes are indeed meant to be in the struct module, this usage in the array module looks right to me - hence my proposal to close the issue (the documentation problem aside).

I agree that it is then desirable that the memoryview object supports the codes. However, this is a separate issue from this one (as the codes are not invalid, just unsupported).
msg168369	Author: Alyssa Coghlan (ncoghlan) * (Python committer)	Date: 2012-08-16 10:41
Adding a link to #15625, which is discussing the other end of this issue (whether or not memoryview should support 'u' as a typecode).
msg168373	Author: Martin v. Löwis (loewis) * (Python committer)	Date: 2012-08-16 11:46
Based on the discussion in #15625, it seems that the consensus is to take no action on the format codes in this issue for 3.3, and reconsider in 3.4, to determine in what way the struct module should support Unicode. Instead, the 'u' array code will be deprecated, in the same way in which the rest of the Py_UNICODE API is deprecated.
msg168558	Author: Stefan Krah (skrah) * (Python committer)	Date: 2012-08-19 11:26
If everyone agrees on deprecating 'u', here's a patch. I think that should be sufficient to close this issue (unless we absolutely need deprecation warnings).
msg168561	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2012-08-19 11:48
> If everyone agrees on deprecating 'u', here's a patch. I think
> that should be sufficient to close this issue (unless we absolutely
> need deprecation warnings).

I think a proper deprecation warning is preferable.
msg168567	Author: Alyssa Coghlan (ncoghlan) * (Python committer)	Date: 2012-08-19 12:59
I guess the analogy with bytes objects is that UCS-2 code points can be handled as 16-bit integer objects.

If we're going to do a programmatic deprecation now, that's the only alternative typecode currently available. Do we want to recommend that? Or do we want to postpone programmatic deprecation until we add a 2-byte code point type code for 3.4?
msg168571	Author: Antoine Pitrou (pitrou) * (Python committer)	Date: 2012-08-19 13:13
> I guess the analogy with bytes objects is that UCS-2 code points can be
> handled as 16-bit integer objects.
>
> If we're going to do a programmatic deprecation now, that's the only
> alternative typecode currently available. Do we want to recommend that? Or
> do we want to postpone programmatic deprecation until we add a 2-byte code
> point type code for 3.4?

I don't understand. If you want to handle 16-bit integers, you already have the "h" and "H" type codes.
msg168575	Author: Stefan Krah (skrah) * (Python committer)	Date: 2012-08-19 14:07
Since actual removal is only scheduled for 4.0, I think user warnings can wait until 3.4. By then, we should have sorted out the struct format codes.

In this scenario we would be sort of forced to use 'C', 'U' and 'W' as the new codes, while 'u' and 'w' would continue to linger in the array module for a while.
msg169026	Author: Martin v. Löwis (loewis) * (Python committer)	Date: 2012-08-24 14:48
Stefan, your patch array_deprecate_u.diff is fine. If you get to it, please also rephrase the clause "Python's unicode type"; not sure what the convention is to refer to Py_UNICODE now (perhaps "historical unicode type").
msg169063	Author: Roundup Robot (python-dev) (Python triager)	Date: 2012-08-24 18:18
New changeset 9c7515e29219 by Stefan Krah in branch 'default':
Issue #13072: The array module's 'u' format code is now deprecated and
http://hg.python.org/cpython/rev/9c7515e29219
msg169065	Author: Stefan Krah (skrah) * (Python committer)	Date: 2012-08-24 18:22
Good, I think this can be closed then.

History
Date	User	Action	Args
2022-04-11 14:57:22	admin	set	github: 57281
2012-08-24 18:22:37	skrah	set	status: open -> closed type: behavior messages: + msg169065 resolution: fixed stage: needs patch -> resolved
2012-08-24 18:18:34	python-dev	set	messages: + msg169063
2012-08-24 14:48:47	loewis	set	messages: + msg169026
2012-08-19 14:07:18	skrah	set	messages: + msg168575
2012-08-19 13:13:57	pitrou	set	messages: + msg168571
2012-08-19 12:59:03	ncoghlan	set	messages: + msg168567
2012-08-19 11:48:47	pitrou	set	messages: + msg168561
2012-08-19 11:26:32	skrah	set	files: + array_deprecate_u.diff messages: + msg168558
2012-08-19 11:00:30	georg.brandl	set	priority: deferred blocker -> release blocker
2012-08-16 11:46:23	loewis	set	messages: + msg168373
2012-08-16 10:41:47	ncoghlan	set	messages: + msg168369
2012-08-11 20:07:01	loewis	set	messages: + msg168005
2012-08-11 19:16:32	ncoghlan	set	messages: + msg167997
2012-08-11 09:43:44	loewis	set	nosy: + loewis messages: + msg167947
2012-08-11 06:33:41	georg.brandl	set	priority: release blocker -> deferred blocker messages: + msg167936
2012-08-08 22:47:59	python-dev	set	messages: + msg167732
2012-08-08 20:05:05	pitrou	set	messages: + msg167708
2012-08-08 18:23:27	python-dev	set	messages: + msg167703
2012-08-08 18:13:21	python-dev	set	messages: + msg167702
2012-08-08 07:40:09	pitrou	set	assignee: vstinner messages: + msg167673 stage: resolved -> needs patch
2012-08-06 19:48:02	skrah	set	messages: + msg167571
2012-08-06 14:02:18	ncoghlan	set	messages: + msg167566
2012-08-06 13:19:34	skrah	set	messages: + msg167561
2012-08-06 10:19:44	vstinner	set	messages: + msg167551
2012-08-06 09:52:11	ncoghlan	set	messages: + msg167549
2012-08-06 09:26:01	skrah	set	messages: + msg167547
2012-08-06 09:07:10	skrah	set	nosy: + ncoghlan messages: + msg167546
2012-08-06 08:47:28	skrah	set	messages: + msg167545
2012-08-06 05:47:11	georg.brandl	set	messages: + msg167540
2012-08-05 23:07:25	vstinner	set	priority: normal -> release blocker messages: + msg167522 versions: - Python 2.7, Python 3.2
2012-08-05 23:05:27	vstinner	set	status: closed -> open resolution: fixed -> (no value) messages: + msg167521 files: + array_unicode_format.patch
2012-08-05 22:54:30	python-dev	set	status: open -> closed nosy: + python-dev messages: + msg167520 resolution: fixed stage: resolved
2012-08-01 22:15:32	vstinner	set	nosy: + georg.brandl messages: + msg167173
2012-08-01 19:29:43	skrah	set	messages: + msg167165
2012-08-01 13:31:08	Arfrever	set	nosy: + Arfrever
2012-08-01 12:45:10	vstinner	set	files: + array_revert_pep393-2.patch messages: + msg167122
2012-08-01 12:16:11	skrah	set	messages: + msg167119
2012-08-01 10:19:44	vstinner	set	files: + array_revert_pep393.patch keywords: + patch messages: + msg167112
2012-08-01 10:07:10	skrah	set	messages: + msg167109
2012-08-01 06:59:35	vstinner	set	messages: + msg167091
2012-04-20 21:14:16	skrah	set	messages: + msg158892
2012-04-16 00:19:53	vstinner	set	messages: + msg158381
2011-10-03 14:00:51	skrah	set	messages: + msg144818
2011-10-03 13:34:27	vstinner	set	messages: + msg144817
2011-10-03 10:52:42	skrah	set	messages: + msg144814
2011-10-03 10:44:48	skrah	set	nosy: + meador.inge
2011-10-03 10:34:33	skrah	set	messages: + msg144812
2011-10-01 11:58:07	pitrou	set	nosy: + mark.dickinson, skrah
2011-09-30 00:09:50	vstinner	create
