Issue 2389: Array pickling exposes internal memory representation of elements

This issue tracker has been migrated to GitHub and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

This issue has been migrated to GitHub: https://github.com/python/cpython/issues/46642

classification

Title:	Array pickling exposes internal memory representation of elements
Type:	behavior	Stage:
Components:	Extension Modules	Versions:	Python 3.2, Python 2.7

process

Dependencies:	Superseder:
Status:	closed	Resolution:	fixed
Assigned To:	Nosy List:	alexandre.vassalotti, benjamin.peterson, collinwinter, gvanrossum, hniksic, jcea, loewis, rhettinger
Priority:	critical	Keywords:	patch

Created on 2008-03-18 14:38 by hniksic, last changed 2022-04-11 14:56 by admin. This issue is now closed.

Files
File name	Uploaded	Description	Edit
portable_array_pickling.diff	alexandre.vassalotti, 2009-06-27 03:43
portable_array_pickling-2.diff	alexandre.vassalotti, 2009-07-06 23:30

Messages (21)
msg63915	Author: Hrvoje Nikšić (hniksic) *	Date: 2008-03-18 14:38
It would seem that pickling arrays directly exposes the underlying machine words, making the pickle non-portable to platforms with a different layout of array elements. The guts of array.__reduce__ look like this:

    if (array->ob_size > 0) {
        result = Py_BuildValue("O(cs#)O",
                               array->ob_type,
                               array->ob_descr->typecode,
                               array->ob_item,
                               array->ob_size * array->ob_descr->itemsize,
                               dict);
    }

The byte string that is pickled is directly created from the array's contents. Unpickling calls array_new which in turn calls array_fromstring, which ends up memcpying the string data to the new array.

As far as I can tell, array pickles created on one platform cannot be unpickled on a platform with different endianness (in case of integer arrays), wchar_t size (in case of unicode arrays) or floating-point representation (rare in practice, but possible). If pickles are supposed to be platform-independent, this should be fixed.

Maybe the "typecode" field when used with the constructor could be augmented to include information about the elements, such as endianness and floating-point format. Or we should simply punt and pickle the array as a list of Python objects that comprise it...?
msg64236	Author: Hrvoje Nikšić (hniksic) *	Date: 2008-03-21 12:14
Here is an example that directly demonstrates the bug.

Pickling on x86_64:

Python 2.5.1 (r251:54863, Mar 21 2008, 13:06:31)
[GCC 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import array, cPickle as pickle
>>> pickle.dumps(array.array('l', [1, 2, 3]))
"carray\narray\np1\n(S'l'\nS'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x02\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x03\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\ntRp2\n."

Unpickling on ia32:

Python 2.5.1 (r251:54863, Oct 5 2007, 13:36:32)
[GCC 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import cPickle as pickle
>>> pickle.loads("carray\narray\np1\n(S'l'\nS'\\x01\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x02\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x03\\x00\\x00\\x00\\x00\\x00\\x00\\x00'\ntRp2\n.")
array('l', [1, 0, 2, 0, 3, 0])
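
The same misreading can be reproduced on a single machine without two interpreters. The following is only a sketch: it uses the struct module to stand in for the two platforms, assuming an 8-byte little-endian C long on the pickling side and a 4-byte little-endian C long on the unpickling side, as in the session above.

```python
import struct

# Three 8-byte little-endian integers: the bytes an LP64 platform would
# store for array('l', [1, 2, 3]).
payload = struct.pack('<3q', 1, 2, 3)

# A platform whose C long is 4 bytes interprets the same 24 bytes as six items.
misread = list(struct.unpack('<6i', payload))
print(misread)
```

Each original value is followed by a zero item because the upper four bytes of every 8-byte element are read as a separate element.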
msg65469	Author: Guido van Rossum (gvanrossum) * (Python committer)	Date: 2008-04-14 18:01
This looks indeed wrong. Unfortunately it also looks hard to fix in a way that won't break unpickling arrays pickled by a previous Python version. We won't be able to fix this in 2.5 (it'll be a new feature) but we should try to fix this in 2.6 and 3.0.
msg70473	Author: Benjamin Peterson (benjamin.peterson) * (Python committer)	Date: 2008-07-31 02:07
Ping.
msg70491	Author: Raymond Hettinger (rhettinger) * (Python committer)	Date: 2008-07-31 08:06
At this point, I think it better to wait until Py2.7/3.1. Changing it now would just complicate efforts to port from 2.5 to 2.6 to 3.0. Guido, do you agree?
msg70743	Author: Guido van Rossum (gvanrossum) * (Python committer)	Date: 2008-08-05 15:46
Agreed, this has been broken for a long time, and few people have noticed or complained. Let's wait.
msg70774	Author: Hrvoje Nikšić (hniksic) *	Date: 2008-08-06 07:29
I guess it went unnoticed due to the prevalence of little-endian 32-bit machines. With 64-bit architectures becoming more and more popular, this might become a bigger issue. Raymond, why do you think fixing this bug would complicate porting to 2.6/3.0?
msg71000	Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer)	Date: 2008-08-11 05:27
I don't see why this cannot be fixed easily. All we need to do is fix the __reduce__ method of array objects to emit a list--i.e. with array.tolist()--instead of a memory string. Since the reduce protocol is just a fancy way to store the constructor arguments, this won't break unpickling of array objects pickled by previous Python versions. And here is a patch against the trunk.
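
As an illustration of the idea only (this is not the attached patch), the list-based reduction can be tried from pure Python with a per-pickler dispatch table. The helper names reduce_as_list and dumps_as_list are hypothetical and exist only for this sketch.

```python
import array
import copyreg
import io
import pickle


def reduce_as_list(arr):
    """Reduce an array to its constructor arguments: a typecode and a plain list."""
    return type(arr), (arr.typecode, arr.tolist())


def dumps_as_list(obj, protocol=2):
    """Pickle obj with arrays reduced through reduce_as_list (hypothetical helper)."""
    buf = io.BytesIO()
    pickler = pickle.Pickler(buf, protocol)
    # A per-pickler dispatch table takes precedence over the array type's
    # own reduction method, without touching global state.
    pickler.dispatch_table = copyreg.dispatch_table.copy()
    pickler.dispatch_table[array.array] = reduce_as_list
    pickler.dump(obj)
    return buf.getvalue()


original = array.array('l', [1, 2, 3])
restored = pickle.loads(dumps_as_list(original))
print(restored)
```

Because the pickle stores only the array type, the typecode and a list of Python integers, unpickling goes through the ordinary array constructor and does not depend on the item width or byte order of the machine that wrote it.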
msg71037	Author: Guido van Rossum (gvanrossum) * (Python committer)	Date: 2008-08-11 23:41
Wouldn't that be lots and lots slower? I believe speed is one of the reasons why the binary representation is currently dumped.
msg71044	Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer)	Date: 2008-08-12 05:45
The slowdown depends on the array type. The patch makes array unpickling noticeably slower (between 4 and 15 times slower, depending on the array type). In general, pickling is about as fast as with the binary representation (or faster!). Although, since most 64-bit compilers use the LP64 model, I think we could make a compromise and only pickle as a list arrays of long integers. This would fix the problem without any visible speed penalties.
msg71048	Author: Hrvoje Nikšić (hniksic) *	Date: 2008-08-12 08:00
Unfortunately dumping the internal representation of non-long arrays won't work, for several reasons. First, it breaks when porting pickles between platforms of different endianness such as Intel and SPARC. Then, it ignores the considerable work put into correctly pickling floats, including the support for IEEE 754 special values. Finally, it will break when unpickling Unicode character arrays pickled on different Python versions -- wchar_t is 2 bytes wide on Windows, 4 bytes on Unix.

I believe pickling arrays to compact strings is the right approach on the grounds of efficiency and I wouldn't change it. We must only be careful to pickle to a string with a portable representation of values. The straightforward way to do this is to pick a "standard" size for types (much like the struct module does) and endianness and use it in the pickled array. Ints are simple, and the code for handling floats is already there, for example _PyFloat_Pack8 used by cPickle.

Pickling arrays as lists is probably a decent workaround for the pending release because it's backward and forward compatible (old pickles will work as well as before and new pickles will be correctly read by old Python versions), but for the next release I would prefer to handle this the right way. If there is agreement on this, I can start work on a patch in the following weeks.
msg71050	Author: Martin v. Löwis (loewis) * (Python committer)	Date: 2008-08-12 10:11
I'd like to challenge the view of what "correct" behavior is here. If I pickle an array of 32-bit integer values on one system, and unpickle it as an array of 64-bit integer values on a different system, is that correct, or incorrect?

IMO, correct behavior would preserve the width as much as possible. For integers, this should be straightforward, as it should be for floats and doubles (failing to unpickle them if the target system doesn't support a certain format). For Unicode, I think the array module should grow platform-independent width, for both 2-byte and 4-byte Unicode.

When pickling, the pickle should always use network byte order; alternatively, the pickle should contain a byte order marker (presence of which could also be used as an indication that the new array pickle format is used). IOW, <i would indicate little-endian four-byte integers, and so on.
msg71051	Author: Hrvoje Nikšić (hniksic) *	Date: 2008-08-12 10:29
I think preserving integer width is a good idea because it saves us from having to throw overflow errors when unpickling to machines with different width of C types. The cost is that pickling/unpickling the array might change the array's typecode, which can be a problem for C code that processes the array's buffer and expects the C type to remain invariant.

Instead of sticking to network byte order, I propose to include byte order information in the pickle (for example as '<' or '>' like struct does), so that pickling/unpickling between the same-endianness architectures doesn't have to convert at all. Floats are always pickled as IEEE754, but the same optimization (not having to convert anything) would apply when unpickling a float array on an IEEE754 architecture.

Preserving widths and including endianness information would allow pickling to be as fast as it is now (with the exception of unicode chars and floats on non-IEEE754 platforms). It would also allow unpickling to be as fast between architectures with equal endianness, and correct between others.
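
The "no conversion on the same endianness" idea can be sketched in a few lines. This is a simplified illustration, not the proposed pickle format: dump_with_order and load_with_order are hypothetical names, and the sketch assumes the item width for the typecode is the same on both sides, so it handles only the byte-order half of the proposal.

```python
import array
import sys

NATIVE_ORDER = '<' if sys.byteorder == 'little' else '>'


def dump_with_order(arr):
    """Return (byte-order marker, typecode, raw bytes); nothing is converted."""
    return NATIVE_ORDER, arr.typecode, arr.tobytes()


def load_with_order(order, typecode, data):
    """Rebuild the array, swapping bytes only if the data has a foreign order."""
    arr = array.array(typecode)
    arr.frombytes(data)
    if order != NATIVE_ORDER:
        arr.byteswap()  # the only extra cost, paid on cross-endian loads
    return arr


source = array.array('i', [1, 2, 3])
copy = load_with_order(*dump_with_order(source))
print(copy)
```

On the writing side the buffer is copied as-is, and on the reading side the byte swap is skipped whenever the marker matches the native order, which is where the speed argument above comes from.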
msg71067	Author: Guido van Rossum (gvanrossum) * (Python committer)	Date: 2008-08-12 18:29
> Instead of sticking to network byte order, I propose to include byte
> order information in the pickle (for example as '<' or '>' like struct
> does), so that pickling/unpickling between the same-endianness
> architectures doesn't have to convert at all. Floats are always pickled
> as IEEE754, but the same optimization (not having to convert anything)
> would apply when unpickling a float array on an IEEE754 architecture.
>
> Preserving widths and including endianness information would allow
> pickling to be as fast as it is now (with the exception of unicode chars
> and floats on non-IEEE754 platforms). It would also allow unpickling to
> be as fast between architecture with equal endianness, and correct
> between others.

This sounds like the best approach yet -- it can be made backwards compatible (so 2.6 can read 2.5 pickles at least on the same platform) and can be just as fast when unpickling on the same platform, and only slightly slower on a different platform.
msg71109	Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer)	Date: 2008-08-14 06:21
I'm all in for a standardized representation of array's pickles (with width and endianness preserved). However, for this to happen, we will need either to change array's constructor to support at least the byte-order specification (like struct) or to add built-in support for array in the pickle module (which could be done without modifying the pickle protocol).
msg71110	Author: Martin v. Löwis (loewis) * (Python committer)	Date: 2008-08-14 07:18
I think changing the array constructor is fairly easy: just pick a set of codes that are defined to be platform-neutral (i.e. for each size two codes, one for each endianness). For example, the control characters (\0..\x1F) could be used in the following way:

char, signed-byte, unsigned byte: c, b, B
(big/little)
sint16: 1,2
uint16: 3,4
sint32: 5,6
uint32: 7,8
sint64: 9,10
uint64: 11,12
float: 13,14
double: 15,16
UCS-2: 17,18
UCS-4: 19,20

In the above scheme, even codes are little-endian, odd codes are big-endian. Converting the codes to "native" codes could be table-driven.
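
A table-driven lookup for the numeric codes listed above might look like the following sketch. It is an illustration of the numbering only: CODE_TABLE and decode_items are hypothetical names, the struct format characters are used as a stand-in for "native" codes, and the UCS-2/UCS-4 codes (17..20) are left out because struct has no matching format.

```python
import struct

# struct format characters for codes 1..16, in the order of the list above:
# sint16, uint16, sint32, uint32, sint64, uint64, float, double.
_KINDS = ['h', 'H', 'i', 'I', 'q', 'Q', 'f', 'd']

# code -> (byte-order marker, struct character); odd = big, even = little.
CODE_TABLE = {}
for index, kind in enumerate(_KINDS):
    CODE_TABLE[2 * index + 1] = ('>', kind)
    CODE_TABLE[2 * index + 2] = ('<', kind)


def decode_items(code, data):
    """Decode raw bytes written under a platform-neutral code into a list."""
    order, kind = CODE_TABLE[code]
    count = len(data) // struct.calcsize(order + kind)
    return list(struct.unpack(f'{order}{count}{kind}', data))


print(decode_items(5, b'\x00\x00\x00\x01'))  # big-endian sint32
print(decode_items(6, b'\x01\x00\x00\x00'))  # little-endian sint32
```

With an explicit '<' or '>' prefix, struct uses fixed standard sizes, so the widths in the table do not vary between platforms.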
msg85298	Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer)	Date: 2009-04-03 07:51
Ah, I just remembered the smart way I had devised some time ago to handle this issue without changing the constructor of array.array. The trick would be to add a __reduce__ method to array.array. This method would return a special constructor function, the binary data of the array and a string representing the format of the array. Upon unpickling, the special constructor function would be called with the binary data and its format and then it would recreate the array.

Now, the only thing I am not sure about is whether this would work well with subclasses of array.array. I guess we could make __reduce__ also return the instance's type, which could be used by the special constructor to recreate the instance from the proper subclass.
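
The shape of such a reduction can be sketched in pure Python. This is not the patch discussed in this issue: rebuild_array and reduce_array are hypothetical names, the format string borrows struct notation as an assumption, and only signed integer typecodes are handled to keep the example short.

```python
import array
import struct
import sys

# Item width in bytes -> struct character with a fixed standard size.
_STRUCT_CODE = {1: 'b', 2: 'h', 4: 'i', 8: 'q'}  # signed integers only


def rebuild_array(cls, typecode, fmt, data):
    """Hypothetical special constructor called at unpickling time."""
    count = len(data) // struct.calcsize(fmt)
    values = struct.unpack(f'{fmt[0]}{count}{fmt[1]}', data)
    # Building from values lets the receiving platform use its own item width.
    return cls(typecode, values)


def reduce_array(arr):
    """Return (constructor, args) in the style of the reduce protocol."""
    order = '<' if sys.byteorder == 'little' else '>'
    fmt = order + _STRUCT_CODE[arr.itemsize]
    # type(arr) travels with the data so subclasses are recreated correctly.
    return rebuild_array, (type(arr), arr.typecode, fmt, arr.tobytes())


class MyArray(array.array):
    """Stand-in subclass used to check that the type is preserved."""


func, args = reduce_array(MyArray('l', [1, 2, 3]))
restored = func(*args)
print(type(restored).__name__, restored.tolist())
```

The format string describes the bytes as written, so the constructor function can decode them correctly even when the reading platform has a different byte order or a different width for the same typecode.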
msg89751	Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer)	Date: 2009-06-27 03:43
Here's a patch that implements the solution I described in msg85298. Please give it a good review: http://codereview.appspot.com/87072
msg90162	Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer)	Date: 2009-07-05 22:47
I would like to commit my patch later this week. So if you see any issue with the patch, please speak up.
msg90197	Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer)	Date: 2009-07-06 23:30
I now believe that arrays should be pickled as a list of values on Python 2.x. Doing otherwise makes it impossible to unpickle arrays coming from Python 2.x using Python 3.x, since pickle on Python 3 decodes all the strings from 2.x to Unicode. However, we still can use the compact memory representation on Python 3.x. So, I propose that we change the array module on Python 2.x to emit a list instead of a memory string and implement the portable array pickling mechanism only on Python 3.x.
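
The string-decoding behaviour can be observed with a hand-written protocol-0 pickle of a Python 2 str. This sketch reflects how a current Python 3 pickle module treats such data and its encoding argument; it is an illustration added here, not something taken from the patch.

```python
import pickle

# Protocol-0 pickle of the one-byte Python 2 str '\xff', written by hand.
py2_pickle = b"S'\\xff'\n."

try:
    pickle.loads(py2_pickle)  # 2.x strings are decoded as ASCII by default
    default_result = 'decoded'
except UnicodeDecodeError:
    default_result = 'UnicodeDecodeError'

as_text = pickle.loads(py2_pickle, encoding='latin1')  # decoded to a str
as_bytes = pickle.loads(py2_pickle, encoding='bytes')  # left as raw bytes
print(default_result, repr(as_text), repr(as_bytes))
```

A memory string holding arbitrary machine words is therefore not safely readable as text on Python 3, whereas a pickled list of integers carries no string payload at all.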
msg90541	Author: Alexandre Vassalotti (alexandre.vassalotti) * (Python committer)	Date: 2009-07-15 18:22
Committed fix for 3.x in r74013 and for 2.x in r74014.

History
Date	User	Action	Args
2022-04-11 14:56:32	admin	set	github: 46642
2009-07-15 18:22:30	alexandre.vassalotti	set	status: open -> closed resolution: fixed messages: + msg90541
2009-07-06 23:30:40	alexandre.vassalotti	set	files: + portable_array_pickling-2.diff messages: + msg90197 versions: + Python 3.2, - Python 3.1
2009-07-05 22:47:43	alexandre.vassalotti	set	files: - fix_array_pickling.patch
2009-07-05 22:47:24	alexandre.vassalotti	set	messages: + msg90162
2009-06-27 03:43:08	alexandre.vassalotti	set	files: + portable_array_pickling.diff messages: + msg89751
2009-04-04 00:32:11	collinwinter	set	nosy: + collinwinter
2009-04-03 07:51:55	alexandre.vassalotti	set	messages: + msg85298
2008-08-14 07:18:30	loewis	set	messages: + msg71110
2008-08-14 06:21:37	alexandre.vassalotti	set	messages: + msg71109
2008-08-12 18:29:18	gvanrossum	set	messages: + msg71067
2008-08-12 10:29:30	hniksic	set	messages: + msg71051
2008-08-12 10:11:10	loewis	set	nosy: + loewis messages: + msg71050
2008-08-12 08:00:28	hniksic	set	messages: + msg71048
2008-08-12 05:45:12	alexandre.vassalotti	set	messages: + msg71044
2008-08-11 23:41:02	gvanrossum	set	messages: + msg71037
2008-08-11 05:27:18	alexandre.vassalotti	set	files: + fix_array_pickling.patch nosy: + alexandre.vassalotti messages: + msg71000 keywords: + patch
2008-08-06 07:29:14	hniksic	set	messages: + msg70774
2008-08-05 15:46:55	gvanrossum	set	assignee: gvanrossum -> messages: + msg70743 versions: + Python 3.1, Python 2.7, - Python 2.6, Python 3.0
2008-07-31 08:06:10	rhettinger	set	assignee: gvanrossum messages: + msg70491 nosy: + rhettinger
2008-07-31 02:07:40	benjamin.peterson	set	nosy: + benjamin.peterson messages: + msg70473
2008-04-26 03:10:36	jcea	set	nosy: + jcea
2008-04-14 18:01:59	gvanrossum	set	priority: critical nosy: + gvanrossum messages: + msg65469 versions: + Python 3.0, - Python 2.5
2008-03-21 12:14:10	hniksic	set	messages: + msg64236
2008-03-18 14:38:08	hniksic	create