Classification
Title: Bytes objects pickled in 3.x with protocol <=2 are unpickled incorrectly in 2.x
Type: behavior
Stage: resolved
Components: Library (Lib)
Versions: Python 3.2, Python 3.3

Process
Status: closed
Resolution: fixed
Dependencies:
Superseder:
Assigned To: alexandre.vassalotti
Nosy List: alexandre.vassalotti, irmen, meador.inge, pitrou, python-dev, sbt
Priority: high
Keywords: patch

Created on 2011-11-30 02:21 by pitrou, last changed 2022-04-11 14:57 by admin. This issue is now closed.

Files
issue13505-0.patch (uploaded by meador.inge, 2011-12-10 23:15): Patch against tip (3.3.0a0)
issue13505-codecs-encode.patch (uploaded by sbt, 2011-12-12 21:45)
Messages (15)
msg148635 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2011-11-30 02:21
In Python 3.2:
>>> pickle.dumps(b'xyz', protocol=2)
b'\x80\x02c__builtin__\nbytes\nq\x00]q\x01(KxKyKze\x85q\x02Rq\x03.'
In Python 2.7:
>>> pickle.loads(b'\x80\x02c__builtin__\nbytes\nq\x00]q\x01(KxKyKze\x85q\x02Rq\x03.')
'[120, 121, 122]'
The problem is that the bytes() constructor argument is a list of ints, which gives a different result when reconstructed under 2.x where bytes is an alias of str:
>>> pickletools.dis(pickle.dumps(b'xyz', protocol=2))
 0: \x80 PROTO 2
 2: c GLOBAL '__builtin__ bytes'
 21: q BINPUT 0
 23: ] EMPTY_LIST
 24: q BINPUT 1
 26: ( MARK
 27: K BININT1 120
 29: K BININT1 121
 31: K BININT1 122
 33: e APPENDS (MARK at 26)
 34: \x85 TUPLE1
 35: q BINPUT 2
 37: R REDUCE
 38: q BINPUT 3
 40: . STOP
highest protocol among opcodes = 2
Bytearray objects use a different trick: they pass a (unicode string, encoding) pair, which has the same constructor semantics under 2.x and 3.x. Additionally, such an encoding is statistically more efficient: a list of 1-byte ints will take 2 bytes per encoded char, while a latin1-to-utf8 transcoded string (BINUNICODE uses utf-8) will take on average 1.5 bytes per encoded char (assuming a 50% probability of higher-than-127 bytes).
>>> pickletools.dis(pickle.dumps(bytearray(b'xyz'), protocol=2))
 0: \x80 PROTO 2
 2: c GLOBAL '__builtin__ bytearray'
 25: q BINPUT 0
 27: X BINUNICODE 'xyz'
 35: q BINPUT 1
 37: X BINUNICODE 'latin-1'
 49: q BINPUT 2
 51: \x86 TUPLE2
 52: q BINPUT 3
 54: R REDUCE
 55: q BINPUT 4
 57: . STOP
highest protocol among opcodes = 2
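The two points in this message can be checked with a small Python 3 sketch. It is only an emulation: `str(args)` stands in for what 2.x does when `bytes` is an alias of `str`, and all variable names are illustrative, not taken from the issue.

```python
# Python 3 sketch; str() emulates the 2.x behaviour where bytes is str.
args = [120, 121, 122]        # the list of ints stored in the old pickle

as_py3_bytes = bytes(args)    # 3.x: a list of ints becomes b'xyz'
as_py2_bytes = str(args)      # 2.x emulation: the list is simply stringified

# UTF-8 size of each latin-1 decoded byte value, over all 256 values.
utf8_sizes = [
    len(bytes([value]).decode('latin-1').encode('utf-8'))
    for value in range(256)
]
# Values 0-127 take 1 byte, values 128-255 take 2 bytes.
average_utf8_size = sum(utf8_sizes) / 256

print(as_py3_bytes, as_py2_bytes, average_utf8_size)
```

With uniformly distributed byte values the average is 1.5 bytes per char, against 2 bytes per char for the BININT1 list (one opcode byte plus one value byte).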
msg148692 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2011-11-30 22:48
After a bit of testing, my idea was flawed, as str() doesn't accept an encoding parameter in 2.x: `str(u'foo', 'latin1')` simply raises a TypeError.
msg148904 - Author: Alexandre Vassalotti (alexandre.vassalotti) (Python committer) Date: 2011-12-06 03:17
I think we are kind of stuck here. I might need to rely on some clever hack to generate the desired str object in 2.7 without breaking the bytes support in 3.3 and without changing 2.7 itself.
One *dirty* trick I am thinking about would be to use something like array.tostring() to construct the byte string.
 from array import array
 class bytes:
     def __reduce__(self):
         # Rebuild as array('B', data).tostring() on unpickling.
         return (array.tostring, (array('B', self),))
Of course, this doesn't work because pickle doesn't support pickling methods. But maybe someone can figure out a way around this... I don't know.
Also, this is a bit annoying to fix since we changed the semantic meaning of the STRING opcodes in 3.x: they now represent a unicode string instead of a byte string.
msg148911 - Author: Richard Oudkerk (sbt) (Python committer) Date: 2011-12-06 12:25
> One *dirty* trick I am thinking about would be to use something like 
> array.tostring() to construct the byte string.
array('B', ...) objects are pickled using two bytes per character, so there would be no advantage:
 >>> pickle.dumps(array.array('B', b"hello"), 2)
 b'\x80\x02carray\narray\nq\x00X\x01\x00\x00\x00Bq\x01]q\x02(KhKeKlKlKoe\x86q\x03Rq\x04.'
msg149072 - Author: Alexandre Vassalotti (alexandre.vassalotti) (Python committer) Date: 2011-12-09 02:39
sbt, the bug is not that the encoding is inefficient. The problem is we cannot unpickle bytes streams from Python 3 using Python 2.
msg149093 - Author: Richard Oudkerk (sbt) (Python committer) Date: 2011-12-09 13:31
> sbt, the bug is not that the encoding is inefficient. The problem is we 
> cannot unpickle bytes streams from Python 3 using Python 2.
Ah. Well you can do it using codecs.encode.
Python 3.3.0a0 (default, Dec 8 2011, 17:56:13) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle, codecs
>>>
>>> class MyBytes(bytes):
...     def __reduce__(self):
...         return codecs.encode, (self.decode('latin1'), 'latin1')
...
>>> pickle.dumps(MyBytes(b"hello"), 2)
b'\x80\x02c_codecs\nencode\nq\x00X\x05\x00\x00\x00helloq\x01X\x06\x00\x00\x00latin1q\x02\x86q\x03Rq\x04.'
Actually, I notice that array objects created by Python 3 are not decodable on Python 2. See Issue 13566.
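As a quick sanity check of the codecs.encode approach, here is a minimal Python 3 sketch that round-trips every possible byte value through a protocol-2 pickle. It reuses the `MyBytes` class from the session above; `every_byte` and `restored` are illustrative names. It only exercises the 3.x side and does not by itself prove 2.x compatibility.

```python
import codecs
import pickle

class MyBytes(bytes):
    def __reduce__(self):
        # latin-1 maps code points 0-255 one-to-one onto byte values,
        # so decode followed by encode is lossless for any bytes object.
        return codecs.encode, (self.decode('latin1'), 'latin1')

every_byte = MyBytes(range(256))
restored = pickle.loads(pickle.dumps(every_byte, 2))

print(type(restored), restored == bytes(range(256)))
```

Note that the unpickled object is a plain bytes, not a MyBytes, because the reduce callable is codecs.encode rather than the class itself.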
msg149114 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2011-12-09 18:12
> > sbt, the bug is not that the encoding is inefficient. The problem is we 
> > cannot unpickle bytes streams from Python 3 using Python 2.
> 
> Ah. Well you can do it using codecs.encode.
Great. A bit hackish but functional and not too inefficient (50% average expansion).
msg149197 - Author: Meador Inge (meador.inge) (Python committer) Date: 2011-12-10 23:15
I don't really know that much about pickle, but Antoine mentioned that 'bytearray'
works fine going from 3.2 to 2.7. Given that, can't we just compose 'bytes' with
'bytearray'? Something like:
Python 3.3.0a0 (default:aab45b904141+, Dec 10 2011, 13:34:41)
[GCC 4.6.2 20111027 (Red Hat 4.6.2-1)] on linux
Type "help", "copyright", "credits" or "license" for more information.
...
>>> class Bytes(bytes):
...     def __reduce__(self):
...         return bytes, (bytearray(self),)
...
>>> pickletools.dis(pickle.dumps(Bytes(b'abc'), protocol=2))
 0: \x80 PROTO 2
 2: c GLOBAL '__builtin__ bytes'
 21: q BINPUT 0
 23: c GLOBAL '__builtin__ bytearray'
 46: q BINPUT 1
 48: X BINUNICODE 'abc'
 56: q BINPUT 2
 58: X BINUNICODE 'latin-1'
 70: q BINPUT 3
 72: \x86 TUPLE2
 73: q BINPUT 4
 75: R REDUCE
 76: q BINPUT 5
 78: \x85 TUPLE1
 79: q BINPUT 6
 81: R REDUCE
 82: q BINPUT 7
 84: . STOP
highest protocol among opcodes = 2
>>> pickle.dumps(Bytes(b'abc'), protocol=2)
b'\x80\x02c__builtin__\nbytes\nq\x00c__builtin__\nbytearray\nq\x01X\x03\x00\x00\x00abcq\x02X\x07\x00\x00\x00latin-1q\x03\x86q\x04Rq\x05\x85q\x06Rq\x07.'
[meadori@motherbrain cpython]$ python
Python 2.7.2 (default, Oct 27 2011, 01:40:22) 
[GCC 4.6.1 20111003 (Red Hat 4.6.1-10)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
...
>>> pickle.loads(b'\x80\x02c__builtin__\nbytes\nq\x00c__builtin__\nbytearray\nq\x01X\x03\x00\x00\x00abcq\x02X\x07\x00\x00\x00latin-1q\x03\x86q\x04Rq\x05\x85q\x06Rq\x07.')
'abc'
If this method is OK, then the patch is pretty simple. See attached.
msg149232 - Author: Richard Oudkerk (sbt) (Python committer) Date: 2011-12-11 18:17
> I don't really know that much about pickle, but Antoine mentioned that 'bytearray'
> works fine going from 3.2 to 2.7. Given that, can't we just compose 'bytes' with
> 'bytearray'?
Yes, although it would only work for 2.6 and 2.7.
codecs.encode() seems to be available back to 2.4 and codecs.latin_1_encode() back to at least 2.0. They also produce more compact pickles, particularly codecs.latin_1_encode().
>>> from codecs import latin_1_encode, latin_1_decode
>>> class Bytes(bytes):
...     def __reduce__(self):
...         return latin_1_encode, (latin_1_decode(self),)
...
[70922 refs]
>>> pickletools.dis(pickle.dumps(Bytes(b'abc'), 2))
 0: \x80 PROTO 2
 2: c GLOBAL '_codecs latin_1_encode'
 26: q BINPUT 0
 28: X BINUNICODE 'abc'
 36: q BINPUT 1
 38: K BININT1 3
 40: \x86 TUPLE2
 41: q BINPUT 2
 43: \x85 TUPLE1
 44: q BINPUT 3
 46: R REDUCE
 47: q BINPUT 4
 49: . STOP
highest protocol among opcodes = 2
Only worry is that codecs.latin_1_encode.__module__ is '_codecs', and _codecs is undocumented.
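The size argument can be checked with a rough Python 3 sketch that pickles the same payload through the bytearray composition and through codecs.encode. The class names `ViaBytearray` and `ViaCodecs` are hypothetical, and codecs.encode is used here instead of codecs.latin_1_encode so that the unpickled value is a plain bytes object.

```python
import codecs
import pickle

class ViaBytearray(bytes):
    def __reduce__(self):
        # Composition approach: bytes(bytearray(...)) on unpickling.
        return bytes, (bytearray(self),)

class ViaCodecs(bytes):
    def __reduce__(self):
        # codecs approach: encode a latin-1 decoded string on unpickling.
        return codecs.encode, (self.decode('latin1'), 'latin1')

payload = b'abc'
pickle_bytearray = pickle.dumps(ViaBytearray(payload), 2)
pickle_codecs = pickle.dumps(ViaCodecs(payload), 2)

size_bytearray = len(pickle_bytearray)
size_codecs = len(pickle_codecs)
print(size_bytearray, size_codecs)
```

The composition approach needs two GLOBAL opcodes and two REDUCE calls, which is where the extra bytes come from.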
msg149338 - Author: Antoine Pitrou (pitrou) (Python committer) Date: 2011-12-12 17:39
> Only worry is that codecs.latin_1_encode.__module__ is '_codecs', and
> _codecs is undocumented.
It seems we have to choose between two evils here. Given that the codecs.latin_1_encode produces more compact pickles, I'd say go for it.
Note that for the empty bytes object (b""), the encoding can be massively simplified by simply calling bytes() with no argument.
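A small sketch of the empty-bytes special case, as it behaves in the current CPython releases I am aware of (this thread does not show whether the committed patch included it). The helper `global_names` is hypothetical and simply lists the GLOBAL opcode arguments found in a pickle.

```python
import pickle
import pickletools

def global_names(data):
    """Return the 'module name' argument of every GLOBAL opcode in data."""
    return [arg for op, arg, pos in pickletools.genops(data)
            if op.name == 'GLOBAL']

empty_pickle = pickle.dumps(b'', 2)
empty_globals = global_names(empty_pickle)

# The empty object is rebuilt by calling bytes() with no arguments;
# fix_imports maps 'builtins' to '__builtin__' for protocols <= 2.
print(empty_globals, pickle.loads(empty_pickle))
```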
msg149359 - Author: Richard Oudkerk (sbt) (Python committer) Date: 2011-12-12 21:45
I now realise latin_1_encode won't work because it returns a pair (bytes_obj, length).
I have done a patch using _codecs.encode instead -- the pickles turn out to be exactly the same size anyway.
>>> pickletools.dis(pickle.dumps(b"abc", 2))
 0: \x80 PROTO 2
 2: c GLOBAL '_codecs encode'
 18: q BINPUT 0
 20: X BINUNICODE 'abc'
 28: q BINPUT 1
 30: X BINUNICODE 'latin1'
 41: q BINPUT 2
 43: \x86 TUPLE2
 44: q BINPUT 3
 46: R REDUCE
 47: q BINPUT 4
 49: . STOP
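The difference between the two codec functions is easy to see in a short Python 3 sketch. The variable names and the helper `global_names` are illustrative; the last check reflects how current CPython pickles non-empty bytes under protocol 2.

```python
import codecs
import pickle
import pickletools

def global_names(data):
    """Return the 'module name' argument of every GLOBAL opcode in data."""
    return [arg for op, arg, pos in pickletools.genops(data)
            if op.name == 'GLOBAL']

# latin_1_encode returns (encoded_bytes, length_consumed), so using it as
# the reduce callable would unpickle to a tuple rather than a bytes object.
pair = codecs.latin_1_encode('abc')

# codecs.encode returns only the encoded bytes.
plain = codecs.encode('abc', 'latin1')

# Non-empty bytes pickled with protocol 2 reference _codecs.encode.
names = global_names(pickle.dumps(b'abc', 2))
print(pair, plain, names)
```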
msg149368 - Author: Meador Inge (meador.inge) (Python committer) Date: 2011-12-13 00:51
On Sun, Dec 11, 2011 at 12:17 PM, sbt <report@bugs.python.org> wrote:
>> I don't really know that much about pickle, but Antoine mentioned that 'bytearray'
>> works fine going from 3.2 to 2.7. Given that, can't we just compose 'bytes' with
>> 'bytearray'?
>
> Yes, although it would only work for 2.6 and 2.7.
Which is fine. 'bytes' and byte literals were not introduced until 2.6 [1,2]. So *any* solution we come up with is for >= 2.6.
> They also produce more compact pickles, particularly codecs.latin_1_encode().
Now that is a better argument.
[1] http://www.python.org/dev/peps/pep-0358/
[2] http://www.python.org/dev/peps/pep-3112/ 
msg149371 - Author: Richard Oudkerk (sbt) (Python committer) Date: 2011-12-13 01:54
> Which is fine. 'bytes' and byte literals were not introduced until
> 2.6 [1,2]. So *any* solution we come
> up with is for >= 2.6.
In 2.6 and 2.7, bytes is just an alias for str. In all 2.x versions with codecs.encode, the result will be str. (Although I haven't actually tested earlier than 2.6.)
Python 2.6.5 (r265:79063, Jun 12 2010, 17:07:01)
[GCC 4.3.4 20090804 (release) 1] on cygwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import pickle
>>> pickle.loads('\x80\x02c_codecs\nencode\nq\x00X\x03\x00\x00\x00abcq\x01X\x06\x00\x00\x00latin1q\x02\x86q\x03Rq\x04.')
'abc'
>>> type(_)
<type 'str'>
msg149399 - Author: Roundup Robot (python-dev) (Python triager) Date: 2011-12-13 18:23
New changeset 14695b4825dc by Alexandre Vassalotti in branch '3.2':
Issue #13505: Make pickling of bytes object compatible with Python 2.
http://hg.python.org/cpython/rev/14695b4825dc 
msg149400 - Author: Alexandre Vassalotti (alexandre.vassalotti) (Python committer) Date: 2011-12-13 18:29
Fixed. Thanks for the patch!
History
Date  User  Action  Args
2022-04-11 14:57:24  admin  set  github: 57714
2011-12-13 18:29:50  alexandre.vassalotti  set  status: open -> closed; messages: + msg149400; assignee: alexandre.vassalotti; resolution: fixed; stage: needs patch -> resolved
2011-12-13 18:23:20  python-dev  set  nosy: + python-dev; messages: + msg149399
2011-12-13 01:54:48  sbt  set  messages: + msg149371
2011-12-13 00:51:42  meador.inge  set  messages: + msg149368
2011-12-12 21:45:13  sbt  set  files: + issue13505-codecs-encode.patch; messages: + msg149359
2011-12-12 17:39:16  pitrou  set  messages: + msg149338
2011-12-11 18:17:53  sbt  set  messages: + msg149232
2011-12-10 23:15:12  meador.inge  set  files: + issue13505-0.patch; keywords: + patch; messages: + msg149197
2011-12-09 18:12:58  pitrou  set  messages: + msg149114
2011-12-09 13:31:16  sbt  set  messages: + msg149093
2011-12-09 02:39:47  alexandre.vassalotti  set  messages: + msg149072
2011-12-06 12:25:47  sbt  set  nosy: + sbt; messages: + msg148911
2011-12-06 03:17:21  alexandre.vassalotti  set  messages: + msg148904
2011-11-30 22:48:26  pitrou  set  messages: + msg148692
2011-11-30 05:33:09  meador.inge  set  nosy: + meador.inge; stage: needs patch
2011-11-30 02:21:47  pitrou  create
