homepage

This issue tracker has been migrated to GitHub , and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Reproducible pyc: frozenset is not serialized in a deterministic order
Type: Stage: resolved
Components: Interpreter Core Versions: Python 3.11
process
Status: closed Resolution:
Dependencies: Superseder:
Assigned To: brandtbucher Nosy List: FFY00, brandtbucher, gvanrossum, jefferyto, lukasz.langa, methane, obfusk, pablogsal, rhettinger, serhiy.storchaka, vstinner
Priority: normal Keywords: patch

Created on 2019年07月15日 15:05 by vstinner, last changed 2022年04月11日 14:59 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 27769 closed FFY00, 2021年08月14日 15:48
PR 27926 merged brandtbucher, 2021年08月24日 04:48
PR 28068 merged brandtbucher, 2021年08月30日 15:25
PR 28147 merged brandtbucher, 2021年09月03日 19:02
Messages (36)
msg347969 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2019年07月15日 15:05
See bpo-29708 meta issue and https://reproducible-builds.org/ for reproducible builds.
pyc files are not fully reproducible yet: frozenset items are not serialized in a deterministic order
One solution would be to modify marshal to sort frozenset items before serializing them. The issue is how to handle items which cannot be compared. Example:
>>> l=[float("nan"), b'bytes', 'unicode']
>>> l.sort()
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
TypeError: '<' not supported between instances of 'bytes' and 'float'
One workaround for types which cannot be compared is to use the type name in the key used to compare items:
>>> l.sort(key=lambda x: (type(x).__name__, x))
>>> l
[b'bytes', nan, 'unicode']
Note: comparison between bytes and str raises a BytesWarning exception when using python3 -bb.
Second problem: how to handle exceptions when comparison raises an error anyway?
Another solution would be to use the PYTHONHASHSEED environment variable. For example, if SOURCE_DATE_EPOCH is set, PYTHONHASHSEED would be set to 0. This option is not my favorite because it disables a security fix against denial of service on dict and set:
https://python-security.readthedocs.io/vuln/hash-dos.html
--
Previous discussions on reproducible frozenset:
* https://mail.python.org/pipermail/python-dev/2018-July/154604.html
* https://bugs.python.org/issue34093#msg321523
See also bpo-34093: "Reproducible pyc: FLAG_REF is not stable" and PEP 552 "Deterministic pycs".
msg366124 - (view) Author: (yan12125) * Date: 2020年04月10日 13:22
issue34722 also talks about frozenset, nondeterministic order and sorting. Maybe this ticket and that one are for the same issue?
msg391118 - (view) Author: Filipe Laíns (FFY00) * (Python triager) Date: 2021年04月15日 02:38
Normal sets have the same issue, see bpo-43850.
Would it be reasonable to make it so that sets are always created with the definition order? Looking at the set implementation, this seems perfectly possible.
msg391156 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021年04月15日 23:49
> Would it be reasonable to make it so that sets are 
> always created with the definition order?
No, it would not. We would also have to maintain order across set operations such as intersection which which would become dramatically more expensive if they had to maintain order. For example intersecting a million element set with a ten element set always takes ten steps regardless of the order of arguments, but to maintain order of the left hand operand could take a hundred times more work.
msg391157 - (view) Author: Filipe Laíns (FFY00) * (Python triager) Date: 2021年04月16日 00:10
> No, it would not. We would also have to maintain order across set operations such as intersection which which would become dramatically more expensive if they had to maintain order. For example intersecting a million element set with a ten element set always takes ten steps regardless of the order of arguments, but to maintain order of the left hand operand could take a hundred times more work.
Can these operations happen during bytecode generation? I am fairly new to these internals so my understanding is not great. During bytecode generation is can code that performs such operations run?
msg391158 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021年04月16日 00:12
s/hundred/hundred thousand/
msg391159 - (view) Author: Filipe Laíns (FFY00) * (Python triager) Date: 2021年04月16日 00:13
s/is can/can/
msg393465 - (view) Author: Filipe Laíns (FFY00) * (Python triager) Date: 2021年05月11日 16:35
Another idea, would it be possible to add a flag to turn on reproducibility, sacrificing performance? This flag could be set when generating bytecode, where the performance hit shouldn't be that relevant.
msg394311 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021年05月25日 10:17
> Another idea, would it be possible to add a flag to turn on reproducibility, sacrificing performance?
The flag is the SOURCE_DATE_EPOCH env var, no?
msg394361 - (view) Author: Filipe Laíns (FFY00) * (Python triager) Date: 2021年05月25日 14:35
I would not expect SOURCE_DATE_EPOCH to sacrifice performance. During packaging, SOURCE_DATE_EPOCH is always set, and sometimes we need to perform expensive operations. We only need this behavior during cache generation, making the solution not optimal.
Backtracking a bit to your proposal for sorting the elements. Is it possible to have two different types with the same name? We need a unique identifier for each type.
After that, we need the type to allow sorting/comparing items, which AFAIK is not something we can guarantee.
We could certainly do the sorting where we are able to, and bail out if impossible, which I feel should handle the majority of cases. This is not optimal, but reasonable.
Is there any way we could something like resetting the hash seed during cache generation?
msg394373 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021年05月25日 15:57
Possible solution: add an ordered subtype of frozenset which would keep an array of items in the original order. The compiler only creates frozenset when optimizes "x in {1, 2}" or "for x in {1, 2}". It should now create an ordered frozenset from a list of constants (removing possible duplicates). The marshal module should save items in that order and restore ordered frozensets when load data. It should not increase memory consumption too much, because frozenset constants in code are rare and small.
msg394377 - (view) Author: Filipe Laíns (FFY00) * (Python triager) Date: 2021年05月25日 16:39
What about normal sets? They also suffer from the same issue.
msg394419 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2021年05月26日 07:57
> What about normal sets?
pyc files don't contain a regular set. So it is out of scope of this issue.
msg394431 - (view) Author: Filipe Laíns (FFY00) * (Python triager) Date: 2021年05月26日 12:48
Ah, my bad! Though, thinking about it, it does make sense. If that's the case, then the argument Raymond provided against preserving order does not seem that relevant, as we would only need to preserve the order in the creation operation. What do you think? Is there anything I may be missing here? :)
msg394501 - (view) Author: Inada Naoki (methane) * (Python committer) Date: 2021年05月27日 00:44
> If that's the case, then the argument Raymond provided against preserving order does not seem that relevant, as we would only need to preserve the order in the creation operation.
Note that PYC files are marshalled from code objects including frozenset instance, not from AST.
When marshaling, frozenset instance is created already and its creation order is lost .
msg394524 - (view) Author: Filipe Laíns (FFY00) * (Python triager) Date: 2021年05月27日 09:23
I understand, the proposal would be to make frozensets keep the creation order.
msg398563 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021年07月30日 16:00
> I understand, the proposal would be to make frozensets keep the creation order.
That would increase the memory consumption of all frozen set instances, which is likely not going to fly
msg398565 - (view) Author: Pablo Galindo Salgado (pablogsal) * (Python committer) Date: 2021年07月30日 16:03
The only way I can see here is to go with a similar strategy as Serhiy proposes, which seems that it has a non trivial complication (and a new type, which I am not very fond of) but is a bit cleaner than changing the semantics of the type, which affects much more than the storage.
msg400153 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2021年08月23日 17:37
Could this issue be fixed in marshal itself? Off the top of my head, one possible option could be to use the marshalled bytes of each elements as a sort key, rather than the elements themselves. So serialize, *then* sort?
msg400156 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2021年08月23日 17:51
No, it cannot be fixed in marshal itself.
s = {("string", 1), ("string", 2), ("string", 3)}
All tuples contain references to the same string. The first serialized tuple will contain serialization of the string, all other will contain references to it. So the binary representation of the tuple depends on whether it is serialized first on not first.
msg400159 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2021年08月23日 18:06
Ah, yeah.
Could we add a flag to disable the reference mechanism, just for frozenset elements? It would make marshalled frozensets a bit bigger (unless we re-marshalled each one after sorting)... but I still prefer that to adding more logic/subclasses to frozenset.
msg400166 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2021年08月23日 19:10
This rough proof-of-concept seems to have the desired effect:
diff --git a/Python/marshal.c b/Python/marshal.c
index 1260704c74..70f9c4b109 100644
--- a/Python/marshal.c
+++ b/Python/marshal.c
@@ -503,9 +503,23 @@ w_complex_object(PyObject *v, char flag, WFILE *p)
 W_TYPE(TYPE_SET, p);
 n = PySet_GET_SIZE(v);
 W_SIZE(n, p);
- while (_PySet_NextEntry(v, &pos, &value, &hash)) {
+ PyObject *pairs = PyList_New(0);
+ for (Py_ssize_t i = 0; _PySet_NextEntry(v, &pos, &value, &hash); i++) {
+ PyObject *pair = PyTuple_New(2);
+ PyObject *dump = PyMarshal_WriteObjectToString(value, p->version);
+ PyTuple_SET_ITEM(pair, 0, dump);
+ Py_INCREF(value);
+ PyTuple_SET_ITEM(pair, 1, value);
+ PyList_Append(pairs, pair);
+ Py_DECREF(pair);
+ }
+ PyList_Sort(pairs);
+ for (Py_ssize_t i = 0; i < n; i++) {
+ PyObject *pair = PyList_GET_ITEM(pairs, i);
+ PyObject *value = PyTuple_GET_ITEM(pair, 1);
 w_object(value, p);
 }
+ Py_DECREF(pairs);
 }
 else if (PyCode_Check(v)) {
 PyCodeObject *co = (PyCodeObject *)v;
I can clean it up and convert it to a PR if we decide we want to go this route.
msg400180 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021年08月23日 22:22
> I can clean it up and convert it to a PR if we decide 
> we want to go this route.
+1 This is by far the smallest intervention that has been discussed.
msg400186 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021年08月24日 00:19
Here's pure python code for experimentation:
 from marshal import dumps, loads
 def marshal_set(s):
 return dumps(sorted(s, key=dumps))
 def unmarshal_set(m):
 return frozenset(loads(m))
 def test(s):
 assert unmarshal_set(marshal_set(s)) == s
 
 test({("string", 1), ("string", 2), ("string", 3)})
msg400254 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2021年08月25日 11:14
New changeset 33d95c6facdfda3c8c0feffa7a99184e4abc2f63 by Brandt Bucher in branch 'main':
bpo-37596: Make `set` and `frozenset` marshalling deterministic (GH-27926)
https://github.com/python/cpython/commit/33d95c6facdfda3c8c0feffa7a99184e4abc2f63
msg400255 - (view) Author: Łukasz Langa (lukasz.langa) * (Python committer) Date: 2021年08月25日 11:20
This is a bona fide enhancement and thus out of scope for backports. Since this is merged for 3.11, I'm closing the issue.
Thanks, everyone, this was some non-trivial design and implementation effort!
msg400260 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021年08月25日 13:12
Should the error paths decref the key and return NULL as they do elsewhere in the function?
msg400264 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2021年08月25日 13:37
Hm, not quite sure what you mean. Are you talking about just replacing each of the new gotos with "Py_DECREF(pairs); return;"?
Error handling for this whole module is a bit unconventional. Some of the error paths in this function decrement the recursion depth counter, but I *think* that’s actually incorrect here... it looks like it’s our caller’s (w_object) responsibility to do that, error or no error.
msg400270 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2021年08月25日 14:14
Looking again, I think code is correct as-is (am not sure about the depth adjustment though).
Stylistically, it is different from the other blocks w_complex_object() that always have a "return" after setting p->error. The new code jumps to "anyset_done" and then falls through the "elif" block rather than issuing a "return".
Since nothing else happens below the if-elif-else chain, this is without consequence.
msg400670 - (view) Author: Guido van Rossum (gvanrossum) * (Python committer) Date: 2021年08月30日 21:27
Thanks! This comes right in time, because we're working on freezing many more modules, and modules containing frozen sets didn't have a consistent frozen representation. Now they do!
(See issue45019, issue45020)
msg400755 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2021年08月31日 16:18
New changeset 51999c960e7fc45feebd629421dec6524a5fc803 by Brandt Bucher in branch 'main':
bpo-37596: Clean up the set/frozenset marshalling code (GH-28068)
https://github.com/python/cpython/commit/51999c960e7fc45feebd629421dec6524a5fc803
msg400988 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021年09月03日 11:27
I reopen the issue.
test_marshal failed on AMD64 Arch Linux Usan 3.x:
https://buildbot.python.org/all/#/builders/719/builds/108
======================================================================
FAIL: test_deterministic_sets (test.test_marshal.BugsTestCase) [set([('string', 1), ('string', 2), ('string', 3)])]
----------------------------------------------------------------------
Traceback (most recent call last):
 File "/buildbot/buildarea/3.x.pablogsal-arch-x86_64.clang-ubsan/build/Lib/test/test_marshal.py", line 365, in test_deterministic_sets
 self.assertNotEqual(repr_0, repr_1)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: b"{('string', 1), ('string', 2), ('string', 3)}\n" == b"{('string', 1), ('string', 2), ('string', 3)}\n"
======================================================================
FAIL: test_deterministic_sets (test.test_marshal.BugsTestCase) [frozenset([('string', 1), ('string', 2), ('string', 3)])]
----------------------------------------------------------------------
Traceback (most recent call last):
 File "/buildbot/buildarea/3.x.pablogsal-arch-x86_64.clang-ubsan/build/Lib/test/test_marshal.py", line 365, in test_deterministic_sets
 self.assertNotEqual(repr_0, repr_1)
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError: b"frozenset({('string', 1), ('string', 2), ('string', 3)})\n" == b"frozenset({('string', 1), ('string', 2), ('string', 3)})\n"
msg400989 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2021年09月03日 11:44
The test failed at:
 def test_deterministic_sets(self):
 # bpo-37596: To support reproducible builds, sets and frozensets need to
 # have their elements serialized in a consistent order (even when they
 # have been scrambled by hash randomization):
 for kind in ("set", "frozenset"):
 for elements in (
 "float('nan'), b'a', b'b', b'c', 'x', 'y', 'z'",
 # Also test for bad interactions with backreferencing:
 "('string', 1), ('string', 2), ('string', 3)",
 ):
 s = f"{kind}([{elements}])"
 with self.subTest(s):
 # First, make sure that our test case still has different
 # orders under hash seeds 0 and 1. If this check fails, we
 # need to update this test with different elements:
 args = ["-c", f"print({s})"]
 _, repr_0, _ = assert_python_ok(*args, PYTHONHASHSEED="0")
 _, repr_1, _ = assert_python_ok(*args, PYTHONHASHSEED="1")
 self.assertNotEqual(repr_0, repr_1) # <=== HERE
 (...)
It checks that the representation of a set is different for two different PYTHONHASHSEED values (0 and 1). On my Fedora 34, I confirm that they are different:
PYTHONHASHSEED=0:
vstinner@apu$ PYTHONHASHSEED=0 ./python -c "print(set([('string', 1), ('string', 2), ('string', 3)]))"
{('string', 1), ('string', 2), ('string', 3)}
vstinner@apu$ PYTHONHASHSEED=0 ./python -c "print(set([('string', 1), ('string', 2), ('string', 3)]))"
{('string', 1), ('string', 2), ('string', 3)}
vstinner@apu$ PYTHONHASHSEED=0 ./python -c "print(set([('string', 1), ('string', 2), ('string', 3)]))"
{('string', 1), ('string', 2), ('string', 3)}
versus PYTHONHASHSEED=1:
vstinner@apu$ PYTHONHASHSEED=1 ./python -c "print(set([('string', 1), ('string', 2), ('string', 3)]))"
{('string', 3), ('string', 1), ('string', 2)}
vstinner@apu$ PYTHONHASHSEED=1 ./python -c "print(set([('string', 1), ('string', 2), ('string', 3)]))"
{('string', 3), ('string', 1), ('string', 2)}
vstinner@apu$ PYTHONHASHSEED=1 ./python -c "print(set([('string', 1), ('string', 2), ('string', 3)]))"
msg400993 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2021年09月03日 13:36
Thanks for finding this, Victor.
That failure is surprising to me. Is it really possible for the order of the elements in a set to vary based on platform or build configuration (even with a fixed PYTHONHASHSEED at runtime)?
Really, this looks like it’s only a bug in the test’s (read "my") assumptions, not really in marshal itself. I’m glad I added this little sanity check, though.
msg401000 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2021年09月03日 15:55
I'm compiling Clang now to try to reproduce using a UBSan build (I'm on Ubuntu, though).
I'm not entirely familiar with how these sanitizer builds work... could the implication be that we're hitting undefined behavior at some point? Or is it just a red herring?
Note also that the "set([float('nan'), b'a', b'b', b'c', 'x', 'y', 'z'])" and "frozenset([float('nan'), b'a', b'b', b'c', 'x', 'y', 'z'])" tests seem to be working just fine... meaning their ordering on this buildbot is different under PYTHONHASHSEEDs 0 and 1 (as expected). It may still be a platform-or-configuration-dependent ordering, though.
Raymond: off the top of your head, are there any obvious reasons this could be happening?
msg401002 - (view) Author: Brandt Bucher (brandtbucher) * (Python committer) Date: 2021年09月03日 16:26
Found it. This particular build is configured with HAVE_ALIGNED_REQUIRED=1, which forces it to use fnv instead siphash24 as its string hashing algorithm.
History
Date User Action Args
2022年04月11日 14:59:18adminsetgithub: 81777
2021年09月04日 14:20:01brandtbuchersetstatus: open -> closed
stage: patch review -> resolved
2021年09月03日 19:02:38brandtbuchersetstage: resolved -> patch review
pull_requests: + pull_request26586
2021年09月03日 16:26:21brandtbuchersetmessages: + msg401002
2021年09月03日 15:55:17brandtbuchersetassignee: brandtbucher
messages: + msg401000
2021年09月03日 13:36:38brandtbuchersetmessages: + msg400993
2021年09月03日 11:44:18vstinnersetmessages: + msg400989
2021年09月03日 11:27:30vstinnersetstatus: closed -> open

nosy: + vstinner
messages: + msg400988

resolution: fixed ->
2021年08月31日 16:18:43brandtbuchersetmessages: + msg400755
2021年08月30日 21:27:11gvanrossumsetnosy: + gvanrossum
messages: + msg400670
2021年08月30日 15:25:30brandtbuchersetpull_requests: + pull_request26512
2021年08月25日 14:14:29rhettingersetstatus: open -> closed

messages: + msg400270
2021年08月25日 13:37:33brandtbuchersetmessages: + msg400264
2021年08月25日 13:12:02rhettingersetstatus: closed -> open

messages: + msg400260
2021年08月25日 11:20:36lukasz.langasetstatus: open -> closed
resolution: fixed
messages: + msg400255

stage: patch review -> resolved
2021年08月25日 11:14:38lukasz.langasetnosy: + lukasz.langa
messages: + msg400254
2021年08月25日 09:18:40lukasz.langasetversions: + Python 3.11, - Python 3.9
2021年08月24日 04:48:15brandtbuchersetpull_requests: + pull_request26377
2021年08月24日 00:19:36rhettingersetmessages: - msg400182
2021年08月24日 00:19:22rhettingersetmessages: + msg400186
2021年08月23日 22:58:09rhettingersetmessages: + msg400182
2021年08月23日 22:22:51rhettingersetmessages: + msg400180
2021年08月23日 19:10:53brandtbuchersetmessages: + msg400166
2021年08月23日 18:06:45brandtbuchersetmessages: + msg400159
2021年08月23日 17:51:34serhiy.storchakasetmessages: + msg400156
2021年08月23日 17:37:53brandtbuchersetmessages: + msg400153
2021年08月23日 17:24:00brandtbuchersetnosy: + brandtbucher
2021年08月14日 15:48:00FFY00setkeywords: + patch
stage: patch review
pull_requests: + pull_request26244
2021年07月30日 16:03:28pablogsalsetmessages: + msg398565
2021年07月30日 16:00:20pablogsalsetnosy: + pablogsal
messages: + msg398563
2021年07月30日 10:51:14obfusksetnosy: + obfusk
2021年05月27日 09:23:08FFY00setmessages: + msg394524
2021年05月27日 06:47:13vstinnersetnosy: - vstinner
2021年05月27日 00:44:48methanesetmessages: + msg394501
2021年05月26日 12:48:31FFY00setmessages: + msg394431
2021年05月26日 07:57:03methanesetnosy: + methane
messages: + msg394419
2021年05月26日 04:07:14rhettingersetmessages: - msg394414
2021年05月26日 03:47:49rhettingersetmessages: + msg394414
2021年05月25日 16:39:52FFY00setmessages: + msg394377
2021年05月25日 15:57:25serhiy.storchakasetnosy: + serhiy.storchaka
messages: + msg394373
2021年05月25日 14:35:13FFY00setmessages: + msg394361
2021年05月25日 10:17:19vstinnersetmessages: + msg394311
2021年05月11日 16:35:22FFY00setmessages: + msg393465
2021年04月16日 04:44:50yan12125setnosy: - yan12125
2021年04月16日 00:13:23FFY00setmessages: + msg391159
2021年04月16日 00:12:47rhettingersetmessages: + msg391158
2021年04月16日 00:10:53FFY00setmessages: + msg391157
2021年04月15日 23:49:04rhettingersetnosy: + rhettinger
messages: + msg391156
2021年04月15日 02:38:55FFY00setnosy: + FFY00
messages: + msg391118
2021年02月03日 18:09:23steve.dowerunlinkissue29708 dependencies
2020年10月22日 20:39:28eric.araujolinkissue29708 dependencies
2020年04月10日 13:22:03yan12125setnosy: + yan12125
messages: + msg366124
2020年04月08日 12:47:20jefferytosetnosy: + jefferyto
2019年07月15日 15:05:11vstinnercreate

AltStyle によって変換されたページ (->オリジナル) /