
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in the Python Developer Guide.

classification
Title: distutils is not reproducible
Type:
Stage: patch review
Components: Library (Lib)
Versions: Python 3.8
process
Status: open
Resolution:
Dependencies: 31377 34093
Superseder:
Assigned To:
Nosy List: benjamin.peterson, bmwiedemann, jefferyto, methane, petr.viktorin, sascha_silbe, vstinner, yan12125, zbysz
Priority: normal
Keywords: patch

Created on 2018-07-03 15:46 by vstinner, last changed 2022-04-11 14:59 by admin.

Pull Requests
URL      Status  Linked
PR 8057  closed  vstinner, 2018-07-03 15:47
PR 8226  open    methane, 2018-07-10 12:23
Messages (9)
msg320988 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-03 15:46
Follow-up to bpo-29708: openSUSE uses a downstream patch for distutils, distutils-reproducible-compile.patch, to fix https://bugzilla.opensuse.org/show_bug.cgi?id=1049186. I converted the patch into a PR: PR 8057.
Naoki INADA wrote:
"""
Currently, marshal uses the refcnt to decide whether to use w_ref. Some immutable objects (especially long and str) can be cached and reused, which may affect the refcnt when byte-compiling.
I think we should use a more deterministic approach than refcnt. Maybe count all constants in the module before marshalling, as we do for co_consts and co_names when compiling a function.
As a bonus, it may also reduce resource usage by merging constants across functions
(e.g. ('self',) co_varnames and (None,) co_consts).
"""
https://github.com/python/cpython/pull/8057#issuecomment-402065657
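The refcount dependence described in the quote can be sketched in a few lines. This is an illustration only, not code from the PR: it marshals the same value once while an extra reference to the string exists and once without, then compares the bytes. The helper name is made up for the example, and whether the two outputs differ depends on the CPython version, so the snippet prints the comparison instead of assuming a result.

```python
import marshal


def dump_with_and_without_extra_ref():
    # Build the string at run time so the compiler cannot fold it into a constant.
    shared = "".join(["deci", "mal"])
    # Here the string is referenced by both the local name and the tuple.
    with_extra_ref = marshal.dumps((shared,))
    # Here the only reference to the string is the tuple itself.
    without_extra_ref = marshal.dumps(("".join(["deci", "mal"]),))
    return with_extra_ref, without_extra_ref


a, b = dump_with_and_without_extra_ref()
print("byte-identical:", a == b)
# Both byte strings always load back to the same value.
print("same value:", marshal.loads(a) == marshal.loads(b))
```

If the first line prints False on a given interpreter, the only thing that changed between the two dumps is how many references pointed at the string.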
Serhiy Storchaka added:
"""
I think we need to understand the issue better before committing changes. Once we have found the source of the instability in file names, we can look for other similar sources and make them stable too. For example, if the source is listdir() or glob(), we can consider sorting the results of every listdir() or glob() call in distutils and related methods.
On the other hand, if the problem is with reference counts in marshal, we can change the marshal module instead.
"""
https://github.com/python/cpython/pull/8057#issuecomment-402198390 
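A minimal sketch of the "sort the listing" idea, assuming the sources are discovered with os.walk(). The helper names iter_sources_sorted and demo are hypothetical and are not part of distutils; the point is only that sorting both the directory names and the file names makes the traversal order independent of the filesystem.

```python
import os
import tempfile


def iter_sources_sorted(root):
    """Yield .py files under root in an order that does not depend on readdir()."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # in-place sort controls the order os.walk descends in
        for name in sorted(filenames):
            if name.endswith(".py"):
                yield os.path.join(dirpath, name)


def demo():
    # Create a small tree in arbitrary order and list it back sorted.
    with tempfile.TemporaryDirectory() as root:
        for rel in ("b/mod.py", "a/z.py", "a/a.py", "a/notes.txt"):
            path = os.path.join(root, *rel.split("/"))
            os.makedirs(os.path.dirname(path), exist_ok=True)
            open(path, "w").close()
        return [os.path.relpath(p, root) for p in iter_sources_sorted(root)]


print(demo())
```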
msg320990 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-03 15:47
Copy of the first message of https://bugzilla.opensuse.org/show_bug.cgi?id=1049186:
"""
e.g. python-simplejson has one-bit diffs in .pyc files
See
http://rb.zq1.de/compare.factory-20170713/python-simplejson-compare.out
in python3-simplejson.rpm we get
-00004e50 68 6f 72 5f 5f da 07 64 65 63 69 6d 61 6c 72 0c |hor__..decimalr.|
+00004e50 68 6f 72 5f 5f 5a 07 64 65 63 69 6d 61 6c 72 0c |hor__Z.decimalr.|
in python3-simplejson-test.rpm we get the opposite change
-00000580 72 13 00 00 00 5a 07 64 65 63 69 6d 61 6c 72 03 |r....Z.decimalr.|
+00000580 72 13 00 00 00 da 07 64 65 63 69 6d 61 6c 72 03 |r......decimalr.|
and it seems to be related to filesystem ordering, since it built reproducibly when using a filesystem with sorted readdir (disorderfs, via reproducible-faketools-filesys from
https://build.opensuse.org/package/show/home:bmwiedemann:reproducible/reproducible-faketools)
"""
https://bugzilla.opensuse.org/show_bug.cgi?id=1049186#c0 
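The one-bit difference in the hexdumps above can be decoded by hand. In CPython's Python/marshal.c, FLAG_REF is 0x80 and the type code 'Z' (0x5a) is TYPE_SHORT_ASCII_INTERNED, so 0xda is the same type code with the reference flag set. A small check, using only the two bytes shown in the first hexdump:

```python
FLAG_REF = 0x80  # value of FLAG_REF in Python/marshal.c

old_byte, new_byte = 0xDA, 0x5A  # the differing bytes from the first hexdump

diff = old_byte ^ new_byte
type_code = chr(old_byte & ~FLAG_REF)

print(hex(diff))                  # 0x80
print(type_code)                  # Z
print(bool(old_byte & FLAG_REF))  # True
print(bool(new_byte & FLAG_REF))  # False
```

In other words, the two builds serialized the same short ASCII string "decimal" and disagreed only on whether it should be recorded as a back-referenceable object.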
msg320991 - Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2018-07-03 15:50
I agree that we should fix the underlying issue (marshal) rather than papering over it by sorting. In fact, we should have a test that compiles a bunch of pycs in random orders and checks whether they're the same.
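A rough sketch of such a test, under stated assumptions: it compiles a few throwaway modules in several shuffled orders within one process and records the distinct SHA-256 digests seen per file. The function names, the sample module contents, and the choice of hash-based pycs (so the source mtime does not end up in the header) are all choices made for this example, not something taken from the CPython test suite.

```python
import hashlib
import py_compile
import random
import tempfile
from pathlib import Path


def pyc_digests_across_orders(sources, rounds=5, seed=0):
    """Compile sources in shuffled orders; return the distinct pyc digests per file."""
    rng = random.Random(seed)
    digests = {src.name: set() for src in sources}
    for _ in range(rounds):
        order = list(sources)
        rng.shuffle(order)
        with tempfile.TemporaryDirectory() as out_dir:
            for src in order:
                cfile = Path(out_dir) / (src.name + "c")
                py_compile.compile(
                    str(src), cfile=str(cfile), doraise=True,
                    invalidation_mode=py_compile.PycInvalidationMode.CHECKED_HASH,
                )
                digests[src.name].add(hashlib.sha256(cfile.read_bytes()).hexdigest())
    return digests


def demo():
    # Three modules that share a constant, similar to the "decimal" string in the report.
    with tempfile.TemporaryDirectory() as src_dir:
        sources = []
        for name in ("alpha.py", "beta.py", "gamma.py"):
            path = Path(src_dir) / name
            path.write_text("import decimal\nNAME = 'decimal'\n")
            sources.append(path)
        return pyc_digests_across_orders(sources)


for name, seen in sorted(demo().items()):
    print(name, "reproducible" if len(seen) == 1 else "differs across orders")
```

A file counts as reproducible in this experiment only if every round produced the same digest for it. A real regression test would likely also vary the process, the hash seed, and the filesystem order.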
msg321383 - Author: Inada Naoki (methane) * (Python committer) Date: 2018-07-10 12:14
Is this issue only for the known marshal issue?
Or is it for all reproducibility issues in distutils, including unknown ones?
msg321408 - Author: Benjamin Peterson (benjamin.peterson) * (Python committer) Date: 2018-07-11 04:39
We should probably discuss the marshal issue in the preëxisting #31377.
I'm not sure if "distutils is not reproducible" is a larger issue than "pyc compilation is not reproducible". This issue could be a meta issue for either.
msg321432 - Author: STINNER Victor (vstinner) * (Python committer) Date: 2018-07-11 10:33
> Is this issue only for the known marshal issue?
IMHO the order in which .pyc files are created on disk also matters. It changes the result of "os.listdir()": some applications may rely on the order of unsorted os.listdir() results. Calling sorted() seems simple and harmless compared to the benefit.
msg321434 - Author: Inada Naoki (methane) * (Python committer) Date: 2018-07-11 10:37
OK, I created a sub-issue for pyc.
msg337975 - Author: Bernhard M. Wiedemann (bmwiedemann) * Date: 2019-03-15 08:58
Unreproducible .pyc files are still one of the major headaches for my work on openSUSE reproducible builds.
There is also one aspect where i586 builds end up with different .pyc files than x86_64 builds. We then randomly choose one of them for our "noarch" Python module packages and hope they work everywhere (including on arm and s390 architectures).
So is someone working towards a concept that makes it possible to create the same .pyc files anywhere?
Can I help with something there?
Is there an ETA?
msg359595 - Author: Petr Viktorin (petr.viktorin) * (Python committer) Date: 2020-01-08 14:05
> There is also one aspect where i586 builds end up with different .pyc files than x86_64 builds. We then randomly choose one of them for our "noarch" Python module packages and hope they work everywhere (including on arm and s390 architectures).
They are functionally identical, despite not being bit-by-bit identical.
If they do not work everywhere, it's a very serious bug.
> So is someone working towards a concept that makes it possible to create the same .pyc files anywhere?
No, it's a known issue no one is working on.
> Can I help with something there?
Maybe?
The two main culprits are in the marshal serialization algorithm: https://github.com/python/cpython/blob/master/Python/marshal.c
Specifically:
- a heuristic depends on refcount (i.e. state of objects in the entire interpreter, rather than just relationships between serialized objects): https://github.com/python/cpython/blob/33b671e72450bf4b5a946ce0dde6b7fe21150108/Python/marshal.c#L304
- (frozen)sets are serialized in iteration order, which is unpredictable (and determining a predictable order is not trivial): https://github.com/python/cpython/blob/33b671e72450bf4b5a946ce0dde6b7fe21150108/Python/marshal.c#L498
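The second culprit can be observed without touching marshal at all. The sketch below starts two child interpreters with different PYTHONHASHSEED values and asks each for the iteration order of the same frozenset of strings. The names NAMES, SNIPPET and iteration_order are made up for this example, and the orders may or may not differ for a particular pair of seeds, so the snippet prints the comparison instead of assuming it.

```python
import json
import os
import subprocess
import sys

NAMES = ["decimal", "float", "int", "str", "bytes", "tuple"]
SNIPPET = "import json, sys; print(json.dumps(list(frozenset(sys.argv[1:]))))"


def iteration_order(hash_seed):
    """Return the frozenset iteration order seen by a child interpreter with this hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(hash_seed))
    result = subprocess.run(
        [sys.executable, "-c", SNIPPET, *NAMES],
        env=env, check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


first, second = iteration_order(1), iteration_order(2)
print("same order:", first == second)
print("same order after sorting:", sorted(first) == sorted(second))
```

Sorting works here because the elements are all strings. For sets with mixed or unorderable element types there is no such obvious key, which is the non-trivial part mentioned above.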
A solution will probably come with an unacceptable performance hit -- it's good to keep generating the .pyc files fast. Two options to overcome that come to mind:
- make reproducibility optional (which would make the testing more cumbersome)
- make an add-on tool to re-serialize an existing .pyc.
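As a very rough sketch of what the second option could look like, the code below reads an existing .pyc, unmarshals the code object, and writes it back with a fresh marshal.dumps(). It assumes the 16-byte header used since Python 3.7 (PEP 552), and the names reserialize_pyc and demo are hypothetical. It only shows the I/O skeleton: the step that would actually normalise the code object is left as a comment, and this sketch makes no claim that its output is deterministic.

```python
import importlib.util
import marshal
import py_compile
import tempfile
from pathlib import Path

HEADER_SIZE = 16  # pyc header length on Python 3.7+ (PEP 552)


def reserialize_pyc(pyc_path):
    """Load the code object from a .pyc and write it back with a fresh marshal.dumps()."""
    data = Path(pyc_path).read_bytes()
    header, payload = data[:HEADER_SIZE], data[HEADER_SIZE:]
    if header[:4] != importlib.util.MAGIC_NUMBER:
        raise ValueError("pyc was written by a different Python version")
    code = marshal.loads(payload)
    # A real tool would normalise `code` here; that step is the open problem.
    Path(pyc_path).write_bytes(header + marshal.dumps(code))
    return code


def demo():
    # Compile a tiny module, re-serialize its pyc, and check it still executes.
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "mod.py"
        src.write_text("X = 41 + 1\n")
        pyc = Path(tmp) / "mod.pyc"
        py_compile.compile(str(src), cfile=str(pyc), doraise=True)
        reserialize_pyc(pyc)
        namespace = {}
        exec(marshal.loads(pyc.read_bytes()[HEADER_SIZE:]), namespace)
        return namespace["X"]


print(demo())
```

Keeping this outside the normal compile path would leave everyday .pyc generation as fast as it is today, at the cost of an extra packaging step.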
History
Date                 User               Action  Args
2022-04-11 14:59:02  admin              set     github: 78214
2020-04-10 13:23:09  yan12125           set     nosy: + yan12125
2020-04-08 12:50:37  jefferyto          set     nosy: + jefferyto
2020-02-24 16:35:26  mcepl              set     nosy: - mcepl
2020-01-08 14:05:06  petr.viktorin      set     nosy: + petr.viktorin
                                                messages: + msg359595
2019-03-15 08:58:26  bmwiedemann        set     nosy: + bmwiedemann
                                                messages: + msg337975
2019-03-06 15:46:44  zbysz              set     nosy: + zbysz
2018-11-13 13:29:54  sascha_silbe       set     nosy: + sascha_silbe
2018-07-11 10:37:20  methane            set     dependencies: + remove *_INTERNED opcodes from marshal, Reproducible pyc: FLAG_REF is not stable.
                                                messages: + msg321434
2018-07-11 10:33:20  vstinner           set     messages: + msg321432
2018-07-11 04:39:09  benjamin.peterson  set     messages: + msg321408
2018-07-10 12:23:27  methane            set     pull_requests: + pull_request7764
2018-07-10 12:14:36  methane            set     nosy: + methane
                                                messages: + msg321383
2018-07-04 23:27:10  mcepl              set     nosy: + mcepl
2018-07-03 15:50:22  benjamin.peterson  set     nosy: + benjamin.peterson
                                                messages: + msg320991
2018-07-03 15:47:56  vstinner           set     messages: + msg320990
2018-07-03 15:47:04  vstinner           set     keywords: + patch
                                                stage: patch review
                                                pull_requests: + pull_request7677
2018-07-03 15:46:25  vstinner           create
