homepage

This issue tracker has been migrated to GitHub , and is currently read-only.
For more information, see the GitHub FAQs in the Python's Developer Guide.

classification
Title: Add tp_fastnew and tp_fastinit to PyTypeObject, 15-20% faster object instanciation
Type: performance Stage:
Components: Interpreter Core Versions: Python 3.7
process
Status: closed Resolution: rejected
Dependencies: Superseder:
Assigned To: Nosy List: josh.r, methane, python-dev, rhettinger, serhiy.storchaka, vstinner, yselivanov
Priority: normal Keywords:

Created on 2017年01月24日 08:25 by vstinner, last changed 2022年04月11日 14:58 by admin. This issue is now closed.

Pull Requests
URL Status Linked Edit
PR 72 vstinner, 2017年01月24日 08:25
Messages (10)
msg286151 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017年01月24日 08:25
After #29259 "Add tp_fastcall to PyTypeObject: support FASTCALL calling convention for all callable objects", the two last slots which still use the (args: tuple, kwargs: dict) calling convention are tp_new and tp_init, two major slots to instanciate objects.
I implemented tp_fastnew/tp_fastinit on top of the issue #29259 pull request (tp_fastcall). The implementation is a WIP, it's just complete enough to start benchmarking the code.
Example of benchmarks on the two types currently optimized in my WIP fast_init branch, list and _asyncio.Future:
---
haypo@smithers$ ./python -m perf timeit -s 'List=list' 'List()' --duplicate=100 --compare-to=../default-ref/python 
Median +- std dev: [ref] 81.9 ns +- 0.2 ns -> [fast_init] 69.3 ns +- 0.4 ns: 1.18x faster (-15%)
haypo@smithers$ ./python -m perf timeit -s 'List=list' 'List((1,2,3))' --duplicate=100 --compare-to=../default-ref/python 
Median +- std dev: [ref] 137 ns +- 6 ns -> [fast_init] 107 ns +- 0 ns: 1.28x faster (-22%)
haypo@smithers$ ./python -m perf timeit -s 'import _asyncio, asyncio; Future=_asyncio.Future; loop=asyncio.get_event_loop()' 'Future(loop=loop)' --compare-to=../default-ref/python
Median +- std dev: [ref] 411 ns +- 20 ns -> [fast_init] 355 ns +- 18 ns: 1.16x faster (-14%)
---
The speedup of tp_fastnew + tp_fastinit is between 1.16x faster and 1.28x faster. The question is now if it is worth it.
Warning: The code is not fully optimized and is likely to have subtle bugs. The pull request is not ready for a review, but you may take a look if you would like to have an idea of the required changes. The most tricky changes are made in typeobject.c to support backward compatibility (call tp_new if tp_fastnew is not set) and stable API (support Python 3.6 PyTypeObject without the two slots).
Note: tp_fastnew and tp_fastinit slots are the expected end of my large FASTCALL optimization project.
msg286161 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017年01月24日 09:24
Serhiy Storchaka: "Calling _PyStack_AsDict() with non-string or non-unique keywords means that the var-keyword argument was first unpacked to arrays of keyword names and values and then converted back to a dict. Seems something is done non-efficiently." (msg286159 of the issue #29360)
Python code from test_dict:
 dict(**invalid)
Bytecode:
 LOAD_GLOBAL 1 (dict)
 BUILD_TUPLE 0
 LOAD_FAST 1 (invalid)
 CALL_FUNCTION_EX 1
Call stack:
* _PyEval_EvalFrameDefault()
* do_call_core()
* PyObject_Call()
* _Py_RawFastCallDict() -- conversion from dict to stack+kwnames
* type_call()
* call_init()
* _PyStack_AsDict() -- convertsion from stack+kwnames to dict
Oh right, there are two conversions using a temporary FASTCALL format for (keyword) arguments.
This code was not upstream yet, it comes from the pull request of this issue.
Maybe we need two flavors of type_call(): type_call(args: tuple, kwargs: tuple) if tp_fastinit isn't set, type_fastcall(stack, nargs, kwnames) (FASTCALL) if tp_fastinit is set.
But it seems that the logic should be implemented in PyObject_Call() and _PyObject_FastCallDict(), it cannot be implemented in type_call(), it's too late.
For best performances (avoid any kind of conversion and avoid any temporary object), we should implement something like that.
msg286203 - (view) Author: Yury Selivanov (yselivanov) * (Python committer) Date: 2017年01月24日 18:50
Great. Can you share the Richards benchmark results? It specifically stresses class instantiation, so we should be able to see the improvement.
msg286207 - (view) Author: Josh Rosenberg (josh.r) * (Python triager) Date: 2017年01月24日 19:30
So just to be clear, the issue with non-string or non-unique keywords is purely about performance, right, not correctness/non-crashiness? Non-string keywords, while technically accepted by CPython, are at best barely legal by the language standard. The language standard for calls specifies that even using the **expression syntax "expression must evaluate to a mapping, the contents of which are treated as additional keyword arguments" and keywords_arguments only allows identifier=expression, where identifier is a legal variable name (and therefore, a str).
If misuse of ** unpacking is slower, but still works, and correct usage is significantly faster, I wouldn't consider an inability to fix the performance for the misuse case a blocking issue; nice to have, but making common, good code faster is worth making rare, bad code slower.
msg286216 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017年01月24日 21:03
First benchmark result on the fast_init compared to the default branch. Warning: the patch is still a WIP, there are maybe performance issues (in this case, this benchmark should help to identify them).
---
haypo@speed-python$ python3 -m perf compare_to 2017年01月23日_14-02-default-cebc9c7ad195.json fast_init-0_ref_cebc9c7ad195.json -G --min-speed=2
Slower (6):
- telco: 13.0 ms +- 0.2 ms -> 13.8 ms +- 0.3 ms: 1.06x slower (+6%)
- scimark_sor: 391 ms +- 7 ms -> 413 ms +- 11 ms: 1.06x slower (+6%)
- fannkuch: 938 ms +- 7 ms -> 973 ms +- 13 ms: 1.04x slower (+4%)
- pickle_dict: 53.7 us +- 3.4 us -> 55.5 us +- 5.0 us: 1.03x slower (+3%)
- sqlite_synth: 8.00 us +- 0.12 us -> 8.25 us +- 0.14 us: 1.03x slower (+3%)
- scimark_monte_carlo: 202 ms +- 5 ms -> 207 ms +- 5 ms: 1.03x slower (+3%)
Faster (19):
- scimark_lu: 362 ms +- 13 ms -> 335 ms +- 17 ms: 1.08x faster (-8%)
- chameleon: 23.5 ms +- 0.3 ms -> 21.7 ms +- 0.3 ms: 1.08x faster (-7%)
- xml_etree_process: 179 ms +- 3 ms -> 168 ms +- 3 ms: 1.07x faster (-6%)
- pickle_pure_python: 1.03 ms +- 0.02 ms -> 982 us +- 11 us: 1.05x faster (-5%)
- regex_dna: 283 ms +- 1 ms -> 271 ms +- 1 ms: 1.05x faster (-4%)
- chaos: 239 ms +- 2 ms -> 230 ms +- 3 ms: 1.04x faster (-4%)
- django_template: 356 ms +- 5 ms -> 343 ms +- 5 ms: 1.04x faster (-4%)
- call_simple: 10.7 ms +- 0.4 ms -> 10.3 ms +- 0.3 ms: 1.04x faster (-3%)
- hexiom: 18.0 ms +- 0.1 ms -> 17.4 ms +- 0.1 ms: 1.04x faster (-3%)
- pathlib: 42.3 ms +- 0.5 ms -> 40.9 ms +- 0.4 ms: 1.03x faster (-3%)
- regex_compile: 379 ms +- 2 ms -> 367 ms +- 4 ms: 1.03x faster (-3%)
- go: 501 ms +- 7 ms -> 485 ms +- 4 ms: 1.03x faster (-3%)
- xml_etree_generate: 208 ms +- 4 ms -> 202 ms +- 3 ms: 1.03x faster (-3%)
- deltablue: 14.5 ms +- 0.3 ms -> 14.1 ms +- 0.2 ms: 1.03x faster (-3%)
- unpickle_pure_python: 652 us +- 8 us -> 634 us +- 9 us: 1.03x faster (-3%)
- richards: 147 ms +- 2 ms -> 143 ms +- 4 ms: 1.03x faster (-3%)
- nqueens: 213 ms +- 2 ms -> 208 ms +- 2 ms: 1.03x faster (-2%)
- raytrace: 1.08 sec +- 0.01 sec -> 1.05 sec +- 0.02 sec: 1.02x faster (-2%)
- unpickle: 31.0 us +- 0.4 us -> 30.3 us +- 1.0 us: 1.02x faster (-2%)
Benchmark hidden because not significant (34): (...)
Ignored benchmarks (5) of 2017年01月23日_14-02-default-cebc9c7ad195.json: mako, sympy_expand, sympy_integrate, sympy_str, sympy_sum
---
Reminder: The fast_init branch is based on the issue #29259 (tp_fastcall), so the result combines tp_fastcall and tp_fastnew/fastinit.
msg286271 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017年01月25日 17:42
I implemented more optimizations:
* type_call() uses FASTCALL
* object_new() and object_init() use FASTCALL
* _asyncio.Future new & init use FASTCALL
As a result, create an _asyncio.Future is now 1.45x faster:
haypo@smithers$ ./python -m perf timeit -s 'import _asyncio, asyncio; Future=_asyncio.Future; loop=asyncio.get_event_loop()' 'Future(loop=loop)' --compare-to=../default-ref/python 
Median +- std dev: [ref] 411 ns +- 8 ns -> [fast_init] 283 ns +- 10 ns: 1.45x faster (-31%)
Yury> Great. Can you share the Richards benchmark results? It specifically stresses class instantiation, so we should be able to see the improvement.
While bm_richards.py doesn't call _PyStack_AsTuple() anyone in the benchmark (it seems like it full uses FASTCALL), it seems like there is simply zero impact on performance :-(
$ ./python -m perf compare_to richards_ref.json richards_fastinit.json 
Benchmark hidden because not significant (1): richards
msg286290 - (view) Author: Roundup Robot (python-dev) (Python triager) Date: 2017年01月26日 02:18
New changeset 8feb6a5ce4c6 by Victor Stinner in branch 'default':
Issue #29358: Add postcondition checks on types
https://hg.python.org/cpython/rev/8feb6a5ce4c6 
msg286315 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017年01月26日 14:14
While they are obvious speedup on microbenchmarks, it doesn't speedup "macro" benchmarks from performance as much as I expected.
The changes required in typeobject.c to support "tp_new or tp_fastnew" and "tp_init or tp_fastinit" are large and very complex. I'm not sure that my code is correct. update_one_slot(), add_operators() and PyType_Ready() functions are complex beast, sadly almost not documented, except such comment...
 /* The __new__ wrapper is not a wrapper descriptor,
 so must be special-cased differently.
 If we don't do this, creating an instance will
 always use slot_tp_new which will look up
 __new__ in the MRO which will call tp_new_wrapper
 which will look through the base classes looking
 for a static base and call its tp_new (usually
 PyType_GenericNew), after performing various
 sanity checks and constructing a new argument
 list. Cut all that nonsense short -- this speeds
 up instance creation tremendously. */
 specific = (void *)type->tp_new;
 /* XXX I'm not 100% sure that there isn't a hole
 in this reasoning that requires additional
 sanity checks. I'll buy the first person to
 point out a bug in this reasoning a beer. */
How am I supposed to be confident in front of such coment :-D
I expect that calling a functions (tp_call) is a more common operation than instanciating a new object (tp_new + tp_init). So I don't think that the overall change is worth it.
For all these reaons, I close the issue as REJECTED.
msg286316 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2017年01月26日 14:52
As for update_one_slot() see also issue5322 and issue25731.
msg286317 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2017年01月26日 14:58
Serhiy Storchaka:
> As for update_one_slot() see also issue5322 and issue25731.
Oh, thanks for the pointers! Now I understand much better these bugs.
I'm quite sure that they are still flaws in this code when a type is modified after PyType_Ready(), like sysmodule.c::
 /* prevent user from creating new instances */
 FlagsType.tp_init = NULL;
 FlagsType.tp_new = NULL;
But each time I had to dig into typeobject.c, my head is going to explode :-D I may try to understand one more time, but not today ;-)
History
Date User Action Args
2022年04月11日 14:58:42adminsetgithub: 73544
2017年01月26日 14:58:30vstinnersetmessages: + msg286317
2017年01月26日 14:52:42serhiy.storchakasetmessages: + msg286316
2017年01月26日 14:14:22vstinnersetstatus: open -> closed
resolution: rejected
messages: + msg286315
2017年01月26日 02:18:06python-devsetnosy: + python-dev
messages: + msg286290
2017年01月25日 17:42:16vstinnersetmessages: + msg286271
2017年01月24日 21:03:35vstinnersetmessages: + msg286216
2017年01月24日 19:30:34josh.rsetnosy: + josh.r
messages: + msg286207
2017年01月24日 18:50:12yselivanovsetmessages: + msg286203
2017年01月24日 09:24:34vstinnersetmessages: + msg286161
2017年01月24日 08:25:29vstinnercreate

AltStyle によって変換されたページ (->オリジナル) /