
This issue tracker has been migrated to GitHub, and is currently read-only.
For more information, see the GitHub FAQs in Python's Developer Guide.

classification
Title: Speed-up oparg decoding on little-endian machines
Type: performance Stage: resolved
Components: Interpreter Core Versions: Python 3.6
process
Status: closed Resolution: out of date
Dependencies: Superseder: ceval: Wordcode follow up, explicit unsigned short read
View: 27097
Assigned To: serhiy.storchaka Nosy List: Demur Rumed, arigo, mark.dickinson, pitrou, rhettinger, serhiy.storchaka, skrah, tim.peters, vstinner
Priority: normal Keywords: patch

Created on 2015年12月08日 09:34 by rhettinger, last changed 2022年04月11日 14:58 by admin. This issue is now closed.

Files
File name Uploaded Description
improve_arg_decoding.diff rhettinger, 2015年12月08日 15:14
improve_arg_decoding2.diff rhettinger, 2015年12月08日 17:57 Post-increment version
improve_arg_decoding3.diff serhiy.storchaka, 2015年12月09日 09:53
exercise_oparg.py rhettinger, 2015年12月11日 21:44 Simple timing
Messages (21)
msg256106 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015年12月08日 09:34
On little-endian machines, the decoding of an oparg can be sped up by using a single 16-bit pointer dereference.
Current decoding:
 leaq 2(%rcx), %rbp
 movzbl -1(%rbp), %eax
 movzbl -2(%rbp), %r14d
 sall 8,ドル %eax
 addl %eax, %r14d
New decoding:
 leaq 2(%rdx), %r12
 movzwl -2(%r12), %r8d
The patch uses (unsigned short *) like the struct module does, but it could use uint16_t if necessary.
If next_instr can be advanced after the lookup rather than before, the generated code would be tighter still (removing the data dependency and shortening the movzwl instruction to drop the offset byte):
 movzwl (%rdx), %r8d
 leaq 2(%rdx), %rbp
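For reference, the two decoding strategies look roughly like this in C (a sketch inferred from the description above, showing two alternative definitions of the same macro; the exact macros in ceval.c and in the attached patches may differ):
 /* Portable byte-wise decode ("current decoding"): two byte loads,
 a shift, and an add, all sequentially dependent. */
 #define NEXTARG() (next_instr += 2, (next_instr[-1] << 8) + next_instr[-2])
 /* Little-endian decode ("new decoding"): the bytecode stores the low
 byte first, so a little-endian host can fetch the whole oparg with
 a single 16-bit load. The cast is what the strict-aliasing
 discussion below is about; the post-increment variant, which reads
 before advancing, also drops the offset byte from the load. */
 #define NEXTARG() (next_instr += 2, *(unsigned short *)(next_instr - 2))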
msg256108 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015年12月08日 09:47
You forgot to attach a patch.
msg256114 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015年12月08日 15:14
Added patch.
msg256117 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015年12月08日 15:45
In about half of the cases, next_instr points to an unaligned 16-bit value. Not all platforms allow access to unaligned data. We need another test in addition to PY_LITTLE_ENDIAN to allow this optimization.
msg256118 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2015年12月08日 15:58
Hmm. Doesn't this introduce undefined behaviour? The new code looks like a violation of the strict aliasing rule. (Or do we compile with `-fno-strict-aliasing` or the like?)
msg256120 - (view) Author: Stefan Krah (skrah) * (Python committer) Date: 2015年12月08日 16:22
Usually unaligned accesses are believed to carry a big performance penalty, though rumor has it that for the latest generation of CPUs this is no longer an issue.
msg256146 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2015年12月09日 08:27
The latest patch still replaces valid C with code that has undefined behaviour. That's not good! Introducing undefined behaviour is something we should avoid without a really good reason.
Benchmarks showing dramatic real-world speed improvements from this change might count as a good reason, of course. :-)
msg256149 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2015年12月09日 09:53
I think the following patch doesn't introduce undefined behavior. With this patch, GCC on 32-bit i386 Linux produces the same code as for a simple unsigned short read.
I don't know whether the benefit is worth such complication. I don't know whether the patch can cause a performance regression on other platforms or compilers.
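The memcpy idiom discussed in the next messages looks roughly like this (a minimal sketch, not the attached improve_arg_decoding3.diff itself, which reportedly also involves a union; the helper name is invented):
 #include <stdint.h>
 #include <string.h>

 /* Read a 16-bit value (in host byte order, hence the PY_LITTLE_ENDIAN
 test mentioned above) from a possibly unaligned, possibly
 differently-typed buffer. memcpy is well-defined for any alignment
 and any effective type, and compilers that understand the idiom
 compile it down to a single 16-bit load. */
 static inline uint16_t
 read_u16(const unsigned char *p)
 {
 uint16_t value;
 memcpy(&value, p, sizeof(value));
 return value;
 }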
msg256151 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2015年12月09日 10:11
> I think the following patch doesn't introduce undefined behavior.
Agreed. As I recall, the memcpy trick is a fairly standard way around this issue, for compilers that are smart enough to compile away the actual memcpy call.
> I don't know whether the patch can cause a performance regression on other platforms or compilers.
Me neither.
msg256152 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2015年12月09日 10:14
Benchmarks would help here: if none of the patches gives a significant real-world speedup, then the whole UB discussion is moot.
msg256236 - (view) Author: Raymond Hettinger (rhettinger) * (Python committer) Date: 2015年12月11日 21:44
I verified that Clang and GCC both give the expected disassembly with Serhiy's patch. We ought to restrict the #if to just the compilers that are known to optimize away the memcpy.
Clang (for 'BUILD_LIST_UNPACK')
-------------------------------
 .loc 10 2525 9 ## Python/ceval.c:2525:9
 movzwl (%r13), %r9d
 addq 2,ドル %r13
 Ltmp2042:
 ##DEBUG_VALUE: PyEval_EvalFrameEx:next_instr <- R13
GCC (for 'BUILD_LIST_UNPACK')
----------------------------- 
 LM1275:
 movzwl (%rdx), %r8d
 LVL1147:
 leaq 2(%rdx), %rbp
[Mark]
> Benchmarks showing dramatic real-world speed improvements ...
Much of the doubling of speed for core Python that has occurred over the last decade has come one little step at a time, none of them individually "dramatic". In general, if we have a chance to reduce the workload in the ceval inner loop, we should take it.
A simple benchmark on clang shows a roughly 10+% speedup in code exercising simple and common opcodes that have an oparg (there is no point in benchmarking the effect on opcodes like IMPORT_NAME, where the total eval-loop overhead is already an insignificant proportion of the total work).
Baseline version with CLANG Apple LLVM version 7.0.2 (clang-700.1.81)
 $ ./python.exe exercise_oparg.py 
 0.22484053499647416
 $ ./python.exe exercise_oparg.py 
 0.22687773499637842
 $ ./python.exe exercise_oparg.py 
 0.22026274001109414
Patched version with CLANG Apple LLVM version 7.0.2 (clang-700.1.81)
 $ ./python.exe exercise_oparg.py 
 0.19516360601119231
 $ ./python.exe exercise_oparg.py 
 0.20087355599389412
 $ ./python.exe exercise_oparg.py 
 0.1980393300036667
To better isolate the effect, I suppose you could enable the READ_TIMESTAMP macros to precisely measure the effect of replacing five sequentially dependent instructions with two independent instructions, but likely all it would show is that the two are cheaper than the five.
msg256275 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2015年12月12日 09:53
Fwiw, I made a trivial benchmark in C that loads aligned and misaligned shorts ( http://paste.pound-python.org/show/HwnbCI3Pqsj8bx25Yfwp/ ). It shows that the memcpy() version takes only 65% of the time taken by the two-bytes-loaded version on a 2010 laptop. It takes 75% of the time on a modern server. On a recent little-endian PowerPC machine, 96%. On aarch64, only 45% faster (i.e. more than twice faster). This is all with gcc. It seems that using memcpy() is definitely a win nowadays.
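The paste link above appears to be dead; a micro-benchmark in the same spirit might compare loops like the following (a hypothetical reconstruction, not Armin's actual code):
 #include <stdint.h>
 #include <string.h>

 /* Byte-wise decode: two loads, a shift, and an add per iteration. */
 uint32_t sum_bytewise(const unsigned char *p, size_t n) {
 uint32_t s = 0;
 for (size_t i = 0; i + 1 < n; i += 2)
 s += p[i] | (p[i + 1] << 8);
 return s;
 }

 /* memcpy decode: one (possibly unaligned) 16-bit load per iteration;
 passing an odd starting pointer exercises the misaligned case. */
 uint32_t sum_memcpy(const unsigned char *p, size_t n) {
 uint32_t s = 0;
 for (size_t i = 0; i + 1 < n; i += 2) {
 uint16_t v;
 memcpy(&v, p + i, sizeof(v));
 s += v;
 }
 return s;
 }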
msg256276 - (view) Author: Armin Rigo (arigo) * (Python committer) Date: 2015年12月12日 09:54
(Typo: "only 45% faster" should be "only 45% of the time")
msg256277 - (view) Author: Antoine Pitrou (pitrou) * (Python committer) Date: 2015年12月12日 11:14
Or just bite the bullet and make all opcodes 16-bit (and make sure the bytecode string is aligned ;-)).
Still, someone ought to run the benchmarks suite on this. Tuple-unpacking nano-benchmarks are not very enlightening ;-)
msg256283 - (view) Author: Mark Dickinson (mark.dickinson) * (Python committer) Date: 2015年12月12日 12:29
Raymond: I only used the word "dramatic" in the context of deliberately introducing new *undefined behaviour* into the core of CPython, which seems like something that should be avoided without a really good reason.
I have no objections to Serhiy's patch, which doesn't introduce undefined behaviour.
msg256338 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2015年12月13日 20:31
> I verified that Clang and GCC both give the expected disassembly with Serhiy's patch. We ought to restrict the #if to just the compilers that are known to optimize away the memcpy.
Yes, it makes sense.
Note: Python already has Py_MEMCPY() to work around slow memcpy() for small copies with Visual Studio.
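A guard along the lines Raymond suggests might look like this (hypothetical macro name and compiler test, not taken from any attached patch; read_u16 is the memcpy helper sketched earlier):
 /* Use the single 16-bit load only on little-endian machines and only
 with compilers known to optimize the memcpy idiom away; fall back
 to the portable byte-wise decode everywhere else. */
 #if PY_LITTLE_ENDIAN && (defined(__GNUC__) || defined(__clang__))
 #define OPARG_FETCH(p) read_u16(p) /* one movzwl after optimization */
 #else
 #define OPARG_FETCH(p) ((p)[0] | ((p)[1] << 8))
 #endif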
msg264221 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016年04月26日 06:55
If there are no objections, I'm inclined to push the patch in the hope that this will make the Wordcode patch (issue26647) simpler and more efficient (yes, this will add more work for Demur for synchronization).
msg264254 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016年04月26日 11:42
I dislike the usage of a union to fetch 16 bits only on little endian. I would prefer to modify PyCodeObject to ensure that co_code is aligned to 16 bits and to use a uint16_t* pointer in ceval.c. It would be simpler, no? In the worst case, we could overallocate 1 null byte in PyCodeObject to align the bytecode to 16 bits.
msg264255 - (view) Author: STINNER Victor (vstinner) * (Python committer) Date: 2016年04月26日 11:44
"I there are no objections I'm inclined to push the patch in hope that this will make the Wordcode patch (issue26647) simpler and more efficient (yes, this will add more work for Demur for synchronization)."
I would prefer to be kind with Demur and wait until his patch is merged (to not introduce conflicts). Wordcode change is quite big, whereas this one is short. Raymond already approved his patch on the python-dev mailing list, I waiting for the feedback of Yury and Serhiy until Sunday to push the wordcode change.
IMHO it will be simpler to optimize the (oparg, opvalue) fetch in ceval.c after wordcode will be merged.
What do you think?
msg264263 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016年04月26日 12:10
I agree that implementing a fast 16-bit fetch is simpler with wordcodes.
msg266376 - (view) Author: Serhiy Storchaka (serhiy.storchaka) * (Python committer) Date: 2016年05月25日 17:07
The corresponding patch was committed in issue27097.
History
Date User Action Args
2022年04月11日 14:58:24 admin set github: 70009
2016年05月25日 17:07:00 serhiy.storchaka set status: open -> closed; superseder: ceval: Wordcode follow up, explicit unsigned short read; messages: + msg266376; dependencies: - ceval: use Wordcode, 16-bit bytecode; resolution: out of date; stage: resolved
2016年04月26日 12:10:10 serhiy.storchaka set dependencies: + ceval: use Wordcode, 16-bit bytecode; messages: + msg264263
2016年04月26日 11:44:19 vstinner set messages: + msg264255
2016年04月26日 11:42:11 vstinner set messages: + msg264254
2016年04月26日 06:55:06 serhiy.storchaka set nosy: + Demur Rumed; messages: + msg264221
2015年12月13日 20:31:16 vstinner set messages: + msg256338
2015年12月12日 12:29:20 mark.dickinson set messages: + msg256283
2015年12月12日 11:14:23 pitrou set nosy: + pitrou; messages: + msg256277
2015年12月12日 09:54:59 arigo set messages: + msg256276
2015年12月12日 09:53:21 arigo set nosy: + arigo; messages: + msg256275
2015年12月11日 21:44:18 rhettinger set files: + exercise_oparg.py; messages: + msg256236
2015年12月09日 10:14:53 mark.dickinson set messages: + msg256152
2015年12月09日 10:11:52 mark.dickinson set messages: + msg256151
2015年12月09日 09:53:48 serhiy.storchaka set files: + improve_arg_decoding3.diff; messages: + msg256149
2015年12月09日 08:27:36 mark.dickinson set messages: + msg256146
2015年12月08日 18:08:47 rhettinger set nosy: + tim.peters
2015年12月08日 17:57:45 rhettinger set files: + improve_arg_decoding2.diff
2015年12月08日 16:22:45 skrah set nosy: + skrah; messages: + msg256120
2015年12月08日 15:58:51 mark.dickinson set messages: + msg256118
2015年12月08日 15:45:41 serhiy.storchaka set nosy: + mark.dickinson; messages: + msg256117
2015年12月08日 15:16:49 vstinner set nosy: + vstinner
2015年12月08日 15:14:11 rhettinger set files: + improve_arg_decoding.diff; keywords: + patch; messages: + msg256114
2015年12月08日 09:47:36 serhiy.storchaka set messages: + msg256108
2015年12月08日 09:34:11 rhettinger create
