Improve internal function argument parsing performance by reducing code bloat #18436
The goal of this PR is to reduce some of the code bloat induced by fast-ZPP. Reduced code bloat results in fewer cache misses (and better DSB coverage), and fewer instructions executed. This is an initial proposal; it may be possible to optimize this further.

If we take a look at a simple function:

```c
PHP_FUNCTION(twice)
{
	zend_long foo;

	ZEND_PARSE_PARAMETERS_START(1, 1)
		Z_PARAM_LONG(foo)
	ZEND_PARSE_PARAMETERS_END();

	RETURN_LONG(foo * 2);
}
```

We obtain the following assembly on x86-64 in the non-cold blocks:

```s
<+0>:    push   %r12
<+2>:    push   %rbp
<+3>:    push   %rbx
<+4>:    sub    0ドルx10,%rsp
<+8>:    mov    %fs:0x28,%r12
<+17>:   mov    %r12,0x8(%rsp)
<+22>:   mov    0x2c(%rdi),%r12d
<+26>:   cmp    0ドルx1,%r12d
<+30>:   jne    0x22beaf <zif_twice-3212257>
<+36>:   cmpb   0ドルx4,0x58(%rdi)
<+40>:   mov    %rsi,%rbp
<+43>:   jne    0x53c2f0 <zif_twice+96>
<+45>:   mov    0x50(%rdi),%rax
<+49>:   add    %rax,%rax
<+52>:   movl   0ドルx4,0x8(%rbp)
<+59>:   mov    %rax,0x0(%rbp)
<+63>:   mov    0x8(%rsp),%rax
<+68>:   sub    %fs:0x28,%rax
<+77>:   jne    0x53c312 <zif_twice+130>
<+79>:   add    0ドルx10,%rsp
<+83>:   pop    %rbx
<+84>:   pop    %rbp
<+85>:   pop    %r12
<+87>:   ret
<+88>:   nopl   0x0(%rax,%rax,1)
<+96>:   lea    0x50(%rdi),%rbx
<+100>:  mov    %rsp,%rsi
<+103>:  mov    0ドルx1,%edx
<+108>:  mov    %rbx,%rdi
<+111>:  call   0x620240 <zend_parse_arg_long_slow>
<+116>:  test   %al,%al
<+118>:  je     0x22be96 <zif_twice.cold>
<+124>:  mov    (%rsp),%rax
<+128>:  jmp    0x53c2c1 <zif_twice+49>
<+130>:  call   0x201050 <__stack_chk_fail@plt>
```

Notice how we get the stack protector overhead in this function and also have to reload the parsed value on the slow path. This happens because the parsed value is returned via a pointer. If instead we were to return a struct with a value pair (similar to optional in C++ / Option in Rust), then the values are returned via registers. This means that we no longer have stack protector overhead and we also don't need to reload a value, resulting in better register usage. This is the resulting assembly for the sample function after this patch:

```s
<+0>:   push   %r12
<+2>:   push   %rbp
<+3>:   push   %rbx
<+4>:   mov    0x2c(%rdi),%r12d
<+8>:   cmp    0ドルx1,%r12d
<+12>:  jne    0x22d482 <zif_twice-3205454>
<+18>:  cmpb   0ドルx4,0x58(%rdi)
<+22>:  mov    %rsi,%rbp
<+25>:  jne    0x53be08 <zif_twice+56>
<+27>:  mov    0x50(%rdi),%rax
<+31>:  add    %rax,%rax
<+34>:  movl   0ドルx4,0x8(%rbp)
<+41>:  mov    %rax,0x0(%rbp)
<+45>:  pop    %rbx
<+46>:  pop    %rbp
<+47>:  pop    %r12
<+49>:  ret
<+50>:  nopw   0x0(%rax,%rax,1)
<+56>:  lea    0x50(%rdi),%rbx
<+60>:  mov    0ドルx1,%esi
<+65>:  mov    %rbx,%rdi
<+68>:  call   0x61e7b0 <zend_parse_arg_long_slow>
<+73>:  test   %dl,%dl
<+75>:  je     0x22d46a <zif_twice.cold>
<+81>:  jmp    0x53bdef <zif_twice+31>
```

The following uses the default benchmark programs we use in CI. Each program is run on php-cgi with the appropriate `-T` argument, then repeated 15 times. Results were obtained on an i7-4790 with 32GiB RAM. The numbers show a small performance improvement on Symfony both with and without JIT, and a small improvement on WordPress with JIT. For WordPress, the difference is small as my CPU is bottlenecked on some other things as well.

| Test | Old Mean | Old Stddev | New Mean | New Stddev |
|---------------------------------|----------|------------|----------|------------|
| Symfony, no JIT (-T10,50) | 0.5324 | 0.0050 | 0.5272 | 0.0042 |
| Symfony, tracing JIT (-T10,50) | 0.5301 | 0.0029 | 0.5264 | 0.0036 |
| WordPress, no JIT (-T5,25) | 0.7408 | 0.0049 | 0.7404 | 0.0048 |
| WordPress, tracing JIT (-T5,25) | 0.6814 | 0.0052 | 0.6770 | 0.0055 |

I was not able to measure any meaningful difference for our micro benchmarks `Zend/bench.php` and `Zend/micro_bench.php`.
The Valgrind instruction counts also show a decrease: -0.19% on Symfony without JIT, and -0.14% on WordPress without JIT (see CI).
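To make the struct-return idea concrete, here is a minimal, self-contained sketch of the shape described above; the type and function names (`parse_long_result`, `parse_long_slow`) are hypothetical stand-ins, not the actual patch:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef int64_t zend_long; /* stand-in for the engine typedef */

/* Hypothetical value/flag pair, analogous to std::optional<zend_long>.
 * On x86-64 SysV this 16-byte struct is returned in rax:rdx, so the
 * caller no longer needs a stack slot (and thus no stack protector
 * instrumentation) for the parsed value. */
typedef struct {
    zend_long value;
    bool has_value;
} parse_long_result;

/* Toy slow path: parse a decimal string. This only stands in for the
 * real zend_parse_arg_long_slow() coercion logic. */
static parse_long_result parse_long_slow(const char *s)
{
    char *end;
    zend_long v = strtoll(s, &end, 10);
    if (end == s || *end != '\0') {
        return (parse_long_result){ 0, false };
    }
    return (parse_long_result){ v, true };
}

int main(void)
{
    parse_long_result r = parse_long_slow("21");
    if (r.has_value) {
        printf("%lld\n", (long long)(r.value * 2)); /* "twice" */
    }
    return 0;
}
```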
One potential optimization idea is to store a `bool is_empty` instead of `bool has_value`, because in that case we store a 0 instead of a 1 on the "success path". This may be a bit faster on x86, where setting the lower part of some registers also clears the upper part.
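Sketched against the hypothetical struct above, the flipped-polarity variant would look like this:

```c
/* Flipped-polarity variant of the hypothetical result struct: the
 * success path stores 0 rather than 1. On x86, materializing 0 can
 * use a compact `xor` that also clears the upper half of the
 * register, whereas storing 1 needs a `mov` with an immediate. */
typedef struct {
    zend_long value;
    bool is_empty; /* false (0) on success, true (1) on failure */
} parse_long_result_empty;
```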
When building on Arm, an uninitialized-variable warning was triggered at the following location. Although there doesn't seem to be any code path that actually uses the variable uninitialized, it might be a good idea to set an initial value just to silence the warning.
Line 5514 in c919ab4
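The suggested workaround would look roughly like this (a generic sketch reusing the names from the sketch above, not the actual code at the referenced line):

```c
/* Generic sketch of the suggested workaround: initialize the local up
 * front so the compiler's uninitialized-variable analysis is
 * satisfied, even though every path that reads it also writes it
 * first. */
static zend_long twice_checked(const char *s)
{
    zend_long val = 0; /* initializer only silences the warning */
    parse_long_result r = parse_long_slow(s);
    if (!r.has_value) {
        return -1; /* error path; val is not read here */
    }
    val = r.value;
    return val * 2;
}
```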
I also dumped the non-cold block assembly on Arm.
I already knew theoretically that it should have an effect on Arm as well, and it seems that is indeed the case.
before:

```s
 0: a9bc7bfd stp x29, x30, [sp, #-64]!
 4: 90000002 adrp x2, 0 <__stack_chk_guard>
 8: 910003fd mov x29, sp
 c: f9400042 ldr x2, [x2]
10: a90153f3 stp x19, x20, [sp, #16]
14: aa0103f4 mov x20, x1
18: f90013f5 str x21, [sp, #32]
1c: b9402c15 ldr w21, [x0, #44]
20: f9400041 ldr x1, [x2]
24: f9001fe1 str x1, [sp, #56]
28: d2800001 mov x1, #0x0 // #0
2c: 710006bf cmp w21, #0x1
30: 54000401 b.ne b0 <zif_twice+0xb0> // b.any
34: 39416001 ldrb w1, [x0, #88]
38: 7100103f cmp w1, #0x4
3c: 54000221 b.ne 80 <zif_twice+0x80> // b.any
40: f9402800 ldr x0, [x0, #80]
44: d37ff800 lsl x0, x0, #1
48: 52800081 mov w1, #0x4 // #4
4c: f9000280 str x0, [x20]
50: b9000a81 str w1, [x20, #8]
54: 90000000 adrp x0, 0 <__stack_chk_guard>
58: f9400000 ldr x0, [x0]
5c: f9401fe2 ldr x2, [sp, #56]
60: f9400001 ldr x1, [x0]
64: eb010042 subs x2, x2, x1
68: d2800001 mov x1, #0x0 // #0
6c: 540001c1 b.ne a4 <zif_twice+0xa4> // b.any
70: a94153f3 ldp x19, x20, [sp, #16]
74: f94013f5 ldr x21, [sp, #32]
78: a8c47bfd ldp x29, x30, [sp], #64
7c: d65f03c0 ret
80: 91014013 add x19, x0, #0x50
84: 2a1503e2 mov w2, w21
88: aa1303e0 mov x0, x19
8c: 9100c3e1 add x1, sp, #0x30
90: 94000000 bl 0 <zend_parse_arg_long_slow>
94: 52800122 mov w2, #0x9 // #9
98: 72001c1f tst w0, #0xff
9c: 54000061 b.ne a8 <zif_twice+0xa8> // b.any
a0: 14000000 b 0 <zif_twice>
a4: 94000000 bl 0 <__stack_chk_fail>
a8: f9401be0 ldr x0, [sp, #48]
ac: 17ffffe6 b 44 <zif_twice+0x44>
b0: 14000000 b 0 <zif_twice>
```
after:

```s
 0: a9bd7bfd stp x29, x30, [sp, #-48]!
 4: 910003fd mov x29, sp
 8: f90013f5 str x21, [sp, #32]
 c: b9402c15 ldr w21, [x0, #44]
10: a90153f3 stp x19, x20, [sp, #16]
14: 710006bf cmp w21, #0x1
18: 540003a1 b.ne 8c <zif_twice+0x8c> // b.any
1c: aa0103f4 mov x20, x1
20: 39416001 ldrb w1, [x0, #88]
24: 7100103f cmp w1, #0x4
28: 54000141 b.ne 50 <zif_twice+0x50> // b.any
2c: f9402800 ldr x0, [x0, #80]
30: d37ff800 lsl x0, x0, #1
34: 52800081 mov w1, #0x4 // #4
38: f94013f5 ldr x21, [sp, #32]
3c: f9000280 str x0, [x20]
40: b9000a81 str w1, [x20, #8]
44: a94153f3 ldp x19, x20, [sp, #16]
48: a8c37bfd ldp x29, x30, [sp], #48
4c: d65f03c0 ret
50: 91014013 add x19, x0, #0x50
54: 2a1503e1 mov w1, w21
58: aa1303e0 mov x0, x19
5c: 94000000 bl 0 <zend_parse_arg_long_slow>
60: 72001c3f tst w1, #0xff
64: 54fffe61 b.ne 30 <zif_twice+0x30> // b.any
68: 52800120 mov w0, #0x9 // #9
6c: aa1303e4 mov x4, x19
70: 2a1503e1 mov w1, w21
74: a94153f3 ldp x19, x20, [sp, #16]
78: 52800003 mov w3, #0x0 // #0
7c: f94013f5 ldr x21, [sp, #32]
80: d2800002 mov x2, #0x0 // #0
84: a8c37bfd ldp x29, x30, [sp], #48
88: 14000000 b 0 <zend_wrong_parameter_error>
8c: 14000000 b 0 <zif_twice>
```
Regarding `is_empty`: the performance should theoretically be the same on Arm, so I think it's fine to decide based on what's best for x86.
@SakiTakamachi Thanks. The warning you get is a false positive. Previously the compiler didn't warn because the pointer escaped; now there is no pointer escape anymore, and the compiler's warning analysis is not strong enough to avoid the false positive.
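The difference in warning behavior can be illustrated with a generic sketch (hypothetical helpers, not the engine code): passing `&local` to a callee makes the pointer escape, so the compiler conservatively assumes it was written; with a struct return there is no escape, and the compiler must prove initialization through flow analysis, which can produce false positives.

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t zend_long;
typedef struct { zend_long value; bool has_value; } parse_long_result;

/* Trivial stubs so the sketch compiles on its own. */
static bool parse_via_pointer(zend_long *out) { *out = 21; return true; }
static parse_long_result parse_via_struct(void) { return (parse_long_result){ 21, true }; }

zend_long demo(bool flag)
{
    zend_long a; /* &a escapes: the compiler assumes the callee wrote
                    it and stays quiet */
    if (parse_via_pointer(&a) && flag) {
        return a * 2;
    }

    zend_long b; /* never escapes: the compiler must prove that every
                    read of b is preceded by a write; weaker analyses
                    can report a false-positive "maybe uninitialized"
                    on patterns like this */
    parse_long_result r = parse_via_struct();
    if (r.has_value) {
        b = r.value;
    }
    return r.has_value ? b * 2 : 0;
}
```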
overall mstm but lgtm for pgsql at least
Returning structures is a not-well-defined, system-dependent C feature. On one system it may be implemented through a pair of registers, on others through a stack-allocated area. GCC has special flags that control this behavior (-fpcc-struct-return/-freg-struct-return), but they break the ABI.
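For instance (a sketch of the published ABI rules, not of this patch), the same two-member struct is returned differently on each major platform:

```c
#include <stdbool.h>
#include <stdint.h>

/* How this 16-byte pair travels back to the caller is fixed by the
 * platform ABI, not by the C language:
 *   - SysV x86-64 (Linux):  two INTEGER eightbytes -> rax:rdx
 *   - Windows x64:          larger than 8 bytes    -> hidden pointer
 *                           to caller-allocated memory (extra copies)
 *   - AAPCS64 (64-bit Arm): composites up to 16 B  -> x0:x1
 */
typedef struct {
    int64_t value;
    bool ok;
} pair16;

pair16 make_pair(int64_t v)
{
    return (pair16){ v, true };
}
```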
I thought about using this feature a long time ago (for general API optimization), but decided not to use it.
Please check the effect of the patch on x86 Linux, and on x86/x86_64 Windows.
Linux 32-bit x86: https://gist.github.com/nielsdos/9074649120436c5459ab33e131d023c4
Comparing the assembly, it looks like the new code still returns the data via the stack, and we still get the stack protector check. While the new code has one fewer instruction, it also uses more stack space.
Windows x64: https://gist.github.com/nielsdos/e3fc03963ab6e39076da0d4f79115d8e
The assembly is very similar, but it uses more stack space as the struct is returned via the stack, and unfortunately it needs some extra instructions to copy the data over.
Windows x86-32: https://gist.github.com/nielsdos/2091201b1255115f840aafd75872c3ac
The assembly is similar, but the new code now uses the values returned via registers instead of checking the boolean in the AL register and then reading the zend_long from a stack slot. The fast path is the same size.
So, very light improvement on x86 Linux and degradation on Windows (both 32 and 64-bits). Correct?
> So, very light improvement on x86 Linux and degradation on Windows (both 32 and 64-bits). Correct?
That's correct. The degradation I saw on 32-bit Windows is less severe than on 64-bit Windows.
I also tried wall-clock time micro benchmarks, but the execution time is too noisy to draw any conclusions.