Improve internal function argument parsing performance by reducing code bloat #18436


Open
nielsdos wants to merge 2 commits into php:master
base: master
from nielsdos:zend-parse-arg-improve-1

Conversation

Member

@nielsdos commented Apr 26, 2025

The goal of this PR is to reduce some of the code bloat induced by fast-ZPP. Reduced code bloat results in fewer cache misses (and better DSB coverage), and fewer instructions executed.
This is an initial proposal, it may be possible to optimize this further.

If we take a look at a simple function:

```c
PHP_FUNCTION(twice) {
	zend_long foo;
	ZEND_PARSE_PARAMETERS_START(1, 1)
		Z_PARAM_LONG(foo)
	ZEND_PARSE_PARAMETERS_END();
	RETURN_LONG(foo * 2);
}
```

We obtain the following assembly on x86-64 in the non-cold blocks:

```s
<+0>:	push %r12
<+2>:	push %rbp
<+3>:	push %rbx
<+4>:	sub $0x10,%rsp
<+8>:	mov %fs:0x28,%r12
<+17>:	mov %r12,0x8(%rsp)
<+22>:	mov 0x2c(%rdi),%r12d
<+26>:	cmp $0x1,%r12d
<+30>:	jne 0x22beaf <zif_twice-3212257>
<+36>:	cmpb $0x4,0x58(%rdi)
<+40>:	mov %rsi,%rbp
<+43>:	jne 0x53c2f0 <zif_twice+96>
<+45>:	mov 0x50(%rdi),%rax
<+49>:	add %rax,%rax
<+52>:	movl $0x4,0x8(%rbp)
<+59>:	mov %rax,0x0(%rbp)
<+63>:	mov 0x8(%rsp),%rax
<+68>:	sub %fs:0x28,%rax
<+77>:	jne 0x53c312 <zif_twice+130>
<+79>:	add $0x10,%rsp
<+83>:	pop %rbx
<+84>:	pop %rbp
<+85>:	pop %r12
<+87>:	ret
<+88>:	nopl 0x0(%rax,%rax,1)
<+96>:	lea 0x50(%rdi),%rbx
<+100>:	mov %rsp,%rsi
<+103>:	mov $0x1,%edx
<+108>:	mov %rbx,%rdi
<+111>:	call 0x620240 <zend_parse_arg_long_slow>
<+116>:	test %al,%al
<+118>:	je 0x22be96 <zif_twice.cold>
<+124>:	mov (%rsp),%rax
<+128>:	jmp 0x53c2c1 <zif_twice+49>
<+130>:	call 0x201050 <__stack_chk_fail@plt>
```

Notice how we get stack protector overhead in this function, and also have to reload the parsed value on the slow path. This happens because the parsed value is returned via a pointer. If instead we return a struct containing the value and a success flag (similar to `std::optional` in C++ or `Option` in Rust), the values are returned via registers. This means we no longer have stack protector overhead, and we also don't need to reload the value, resulting in better register usage. This is the resulting assembly for the sample function after this patch:

```s
<+0>:	push %r12
<+2>:	push %rbp
<+3>:	push %rbx
<+4>:	mov 0x2c(%rdi),%r12d
<+8>:	cmp $0x1,%r12d
<+12>:	jne 0x22d482 <zif_twice-3205454>
<+18>:	cmpb $0x4,0x58(%rdi)
<+22>:	mov %rsi,%rbp
<+25>:	jne 0x53be08 <zif_twice+56>
<+27>:	mov 0x50(%rdi),%rax
<+31>:	add %rax,%rax
<+34>:	movl $0x4,0x8(%rbp)
<+41>:	mov %rax,0x0(%rbp)
<+45>:	pop %rbx
<+46>:	pop %rbp
<+47>:	pop %r12
<+49>:	ret
<+50>:	nopw 0x0(%rax,%rax,1)
<+56>:	lea 0x50(%rdi),%rbx
<+60>:	mov $0x1,%esi
<+65>:	mov %rbx,%rdi
<+68>:	call 0x61e7b0 <zend_parse_arg_long_slow>
<+73>:	test %dl,%dl
<+75>:	je 0x22d46a <zif_twice.cold>
<+81>:	jmp 0x53bdef <zif_twice+31>
```

The following uses the default benchmark programs we use in CI. Each program is run on php-cgi with the appropriate `-T` argument, and each run is repeated 15 times. The results show a small performance improvement on Symfony both with and without JIT, and a small improvement on WordPress with JIT.
For WordPress the difference is small, as my CPU is also bottlenecked elsewhere.
Results were obtained on an i7-4790 with 32 GiB RAM.

| Test | Old Mean | Old Stddev | New Mean | New Stddev |
|---------------------------------|----------|------------|----------|------------|
| Symfony, no JIT (-T10,50) | 0.5324 | 0.0050 | 0.5272 | 0.0042 |
| Symfony, tracing JIT (-T10,50) | 0.5301 | 0.0029 | 0.5264 | 0.0036 |
| WordPress, no JIT (-T5,25) | 0.7408 | 0.0049 | 0.7404 | 0.0048 |
| WordPress, tracing JIT (-T5,25) | 0.6814 | 0.0052 | 0.6770 | 0.0055 |

I was not able to measure any meaningful difference in our micro benchmarks `Zend/bench.php` and `Zend/micro_bench.php`.

The Valgrind instruction counts also show a decrease: -0.19% on Symfony without JIT, and -0.14% on WordPress without JIT (see CI).

Member Author

One potential optimization idea is to store a bool `is_empty` instead of a bool `has_value`, because then the success path stores a 0 instead of a 1. This may be a bit faster on x86, where writing the lower part of some registers also clears the upper part.

@Girgias Girgias removed their request for review April 26, 2025 20:22
Member

When building on Arm, an uninitialized-variable warning was triggered at the following location.
Although there doesn't seem to be any code path that uses the value uninitialized, it might be a good idea to set an initial value just to silence the warning.

```c
N = (calc_sunset ? h_set : h_rise) + gmt_offset;
```

I also dumped the non-cold block assembly on Arm.
I already knew theoretically that it should have an effect on Arm as well, and it seems that is indeed the case.

before:

```s
 0: a9bc7bfd stp x29, x30, [sp, #-64]!
 4: 90000002 adrp x2, 0 <__stack_chk_guard>
 8: 910003fd mov x29, sp
 c: f9400042 ldr x2, [x2]
 10: a90153f3 stp x19, x20, [sp, #16]
 14: aa0103f4 mov x20, x1
 18: f90013f5 str x21, [sp, #32]
 1c: b9402c15 ldr w21, [x0, #44]
 20: f9400041 ldr x1, [x2]
 24: f9001fe1 str x1, [sp, #56]
 28: d2800001 mov x1, #0x0 // #0
 2c: 710006bf cmp w21, #0x1
 30: 54000401 b.ne b0 <zif_twice+0xb0> // b.any
 34: 39416001 ldrb w1, [x0, #88]
 38: 7100103f cmp w1, #0x4
 3c: 54000221 b.ne 80 <zif_twice+0x80> // b.any
 40: f9402800 ldr x0, [x0, #80]
 44: d37ff800 lsl x0, x0, #1
 48: 52800081 mov w1, #0x4 // #4
 4c: f9000280 str x0, [x20]
 50: b9000a81 str w1, [x20, #8]
 54: 90000000 adrp x0, 0 <__stack_chk_guard>
 58: f9400000 ldr x0, [x0]
 5c: f9401fe2 ldr x2, [sp, #56]
 60: f9400001 ldr x1, [x0]
 64: eb010042 subs x2, x2, x1
 68: d2800001 mov x1, #0x0 // #0
 6c: 540001c1 b.ne a4 <zif_twice+0xa4> // b.any
 70: a94153f3 ldp x19, x20, [sp, #16]
 74: f94013f5 ldr x21, [sp, #32]
 78: a8c47bfd ldp x29, x30, [sp], #64
 7c: d65f03c0 ret
 80: 91014013 add x19, x0, #0x50
 84: 2a1503e2 mov w2, w21
 88: aa1303e0 mov x0, x19
 8c: 9100c3e1 add x1, sp, #0x30
 90: 94000000 bl 0 <zend_parse_arg_long_slow>
 94: 52800122 mov w2, #0x9 // #9
 98: 72001c1f tst w0, #0xff
 9c: 54000061 b.ne a8 <zif_twice+0xa8> // b.any
 a0: 14000000 b 0 <zif_twice>
 a4: 94000000 bl 0 <__stack_chk_fail>
 a8: f9401be0 ldr x0, [sp, #48]
 ac: 17ffffe6 b 44 <zif_twice+0x44>
 b0: 14000000 b 0 <zif_twice>
```

after:

```s
 0: a9bd7bfd stp x29, x30, [sp, #-48]!
 4: 910003fd mov x29, sp
 8: f90013f5 str x21, [sp, #32]
 c: b9402c15 ldr w21, [x0, #44]
 10: a90153f3 stp x19, x20, [sp, #16]
 14: 710006bf cmp w21, #0x1
 18: 540003a1 b.ne 8c <zif_twice+0x8c> // b.any
 1c: aa0103f4 mov x20, x1
 20: 39416001 ldrb w1, [x0, #88]
 24: 7100103f cmp w1, #0x4
 28: 54000141 b.ne 50 <zif_twice+0x50> // b.any
 2c: f9402800 ldr x0, [x0, #80]
 30: d37ff800 lsl x0, x0, #1
 34: 52800081 mov w1, #0x4 // #4
 38: f94013f5 ldr x21, [sp, #32]
 3c: f9000280 str x0, [x20]
 40: b9000a81 str w1, [x20, #8]
 44: a94153f3 ldp x19, x20, [sp, #16]
 48: a8c37bfd ldp x29, x30, [sp], #48
 4c: d65f03c0 ret
 50: 91014013 add x19, x0, #0x50
 54: 2a1503e1 mov w1, w21
 58: aa1303e0 mov x0, x19
 5c: 94000000 bl 0 <zend_parse_arg_long_slow>
 60: 72001c3f tst w1, #0xff
 64: 54fffe61 b.ne 30 <zif_twice+0x30> // b.any
 68: 52800120 mov w0, #0x9 // #9
 6c: aa1303e4 mov x4, x19
 70: 2a1503e1 mov w1, w21
 74: a94153f3 ldp x19, x20, [sp, #16]
 78: 52800003 mov w3, #0x0 // #0
 7c: f94013f5 ldr x21, [sp, #32]
 80: d2800002 mov x2, #0x0 // #0
 84: a8c37bfd ldp x29, x30, [sp], #48
 88: 14000000 b 0 <zend_wrong_parameter_error>
 8c: 14000000 b 0 <zif_twice>
```

Regarding is_empty, the performance should theoretically be the same on Arm, so I think it’s fine to decide based on what’s best for x86.

Member Author

@SakiTakamachi Thanks. The warning you get is a false positive. Previously the compiler didn't warn because the pointer escaped, but now the pointer no longer escapes, and the compiler's warning analysis is not strong enough to see that the value is always initialized before use.


Member

@devnexen left a comment

overall mstm but lgtm for pgsql at least

Member

@dstogov left a comment

Returning structures is a not-well-defined, system-dependent C feature. On some systems it may be implemented through a pair of registers, on others through a stack-allocated area. GCC has special flags that control this behavior (`-fpcc-struct-return`/`-freg-struct-return`), but they break the ABI.

I thought about this feature a long time ago (for general API optimization), but decided not to use it.

Please check the effect of the patch on x86 Linux and on x86/x86_64 Windows.

Member Author

nielsdos commented May 4, 2025

Linux 32-bit x86: https://gist.github.com/nielsdos/9074649120436c5459ab33e131d023c4
When comparing the assembly, it looks like the new code still returns the data via the stack, and we still get the stack protector check. While the new code has one instruction fewer, it also uses more stack space.

Windows x64: https://gist.github.com/nielsdos/e3fc03963ab6e39076da0d4f79115d8e
Assembly is very similar, but it uses more stack space, as the struct is returned via the stack and unfortunately needs some extra instructions to copy the data over.

Windows x86-32: https://gist.github.com/nielsdos/2091201b1255115f840aafd75872c3ac
Assembly is similar, but the new code now uses the values returned via registers instead of checking the boolean in the AL register and then reading the zend_long from a stack slot. The fast path is the same size.

Member

dstogov commented May 5, 2025

So, very light improvement on x86 Linux and degradation on Windows (both 32- and 64-bit). Correct?

Member Author

nielsdos commented May 5, 2025

> So, very light improvement on x86 Linux and degradation on Windows (both 32- and 64-bit). Correct?

That's correct. The degradation I saw on 32-bit Windows is less severe than on 64-bit Windows.
I also tried wall-clock-time micro benchmarks, but the execution time is too noisy to draw any conclusions.

Girgias added a commit to Girgias/php-src that referenced this pull request Aug 4, 2025

This idea is based on @nielsdos's previous PR to reduce code bloat: php#18436
To do so, we create a specialized function to handle parameters that only accept strings and not null, such that we can assign the zend_string to the destination directly.
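That specialized string-parameter idea could look roughly like the sketch below. The tagged `val` type and the `parse_str_arg` function are hypothetical stand-ins (not the real zval/zend_string API): returning the string pointer directly, with NULL signaling a type error, lets the caller assign the result to its destination in one step instead of going through a bool return plus an out-parameter.

```c
#include <stddef.h>

/* Stand-in tagged value, loosely modelled on a zval (not the real type). */
typedef enum { T_NULL, T_LONG, T_STR } vtype;

typedef struct {
	vtype type;
	const char *str; /* valid only when type == T_STR */
} val;

/* Specialized parser for string-only, non-nullable parameters:
 * returns the string directly, or NULL on type mismatch. */
static const char *parse_str_arg(const val *arg) {
	return arg->type == T_STR ? arg->str : NULL;
}
```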
Reviewers

@dstogov left review comments
@devnexen approved these changes
@SakiTakamachi approved these changes
Awaiting requested review from @bukka (code owner), @alexdowad (code owner), @youkidearitai (code owner), @arnaud-lb, @iluuu1994, and @derickr (code owner)
