Improve internal function argument parsing performance by reducing code bloat #18436
The goal of this PR is to reduce some of the code bloat induced by fast-ZPP. Reduced code bloat results in fewer cache misses (and better DSB coverage), and fewer instructions executed. This is an initial proposal; it may be possible to optimize this further.

If we take a look at a simple function:

```c
PHP_FUNCTION(twice)
{
	zend_long foo;

	ZEND_PARSE_PARAMETERS_START(1, 1)
		Z_PARAM_LONG(foo)
	ZEND_PARSE_PARAMETERS_END();

	RETURN_LONG(foo * 2);
}
```

We obtain the following assembly on x86-64 in the non-cold blocks:

```s
<+0>:    push   %r12
<+2>:    push   %rbp
<+3>:    push   %rbx
<+4>:    sub    0ドルx10,%rsp
<+8>:    mov    %fs:0x28,%r12
<+17>:   mov    %r12,0x8(%rsp)
<+22>:   mov    0x2c(%rdi),%r12d
<+26>:   cmp    0ドルx1,%r12d
<+30>:   jne    0x22beaf <zif_twice-3212257>
<+36>:   cmpb   0ドルx4,0x58(%rdi)
<+40>:   mov    %rsi,%rbp
<+43>:   jne    0x53c2f0 <zif_twice+96>
<+45>:   mov    0x50(%rdi),%rax
<+49>:   add    %rax,%rax
<+52>:   movl   0ドルx4,0x8(%rbp)
<+59>:   mov    %rax,0x0(%rbp)
<+63>:   mov    0x8(%rsp),%rax
<+68>:   sub    %fs:0x28,%rax
<+77>:   jne    0x53c312 <zif_twice+130>
<+79>:   add    0ドルx10,%rsp
<+83>:   pop    %rbx
<+84>:   pop    %rbp
<+85>:   pop    %r12
<+87>:   ret
<+88>:   nopl   0x0(%rax,%rax,1)
<+96>:   lea    0x50(%rdi),%rbx
<+100>:  mov    %rsp,%rsi
<+103>:  mov    0ドルx1,%edx
<+108>:  mov    %rbx,%rdi
<+111>:  call   0x620240 <zend_parse_arg_long_slow>
<+116>:  test   %al,%al
<+118>:  je     0x22be96 <zif_twice.cold>
<+124>:  mov    (%rsp),%rax
<+128>:  jmp    0x53c2c1 <zif_twice+49>
<+130>:  call   0x201050 <__stack_chk_fail@plt>
```

Notice how we get the stack protector overhead in this function and also have to reload the parsed value on the slow path. This happens because the parsed value is returned via a pointer. If instead we were to return a struct with a value pair (similar to optional in C++ / Option in Rust), then the values are returned via registers. This means that we no longer have stack protector overhead and we also don't need to reload a value, resulting in better register usage. This is the resulting assembly for the sample function after this patch:

```s
<+0>:   push   %r12
<+2>:   push   %rbp
<+3>:   push   %rbx
<+4>:   mov    0x2c(%rdi),%r12d
<+8>:   cmp    0ドルx1,%r12d
<+12>:  jne    0x22d482 <zif_twice-3205454>
<+18>:  cmpb   0ドルx4,0x58(%rdi)
<+22>:  mov    %rsi,%rbp
<+25>:  jne    0x53be08 <zif_twice+56>
<+27>:  mov    0x50(%rdi),%rax
<+31>:  add    %rax,%rax
<+34>:  movl   0ドルx4,0x8(%rbp)
<+41>:  mov    %rax,0x0(%rbp)
<+45>:  pop    %rbx
<+46>:  pop    %rbp
<+47>:  pop    %r12
<+49>:  ret
<+50>:  nopw   0x0(%rax,%rax,1)
<+56>:  lea    0x50(%rdi),%rbx
<+60>:  mov    0ドルx1,%esi
<+65>:  mov    %rbx,%rdi
<+68>:  call   0x61e7b0 <zend_parse_arg_long_slow>
<+73>:  test   %dl,%dl
<+75>:  je     0x22d46a <zif_twice.cold>
<+81>:  jmp    0x53bdef <zif_twice+31>
```

The following uses the default benchmark programs we use in CI. Each program is run on php-cgi with the appropriate `-T` argument, then repeated 15 times. Results were obtained on an i7-4790 with 32GiB RAM. The numbers show a small performance improvement on Symfony both with and without JIT, and a small improvement on WordPress with JIT. For WordPress, the difference is small as my CPU is bottlenecked on some other things as well.

| Test | Old Mean | Old Stddev | New Mean | New Stddev |
|---------------------------------|----------|------------|----------|------------|
| Symfony, no JIT (-T10,50) | 0.5324 | 0.0050 | 0.5272 | 0.0042 |
| Symfony, tracing JIT (-T10,50) | 0.5301 | 0.0029 | 0.5264 | 0.0036 |
| WordPress, no JIT (-T5,25) | 0.7408 | 0.0049 | 0.7404 | 0.0048 |
| WordPress, tracing JIT (-T5,25) | 0.6814 | 0.0052 | 0.6770 | 0.0055 |

I was not able to measure any meaningful difference for our micro benchmarks `Zend/bench.php` and `Zend/micro_bench.php`.
The Valgrind instruction counts also show a decrease: -0.19% on Symfony without JIT, and -0.14% on WordPress without JIT (see CI).
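To make the struct-return idea concrete, here is a minimal, self-contained sketch of the shape described above; the type and function names (`parse_long_result`, `parse_long_slow`) are hypothetical stand-ins, not the actual patch:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef int64_t zend_long; /* stand-in for the engine typedef */

/* Hypothetical value/flag pair, analogous to std::optional<zend_long>.
 * On x86-64 SysV this 16-byte struct is returned in rax:rdx, so the
 * caller no longer needs a stack slot (and thus no stack protector
 * instrumentation) for the parsed value. */
typedef struct {
    zend_long value;
    bool has_value;
} parse_long_result;

/* Toy slow path: parse a decimal string. This only stands in for the
 * real zend_parse_arg_long_slow() coercion logic. */
static parse_long_result parse_long_slow(const char *s)
{
    char *end;
    zend_long v = strtoll(s, &end, 10);
    if (end == s || *end != '\0') {
        return (parse_long_result){ 0, false };
    }
    return (parse_long_result){ v, true };
}

int main(void)
{
    parse_long_result r = parse_long_slow("21");
    if (r.has_value) {
        printf("%lld\n", (long long)(r.value * 2)); /* "twice" */
    }
    return 0;
}
```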
One potential optimization idea is to store a `bool is_empty` instead of `bool has_value`, because in that case we store a 0 instead of a 1 on the "success path". This may be a bit faster on x86, where setting the lower part of some registers also clears the upper part.
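Sketched against the hypothetical struct above, the flipped-polarity variant would look like this:

```c
/* Flipped-polarity variant of the hypothetical result struct: the
 * success path stores 0 rather than 1. On x86, materializing 0 can
 * use a compact `xor` that also clears the upper half of the
 * register, whereas storing 1 needs a `mov` with an immediate. */
typedef struct {
    zend_long value;
    bool is_empty; /* false (0) on success, true (1) on failure */
} parse_long_result_empty;
```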
When building on Arm, an uninitialized-variable warning was triggered at the following location. Although there doesn't seem to be any code path that actually uses the variable uninitialized, it might be a good idea to set an initial value just to silence the warning.
Line 5514 in c919ab4
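The suggested workaround would look roughly like this (a generic sketch reusing the names from the sketch above, not the actual code at the referenced line):

```c
/* Generic sketch of the suggested workaround: initialize the local up
 * front so the compiler's uninitialized-variable analysis is
 * satisfied, even though every path that reads it also writes it
 * first. */
static zend_long twice_checked(const char *s)
{
    zend_long val = 0; /* initializer only silences the warning */
    parse_long_result r = parse_long_slow(s);
    if (!r.has_value) {
        return -1; /* error path; val is not read here */
    }
    val = r.value;
    return val * 2;
}
```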
I also dumped the non-cold block assembly on Arm.
I already knew theoretically that it should have an effect on Arm as well, and it seems that is indeed the case.
before:

```s
 0: a9bc7bfd stp x29, x30, [sp, #-64]!
 4: 90000002 adrp x2, 0 <__stack_chk_guard>
 8: 910003fd mov x29, sp
 c: f9400042 ldr x2, [x2]
10: a90153f3 stp x19, x20, [sp, #16]
14: aa0103f4 mov x20, x1
18: f90013f5 str x21, [sp, #32]
1c: b9402c15 ldr w21, [x0, #44]
20: f9400041 ldr x1, [x2]
24: f9001fe1 str x1, [sp, #56]
28: d2800001 mov x1, #0x0 // #0
2c: 710006bf cmp w21, #0x1
30: 54000401 b.ne b0 <zif_twice+0xb0> // b.any
34: 39416001 ldrb w1, [x0, #88]
38: 7100103f cmp w1, #0x4
3c: 54000221 b.ne 80 <zif_twice+0x80> // b.any
40: f9402800 ldr x0, [x0, #80]
44: d37ff800 lsl x0, x0, #1
48: 52800081 mov w1, #0x4 // #4
4c: f9000280 str x0, [x20]
50: b9000a81 str w1, [x20, #8]
54: 90000000 adrp x0, 0 <__stack_chk_guard>
58: f9400000 ldr x0, [x0]
5c: f9401fe2 ldr x2, [sp, #56]
60: f9400001 ldr x1, [x0]
64: eb010042 subs x2, x2, x1
68: d2800001 mov x1, #0x0 // #0
6c: 540001c1 b.ne a4 <zif_twice+0xa4> // b.any
70: a94153f3 ldp x19, x20, [sp, #16]
74: f94013f5 ldr x21, [sp, #32]
78: a8c47bfd ldp x29, x30, [sp], #64
7c: d65f03c0 ret
80: 91014013 add x19, x0, #0x50
84: 2a1503e2 mov w2, w21
88: aa1303e0 mov x0, x19
8c: 9100c3e1 add x1, sp, #0x30
90: 94000000 bl 0 <zend_parse_arg_long_slow>
94: 52800122 mov w2, #0x9 // #9
98: 72001c1f tst w0, #0xff
9c: 54000061 b.ne a8 <zif_twice+0xa8> // b.any
a0: 14000000 b 0 <zif_twice>
a4: 94000000 bl 0 <__stack_chk_fail>
a8: f9401be0 ldr x0, [sp, #48]
ac: 17ffffe6 b 44 <zif_twice+0x44>
b0: 14000000 b 0 <zif_twice>
```
after:

```s
 0: a9bd7bfd stp x29, x30, [sp, #-48]!
 4: 910003fd mov x29, sp
 8: f90013f5 str x21, [sp, #32]
 c: b9402c15 ldr w21, [x0, #44]
10: a90153f3 stp x19, x20, [sp, #16]
14: 710006bf cmp w21, #0x1
18: 540003a1 b.ne 8c <zif_twice+0x8c> // b.any
1c: aa0103f4 mov x20, x1
20: 39416001 ldrb w1, [x0, #88]
24: 7100103f cmp w1, #0x4
28: 54000141 b.ne 50 <zif_twice+0x50> // b.any
2c: f9402800 ldr x0, [x0, #80]
30: d37ff800 lsl x0, x0, #1
34: 52800081 mov w1, #0x4 // #4
38: f94013f5 ldr x21, [sp, #32]
3c: f9000280 str x0, [x20]
40: b9000a81 str w1, [x20, #8]
44: a94153f3 ldp x19, x20, [sp, #16]
48: a8c37bfd ldp x29, x30, [sp], #48
4c: d65f03c0 ret
50: 91014013 add x19, x0, #0x50
54: 2a1503e1 mov w1, w21
58: aa1303e0 mov x0, x19
5c: 94000000 bl 0 <zend_parse_arg_long_slow>
60: 72001c3f tst w1, #0xff
64: 54fffe61 b.ne 30 <zif_twice+0x30> // b.any
68: 52800120 mov w0, #0x9 // #9
6c: aa1303e4 mov x4, x19
70: 2a1503e1 mov w1, w21
74: a94153f3 ldp x19, x20, [sp, #16]
78: 52800003 mov w3, #0x0 // #0
7c: f94013f5 ldr x21, [sp, #32]
80: d2800002 mov x2, #0x0 // #0
84: a8c37bfd ldp x29, x30, [sp], #48
88: 14000000 b 0 <zend_wrong_parameter_error>
8c: 14000000 b 0 <zif_twice>
```
Regarding `is_empty`: the performance should theoretically be the same on Arm, so I think it's fine to decide based on what's best for x86.
@SakiTakamachi Thanks. The warning you get is a false positive. Previously the compiler didn't warn because the pointer escaped; now there is no pointer escape anymore, and the compiler's warning analysis is not strong enough to avoid the false positive.
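The difference in warning behavior can be illustrated with a generic sketch (hypothetical helpers, not the engine code): passing `&local` to a callee makes the pointer escape, so the compiler conservatively assumes it was written; with a struct return there is no escape, and the compiler must prove initialization through flow analysis, which can produce false positives.

```c
#include <stdbool.h>
#include <stdint.h>

typedef int64_t zend_long;
typedef struct { zend_long value; bool has_value; } parse_long_result;

/* Trivial stubs so the sketch compiles on its own. */
static bool parse_via_pointer(zend_long *out) { *out = 21; return true; }
static parse_long_result parse_via_struct(void) { return (parse_long_result){ 21, true }; }

zend_long demo(bool flag)
{
    zend_long a; /* &a escapes: the compiler assumes the callee wrote
                    it and stays quiet */
    if (parse_via_pointer(&a) && flag) {
        return a * 2;
    }

    zend_long b; /* never escapes: the compiler must prove that every
                    read of b is preceded by a write; weaker analyses
                    can report a false-positive "maybe uninitialized"
                    on patterns like this */
    parse_long_result r = parse_via_struct();
    if (r.has_value) {
        b = r.value;
    }
    return r.has_value ? b * 2 : 0;
}
```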
overall mstm but lgtm for pgsql at least
Returning structures is a not-well-defined, system-dependent C feature. On one system it may be implemented through a pair of registers, on others through a stack-allocated area. GCC has special flags that control this behavior (-fpcc-struct-return/-freg-struct-return), but they break the ABI.
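For instance (a sketch of the published ABI rules, not of this patch), the same two-member struct is returned differently on each major platform:

```c
#include <stdbool.h>
#include <stdint.h>

/* How this 16-byte pair travels back to the caller is fixed by the
 * platform ABI, not by the C language:
 *   - SysV x86-64 (Linux):  two INTEGER eightbytes -> rax:rdx
 *   - Windows x64:          larger than 8 bytes    -> hidden pointer
 *                           to caller-allocated memory (extra copies)
 *   - AAPCS64 (64-bit Arm): composites up to 16 B  -> x0:x1
 */
typedef struct {
    int64_t value;
    bool ok;
} pair16;

pair16 make_pair(int64_t v)
{
    return (pair16){ v, true };
}
```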
I thought about using this feature a long time ago (for general API optimization), but decided not to use it.
Please check the effect of the patch on x86 Linux, and on x86/x86_64 Windows.
Linux 32-bit x86: https://gist.github.com/nielsdos/9074649120436c5459ab33e131d023c4
Comparing the assembly, it looks like the new code still returns the data via the stack, and we still get the stack protector check. While the new code has one fewer instruction, it also uses more stack space.
Windows x64: https://gist.github.com/nielsdos/e3fc03963ab6e39076da0d4f79115d8e
The assembly is very similar, but it uses more stack space as the struct is returned via the stack, and unfortunately it needs some extra instructions to copy the data over.
Windows x86-32: https://gist.github.com/nielsdos/2091201b1255115f840aafd75872c3ac
The assembly is similar, but the new code now uses the values returned via registers instead of checking the boolean in the AL register and then reading the zend_long from a stack slot. The fast path is the same size.
So, very light improvement on x86 Linux and degradation on Windows (both 32 and 64-bits). Correct?
> So, very light improvement on x86 Linux and degradation on Windows (both 32 and 64-bits). Correct?
That's correct. The degradation I saw on 32-bit Windows is less severe than on 64-bit Windows.
I also tried wall-clock time micro benchmarks, but the execution time is too noisy to draw any conclusions.