Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

UTF-8 validate strings before interning #10870

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
iluuu1994 wants to merge 1 commit into php:master
base: master
Choose a base branch
Loading
from iluuu1994:interned-strings-validate-utf8

Conversation

Copy link
Member

@iluuu1994 iluuu1994 commented Mar 17, 2023

Fixes GH-10853

@alexdowad Sorry, I've only realized that you mentioned in #10409 that you'd like to work on this once I was already done.

I see it was also suggested that atomics could be used. I don't know C11 atomics at all but from what I've read it seems the type_info would have to be made atomic which is probably not a good idea as it's not usually necessary. But please correct me if I'm wrong. I'd also prefer if we didn't need to check all strings when interning, as it is only used in few cases atm and mbstrings implementation of UTF-8 validate is much more optimized.

1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // s7..s8
};

ZEND_API bool zend_string_validate_utf8(zend_string *string) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

To fix #10853, a new flag should be added if the UTF-8 validity was checked or not, otherwise invalid UTF-8 string will need to be check on each call.

@iluuu1994 iluuu1994 force-pushed the interned-strings-validate-utf8 branch from 439fa8a to 8f9322b Compare March 17, 2023 17:36
Copy link
Contributor

This looks great to me. Thanks.

The table-driven state machine for UTF-8 validity checking is very interesting. I'm interested to know how it compares when benchmarked against the 'fallback' UTF-8 validity checking implementation in mbstring (which was borrowed from PCRE).

Copy link
Contributor

To fix #10853, a new flag should be added if the UTF-8 validity was checked or not, otherwise invalid UTF-8 string will need to be check on each call.

👍

It looks like availability of bits in the object header should not be a problem... type_info, where the GC flags are kept, is 32 bits wide, and it looks like there are currently only 9 flags for zend_string objects.

Copy link
Member

@Girgias Girgias left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nits, but LGTM.

Might make sense to pull the MBString UTF-8 checking code as it may be faster, but I'll let @alexdowad benchmark the different implementation :-)

Comment on lines +536 to +537
if (state == UTF8_REJECT)
break;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CS Nit:

Suggested change
if (state == UTF8_REJECT)
break;
if (state == UTF8_REJECT) {
break;
}

1,3,1,1,1,1,1,3,1,3,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1, // s7..s8
};

ZEND_API bool zend_string_validate_utf8(zend_string *string) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't seem to modify the pointer

Suggested change
ZEND_API bool zend_string_validate_utf8(zend_string *string) {
ZEND_API bool zend_string_validate_utf8(constzend_string *string) {

Copy link
Contributor

I wrote a test script https://gist.github.com/youkidearitai/566e348e2e23301063ef5a95579d4efd(I put to ext/mbstring/tests) that this PR works fine. Would you like add this .phpt file?

iluuu1994 and Girgias reacted with thumbs up emoji

Copy link
Member Author

@alexdowad Did you have time benchmarking the two implementations? If either implementation is faster it might make sense to move that one to core and use it from mbstring. I'll also check whether we can see a performance hit during opcache persisting.

Comment on lines +501 to +503
// Copyright (c) 2008-2009 Bjoern Hoehrmann <bjoern@hoehrmann.de>
// See http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ for details.
// https://stackoverflow.com/a/22135005/1320374
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is under BSD licence so it should be pointed here as well or ideally new header created so it doesn't mix licenses in the single file. I have been actually using this in jsond for some time and I separated the headers for that.

Copy link
Contributor

@alexdowad Did you have time benchmarking the two implementations?

Hi @iluuu1994. Sorry for the delay. I just put together a lil' benchmark program which runs both functions on a bunch of random strings and compares how long they take.

Using gcc with no optimization, the function from PCRE is somewhat faster. With gcc -O3... Bjoern's function is so ridiculously fast that it's hard to believe. I am just investigating.

Copy link
Contributor

Looks like compiler must have been optimizing calls to Bjoern's function out because the result was not used.

After making sure the result is used, the PCRE function is consistently faster.

iluuu1994 reacted with thumbs up emoji

Copy link
Contributor

Looks like @iluuu1994 would do better to use mb_fast_check_utf8_default from mbstring, or mb_fast_check_utf8_avx2 in cases where it can be used.

We already have a mechanism whereby Zend core calls into some mbstring function through function pointers which are set via zend_multibyte_set_functions... but I guess you won't want to use that for this purpose, because then you could only use the UTF-8 checking functions if mbstring extension is present.

I think you should go ahead and move both mb_fast_check_utf8_{default,avx2} into core, rename them (perhaps change mb_ to zend_), and expose them to callers in extension modules.

iluuu1994 and Girgias reacted with thumbs up emoji

Copy link
Contributor

Maybe also rename _fast_check_ to just _check_.

Copy link
Member Author

Thank you @alexdowad for your investigation! Moving the functions into core sounds good to me. I'll do that and check if there are any performance regressions for startup/compilation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Reviewers

@bukka bukka bukka left review comments

@Girgias Girgias Girgias approved these changes

@alexdowad alexdowad Awaiting requested review from alexdowad alexdowad is a code owner

@dstogov dstogov Awaiting requested review from dstogov dstogov will be requested when the pull request is marked ready for review dstogov is a code owner

@youkidearitai youkidearitai Awaiting requested review from youkidearitai youkidearitai will be requested when the pull request is marked ready for review youkidearitai is a code owner

+1 more reviewer

@mvorisek mvorisek mvorisek left review comments

Reviewers whose approvals may not affect merge requirements
Assignees
No one assigned
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

Check UTF-8 validity for all constant strings on compile time

AltStyle によって変換されたページ (->オリジナル) /