Checking whether a string fragment could be part of a longer UTF-8 string

Question 1

Although UTF-8 validation is a common task, I'm trying to solve a slightly different task; given a string of bytes, work out whether it could potentially be a fragment of a valid UTF-8 string. That's a much less common task, so I thought it'd be easiest to write the code myself. Here's the Perl subroutine I came up with:

# Takes one argument, a string with codepoints in the range 0-255
# (representing the bytes of input); returns truthy if and only if some
# valid UTF-8 string has that string as a substring.
sub could_be_substring_of_valid_utf8 {
 # We're using regexes on bytes (encoded as codepoints), so
 # codepoints don't have their normal Unicode meaning. Tell Perl
 # that ASCII = ASCII (it does in UTF-8), but newlines and
 # codepoints from 128 up are not special.
 use re '/aas';
 my $substring = shift;
 my $cb = qr/[\x80-\xbf]/; # continuation byte
 my $ncb = qr/[^\x80-\xbf]/; # not a continuation byte
 my $lb1 = qr/[\xc0-\xdf]/; # leading byte for 1 continuation byte
 my $lb2 = qr/[\xe0-\xef]/; # leading byte for 2 continuation bytes
 my $lb3 = qr/[\xf0-\xf7]/; # leading byte for 3 continuation bytes
 for ($substring) {
 # Bytes F8 to FF are never legal in UTF-8 (they'd be leading
 # bytes for 4 or more continuation bytes, never necessary to
 # represent something within the Unicode range). Bytes F5-F7
 # are likewise illegal, because any string starting 0xF5 must
 # represent a codepoint of at least U+140000, higher than the
 # maximum codepoint of U+10FFFF.
 return if /[\xf5-\xff]/;
 # Likewise, if we see 4 or more continuation bytes in a row,
 # that's also obviously illegal even if we can't see the
 # leading byte.
 return if /$cb{4}/;
 # A string of continuation bytes cannot appear without a leading
 # byte to its left, so each individual continuation byte requires
 # a leading or continuation byte to its left.
 return if /[\x00-\x7f]$cb/;
 # A string of leading bytes requires exactly that many continuation
 # bytes after it.
 return if /($lb1|$lb2|$lb3)$ncb/; # missing first continuation byte
 return if /($lb2|$lb3).$ncb/; # missing second continuation byte
 return if /$lb3..$ncb/; # missing third continuation byte
 return if /$lb1$cb$cb/; # too many continuation bytes
 return if /$lb2$cb$cb$cb/; # too many continuation bytes
 return if /$lb3$cb$cb$cb$cb/; # too many continuation bytes
 # Codepoints U+D800 to U+DFFF must not be encoded in UTF-8 (as
 # they're used only as an internal part of UTF-16); ban UTF-8
 # sequences that attempt to do that. The encoding of U+D???
 # always starts with 0xED, and the MSB of the next nybble
 # appears in the 0x20s place of the first continuation byte,
 # which is enough information to recognise one even from a
 # partial encoding.
 return if /\xed[\xa0-\xbf]/; # encodes a surrogate
 # Codepoints U+FFFE and U+FFFF must not be encoded in UTF-8.
 # Their encodings are 0xEF 0xBF 0xBE and 0xEF 0xBF 0xBF.
 return if /\xef\xbf[\xbe\xbf]/;
 # Codepoints above U+10FFFF must not be encoded in UTF-8.
 # Encodings starting 0xF4 0x9? encode codepoints of the form
 # U+11????, and are therefore illegal; likewise, 0xF4 0xA?
 # and 0xF4 0xB? encode codepoints of the form U+12???? and
 # U+13???? respectively and are illegal. (Codepoints from
 # U+140000 upwards have been excluded already by the check
 # on illegal bytes.)
 return if /\xf4[\x90-\xbf]/;
 }
 1;
}

As is recommended nowadays for handling raw bytes in Perl, the strings of bytes are represented as strings where each codepoint in the string represents the byte with the corresponding number (e.g. the byte 0xA0 is encoded as "\x{A0}"); this means that you don't have to worry about how the string is represented internally within Perl (as it happens, it has an optimised internal encoding for strings with codepoints in the 0–255 range, but this code doesn't care about whether it's in use or not).

I believe this subroutine is working, but it's fairly complex and I find it hard to be confident that the code is written as well as it could be. My primary concern is code clarity; anything that would make it more obvious that the code is correct would be useful (especially improvements to things like comments and identifiers). I also care about portability (avoiding experimental features or features only present in newer Perls unless they would be make the code more correct, or particularly faster or clearer), which is why I've used for to set a local binding to $_ rather than the clearer given; the code should currently work in Perl 5.14 or later, although I haven't tested it on old versions.

I initially wrote this without worrying about performance. It isn't a major concern yet, but may become so in future, so I'd take any tips for improving performance as long as they don't make the code significantly harder to read. (In particular, I suspect that combining everything into one regex might be more efficient; is it possible to do that while keeping things readable?)

Question 2

You should also post your unit tests for this function, just for completeness. This might give some hints on edge cases you might have forgotten to test.

Question 3

I would have chosen a different approach: take a DFA that accepts UTF-8 (there should already exist plenty of them, well-tested) and transform that into an NFA that starts in every intermediate state. Then validate that after processing the substring, the NFA is still in some valid state. — After doing this, you have two implementations that should behave the same. Use a fuzzer to generate random byte sequences and ensure that both implementations indeed behave the same.

Question 4

maybe put the regexes in a state variable or in a package variable to avoid recompiling them on each call to the subroutine?

Question 5

Automated tests might not help you realize what your wrong assumptions are, but it would help us find and tell you what your wrong assumptions are. Since that doesn't exist it would be nice to include some usage examples. Don't you want some of those for your documentation anyway?

Question 6

@ais523 When you don't generate the DFA yourself but use a preexisting implementation, that implementation will not have any of your misconceptions. The first google hit was this awesome simple and well-documented piece of code. To compare this with your code, you would just have to allocate an array of 8 state variables, initialize them to {0, 2, 3, 4, 5, 6, 7, 8}, run the DFA over your substring for each of these states and see if any of these states is still valid, that is, not 1.

Question 7

Performance improvements:

# Checks if the provided UTF-8 is complete (doesn't end mid-sequence).
sub is_complete_utf8 {
 my $utf8 = shift;
 # This will be used against a reversed string.
 state $incomplete_seq = qr/
 ^
 (?: (?&lead_of_2)
 | (?&lead_of_3)
 | (?&lead_of_4)
 | (?&cont_byte) (?: (?&lead_of_3)
 | (?&lead_of_4)
 | (?&cont_byte) (?&lead_of_4) ))
 (?(DEFINE)
 (?<lead_of_1> [\x00-\x7F] )
 (?<cont_byte> [\x80-\xBF] )
 (?<lead_of_2> [\xC0-\xDF] )
 (?<lead_of_3> [\xE0-\xEF] )
 (?<lead_of_4> [\xF0-\xF7] )
 (?<invalid> [\xF8-\xFF] )
 )
 /x;
 return reverse(substr($utf8, -4)) !~ $incomplete_seq;
}

Notes:

Adjust to handle invalid UTF-8 as desired.
I started by writing a version that used a pattern that looked for complete sequences, but it handled invalid UTF-8 poorly. If invalid seqences are to be considered complete, it's cleaner and simpler to match incomplete sequences.
There's no performance benefit to using qr// here; I did that for readability.
Performance could benefit from creating a class equivalent to (?&lead_of_2) | (?&lead_of_3) | (?&lead_of_4) and one equivalent to (?&lead_of_3) | (?&lead_of_4).
Performance would benefit from writing this function in C instead of using the regex engine.

Comments on your code:

This sub is expected to return a scalar (something true or false). And while your sub does that when it's invoked in scalar context, it produces surprising results in list context. Replace return with return 0;.
Is 1; really better than return 1;?
The name of the sub was hard to understand. I reversed the sense of the function (now returns true if complete instead of true if incomplete) to provide a better name. (You could use is_incomplete_utf8, but that could lead to double-negation in the caller.)
There is a such as excessive commenting :)
It might be more useful to return a number indicating how many bytes to remove to get only complete sequences.
/aas is useless in your code.
- /aa affects some built-in characters classes (e.g. \s) and \b.
- /s affects ..

Question 8

Isn't this solving an entirely different task to my program? I want to distinguish between sequences like 0x80 0x81 0x82 which could validly appear somewhere inside a UTF-8 string, and 0x80 0x81 0x82 0x83 which can't. Your code seems to assume valid UTF-8, and checks to see whether it ends at a character boundary or not. (Incidentally, I was under the impression that return; is better than return 0; for producing falsey values, because the list (0) is truthy.)

Question 9

Re "Isn't this solving an entirely different task to my program?", Maybe. I did what the title says: Identify whether a string is a fragment of UTF-8. This is a common problem when reading from a file handle, sockets and pipes in particular.

Question 10

Re "Your code seems to assume valid UTF-8,", No, it correctly handles invalid UTF-8. It's not a validator, if that's what you mean. But you said you weren't looking for a validator.

Question 11

Re "because the list (0) is truthy.", Lists are neither true nor false because you can't evaluate the truthness of a list. Only scalars are true or false

Question 12

Say sub is_foo is expected to return a boolean. Most of the time, it might be used as if (is_foo()) { ... } or my $foo = is_foo(). In that situation, a return in is_foo would effectively be return undef, and this would work correctly. But one day, someone will do my %h = ( foo => is_foo(), bar => "moo" ); That will fail with return. So say what you mean and use return undef (or return 0) explicitly instead of relying on return returning undef.

ikegami ikegami 2741 silver badge9 bronze badges · Answer 1 · 2020-05-28 20:53:53Z

Performance improvements:

# Checks if the provided UTF-8 is complete (doesn't end mid-sequence).
sub is_complete_utf8 {
 my $utf8 = shift;
 # This will be used against a reversed string.
 state $incomplete_seq = qr/
 ^
 (?: (?&lead_of_2)
 | (?&lead_of_3)
 | (?&lead_of_4)
 | (?&cont_byte) (?: (?&lead_of_3)
 | (?&lead_of_4)
 | (?&cont_byte) (?&lead_of_4) ))
 (?(DEFINE)
 (?<lead_of_1> [\x00-\x7F] )
 (?<cont_byte> [\x80-\xBF] )
 (?<lead_of_2> [\xC0-\xDF] )
 (?<lead_of_3> [\xE0-\xEF] )
 (?<lead_of_4> [\xF0-\xF7] )
 (?<invalid> [\xF8-\xFF] )
 )
 /x;
 return reverse(substr($utf8, -4)) !~ $incomplete_seq;
}

Notes:

Adjust to handle invalid UTF-8 as desired.
I started by writing a version that used a pattern that looked for complete sequences, but it handled invalid UTF-8 poorly. If invalid seqences are to be considered complete, it's cleaner and simpler to match incomplete sequences.
There's no performance benefit to using qr// here; I did that for readability.
Performance could benefit from creating a class equivalent to (?&lead_of_2) | (?&lead_of_3) | (?&lead_of_4) and one equivalent to (?&lead_of_3) | (?&lead_of_4).
Performance would benefit from writing this function in C instead of using the regex engine.

Comments on your code:

This sub is expected to return a scalar (something true or false). And while your sub does that when it's invoked in scalar context, it produces surprising results in list context. Replace return with return 0;.
Is 1; really better than return 1;?
The name of the sub was hard to understand. I reversed the sense of the function (now returns true if complete instead of true if incomplete) to provide a better name. (You could use is_incomplete_utf8, but that could lead to double-negation in the caller.)
There is a such as excessive commenting :)
It might be more useful to return a number indicating how many bytes to remove to get only complete sequences.
/aas is useless in your code.
- /aa affects some built-in characters classes (e.g. \s) and \b.
- /s affects ..

Isn't this solving an entirely different task to my program? I want to distinguish between sequences like 0x80 0x81 0x82 which could validly appear somewhere inside a UTF-8 string, and 0x80 0x81 0x82 0x83 which can't. Your code seems to assume valid UTF-8, and checks to see whether it ends at a character boundary or not. (Incidentally, I was under the impression that return; is better than return 0; for producing falsey values, because the list (0) is truthy.)
Re "Isn't this solving an entirely different task to my program?", Maybe. I did what the title says: Identify whether a string is a fragment of UTF-8. This is a common problem when reading from a file handle, sockets and pipes in particular.
Re "Your code seems to assume valid UTF-8,", No, it correctly handles invalid UTF-8. It's not a validator, if that's what you mean. But you said you weren't looking for a validator.
Re "because the list (0) is truthy.", Lists are neither true nor false because you can't evaluate the truthness of a list. Only scalars are true or false
Say sub is_foo is expected to return a boolean. Most of the time, it might be used as if (is_foo()) { ... } or my $foo = is_foo(). In that situation, a return in is_foo would effectively be return undef, and this would work correctly. But one day, someone will do my %h = ( foo => is_foo(), bar => "moo" ); That will fail with return. So say what you mean and use return undef (or return 0) explicitly instead of relying on return returning undef.

Stack Exchange Network

Checking whether a string fragment could be part of a longer UTF-8 string

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Checking whether a string fragment could be part of a longer UTF-8 string

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions