Although UTF-8 validation is a common task, I'm trying to solve a slightly different task; given a string of bytes, work out whether it could potentially be a fragment of a valid UTF-8 string. That's a much less common task, so I thought it'd be easiest to write the code myself. Here's the Perl subroutine I came up with:
# Takes one argument, a string with codepoints in the range 0-255
# (representing the bytes of input); returns truthy if and only if some
# valid UTF-8 string has that string as a substring.
sub could_be_substring_of_valid_utf8 {
# We're using regexes on bytes (encoded as codepoints), so
# codepoints don't have their normal Unicode meaning. Tell Perl
# that ASCII = ASCII (it does in UTF-8), but newlines and
# codepoints from 128 up are not special.
use re '/aas';
my $substring = shift;
my $cb = qr/[\x80-\xbf]/; # continuation byte
my $ncb = qr/[^\x80-\xbf]/; # not a continuation byte
my $lb1 = qr/[\xc0-\xdf]/; # leading byte for 1 continuation byte
my $lb2 = qr/[\xe0-\xef]/; # leading byte for 2 continuation bytes
my $lb3 = qr/[\xf0-\xf7]/; # leading byte for 3 continuation bytes
for ($substring) {
# Bytes F8 to FF are never legal in UTF-8 (they'd be leading
# bytes for 4 or more continuation bytes, never necessary to
# represent something within the Unicode range). Bytes F5-F7
# are likewise illegal, because any string starting 0xF5 must
# represent a codepoint of at least U+140000, higher than the
# maximum codepoint of U+10FFFF.
return if /[\xf5-\xff]/;
# Likewise, if we see 4 or more continuation bytes in a row,
# that's also obviously illegal even if we can't see the
# leading byte.
return if /$cb{4}/;
# A string of continuation bytes cannot appear without a leading
# byte to its left, so each individual continuation byte requires
# a leading or continuation byte to its left.
return if /[\x00-\x7f]$cb/;
# A string of leading bytes requires exactly that many continuation
# bytes after it.
return if /($lb1|$lb2|$lb3)$ncb/; # missing first continuation byte
return if /($lb2|$lb3).$ncb/; # missing second continuation byte
return if /$lb3..$ncb/; # missing third continuation byte
return if /$lb1$cb$cb/; # too many continuation bytes
return if /$lb2$cb$cb$cb/; # too many continuation bytes
return if /$lb3$cb$cb$cb$cb/; # too many continuation bytes
# Codepoints U+D800 to U+DFFF must not be encoded in UTF-8 (as
# they're used only as an internal part of UTF-16); ban UTF-8
# sequences that attempt to do that. The encoding of U+D???
# always starts with 0xED, and the MSB of the next nybble
# appears in the 0x20s place of the first continuation byte,
# which is enough information to recognise one even from a
# partial encoding.
return if /\xed[\xa0-\xbf]/; # encodes a surrogate
# Codepoints U+FFFE and U+FFFF must not be encoded in UTF-8.
# Their encodings are 0xEF 0xBF 0xBE and 0xEF 0xBF 0xBF.
return if /\xef\xbf[\xbe\xbf]/;
# Codepoints above U+10FFFF must not be encoded in UTF-8.
# Encodings starting 0xF4 0x9? encode codepoints of the form
# U+11????, and are therefore illegal; likewise, 0xF4 0xA?
# and 0xF4 0xB? encode codepoints of the form U+12???? and
# U+13???? respectively and are illegal. (Codepoints from
# U+140000 upwards have been excluded already by the check
# on illegal bytes.)
return if /\xf4[\x90-\xbf]/;
}
1;
}
As is recommended nowadays for handling raw bytes in Perl, the strings of bytes are represented as strings where each codepoint in the string represents the byte with the corresponding number (e.g. the byte 0xA0 is encoded as "\x{A0}"); this means that you don't have to worry about how the string is represented internally within Perl (as it happens, it has an optimised internal encoding for strings with codepoints in the 0–255 range, but this code doesn't care about whether it's in use or not).
I believe this subroutine is working, but it's fairly complex and I find it hard to be confident that the code is written as well as it could be. My primary concern is code clarity; anything that would make it more obvious that the code is correct would be useful (especially improvements to things like comments and identifiers). I also care about portability (avoiding experimental features or features only present in newer Perls unless they would be make the code more correct, or particularly faster or clearer), which is why I've used for
to set a local binding to $_
rather than the clearer given
; the code should currently work in Perl 5.14 or later, although I haven't tested it on old versions.
I initially wrote this without worrying about performance. It isn't a major concern yet, but may become so in future, so I'd take any tips for improving performance as long as they don't make the code significantly harder to read. (In particular, I suspect that combining everything into one regex might be more efficient; is it possible to do that while keeping things readable?)
1 Answer 1
Performance improvements:
# Checks if the provided UTF-8 is complete (doesn't end mid-sequence).
sub is_complete_utf8 {
my $utf8 = shift;
# This will be used against a reversed string.
state $incomplete_seq = qr/
^
(?: (?&lead_of_2)
| (?&lead_of_3)
| (?&lead_of_4)
| (?&cont_byte) (?: (?&lead_of_3)
| (?&lead_of_4)
| (?&cont_byte) (?&lead_of_4) ))
(?(DEFINE)
(?<lead_of_1> [\x00-\x7F] )
(?<cont_byte> [\x80-\xBF] )
(?<lead_of_2> [\xC0-\xDF] )
(?<lead_of_3> [\xE0-\xEF] )
(?<lead_of_4> [\xF0-\xF7] )
(?<invalid> [\xF8-\xFF] )
)
/x;
return reverse(substr($utf8, -4)) !~ $incomplete_seq;
}
Notes:
- Adjust to handle invalid UTF-8 as desired.
- I started by writing a version that used a pattern that looked for complete sequences, but it handled invalid UTF-8 poorly. If invalid seqences are to be considered complete, it's cleaner and simpler to match incomplete sequences.
- There's no performance benefit to using
qr//
here; I did that for readability. - Performance could benefit from creating a class equivalent to
(?&lead_of_2) | (?&lead_of_3) | (?&lead_of_4)
and one equivalent to(?&lead_of_3) | (?&lead_of_4)
. - Performance would benefit from writing this function in C instead of using the regex engine.
Comments on your code:
- This sub is expected to return a scalar (something true or false). And while your sub does that when it's invoked in scalar context, it produces surprising results in list context. Replace
return
withreturn 0;
. - Is
1;
really better thanreturn 1;
? - The name of the sub was hard to understand. I reversed the sense of the function (now returns true if complete instead of true if incomplete) to provide a better name. (You could use
is_incomplete_utf8
, but that could lead to double-negation in the caller.) - There is a such as excessive commenting :)
- It might be more useful to return a number indicating how many bytes to remove to get only complete sequences.
/aas
is useless in your code./aa
affects some built-in characters classes (e.g.\s
) and\b
./s
affects.
.
-
\$\begingroup\$ Isn't this solving an entirely different task to my program? I want to distinguish between sequences like 0x80 0x81 0x82 which could validly appear somewhere inside a UTF-8 string, and 0x80 0x81 0x82 0x83 which can't. Your code seems to assume valid UTF-8, and checks to see whether it ends at a character boundary or not. (Incidentally, I was under the impression that
return;
is better thanreturn 0;
for producing falsey values, because the list(0)
is truthy.) \$\endgroup\$ais523– ais5232020年05月29日 21:58:57 +00:00Commented May 29, 2020 at 21:58 -
1\$\begingroup\$ Re "Isn't this solving an entirely different task to my program?", Maybe. I did what the title says: Identify whether a string is a fragment of UTF-8. This is a common problem when reading from a file handle, sockets and pipes in particular. \$\endgroup\$ikegami– ikegami2020年05月30日 02:53:59 +00:00Commented May 30, 2020 at 2:53
-
1\$\begingroup\$ Re "Your code seems to assume valid UTF-8,", No, it correctly handles invalid UTF-8. It's not a validator, if that's what you mean. But you said you weren't looking for a validator. \$\endgroup\$ikegami– ikegami2020年05月30日 03:12:29 +00:00Commented May 30, 2020 at 3:12
-
1
-
1\$\begingroup\$ Say sub
is_foo
is expected to return a boolean. Most of the time, it might be used asif (is_foo()) { ... }
ormy $foo = is_foo()
. In that situation, areturn
inis_foo
would effectively bereturn undef
, and this would work correctly. But one day, someone will domy %h = ( foo => is_foo(), bar => "moo" );
That will fail withreturn
. So say what you mean and usereturn undef
(orreturn 0
) explicitly instead of relying onreturn
returningundef
. \$\endgroup\$ikegami– ikegami2020年05月30日 03:13:04 +00:00Commented May 30, 2020 at 3:13
{0, 2, 3, 4, 5, 6, 7, 8}
, run the DFA over your substring for each of these states and see if any of these states is still valid, that is, not1
. \$\endgroup\$