
As you probably know, in Perl "utf8" refers to Perl's looser, extended form of UTF-8, which accepts code points that are not valid in standard UTF-8. By contrast, "UTF-8" (or "utf-8") is Perl's strict form of UTF-8, which rejects those code points.

I have a few usage questions related to this distinction:

  1. Encode::encode by default will replace invalid characters with a substitution character. Is that true even if you are passing the looser "utf8" as the encoding?

  2. What happens when you read and write files which were open'd using "UTF-8"? Does character substitution happen to bad characters or does something else happen?

  3. What is the difference between using open with a layer like '>:utf8' and a layer like '>:encoding(utf8)' ? Can both approaches be used with both 'utf8' and 'UTF-8'?

asked Feb 28, 2018 at 21:04

1 Answer

                 | On read:                   | On read:                  | On write:
                 | invalid encoding other     | outside of Unicode,       | outside of Unicode,
Layer            | than sequence length       | Unicode nonchar, or       | Unicode nonchar, or
                 |                            | Unicode surrogate         | Unicode surrogate
-----------------+----------------------------+---------------------------+---------------------------
:encoding(UTF-8) | Warns and Replaces         | Warns and Replaces        | Warns and Replaces
:encoding(utf8)  | Warns and Replaces         | Silently accepts          | Warns and Outputs
:utf8            | Corrupt scalar             | Silently accepts          | Warns and Outputs

(This is the state in Perl 5.26.)

Note that :encoding(UTF-8) actually decodes using utf8, then checks if the resulting character is in the acceptable range. This reduces the number of error messages for bad input, so it's good.

(Encoding names are case-insensitive.)


Tests used to generate the above table:

On read

  • :encoding(UTF-8)

     $ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
     perl -MB -nle'
     use open ":std", ":encoding(UTF-8)";
     my $sv = B::svref_2object(\$_);
     printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
     '
     utf8 "\xFFFF" does not map to Unicode.
     utf8 "\xD800" does not map to Unicode.
     utf8 "\x200000" does not map to Unicode.
     utf8 "\x80" does not map to Unicode.
     E9 (internal: C3.A9, UTF8=1)
     5C.78.7B.46.46.46.46.7D = \x{FFFF} (internal: 5C.78.7B.46.46.46.46.7D, UTF8=1)
     5C.78.7B.44.38.30.30.7D = \x{D800} (internal: 5C.78.7B.44.38.30.30.7D, UTF8=1)
     5C.78.7B.32.30.30.30.30.30.7D = \x{200000} (internal: 5C.78.7B.32.30.30.30.30.30.7D, UTF8=1)
     5C.78.38.30 = \x80 (internal: 5C.78.38.30, UTF8=1)
    
  • :encoding(utf8)

     $ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
     perl -MB -nle'
     use open ":std", ":encoding(utf8)";
     my $sv = B::svref_2object(\$_);
     printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
     '
     utf8 "\x80" does not map to Unicode.
     E9 (internal: C3.A9, UTF8=1)
     FFFF (internal: EF.BF.BF, UTF8=1)
     D800 (internal: ED.A0.80, UTF8=1)
     200000 (internal: F8.88.80.80.80, UTF8=1)
     5C.78.38.30 = \x80 (internal: 5C.78.38.30, UTF8=1)
    
  • :utf8

     $ printf "\xC3\xA9\n\xEF\xBF\xBF\n\xED\xA0\x80\n\xF8\x88\x80\x80\x80\n\x80\n" |
     perl -MB -nle'
     use open ":std", ":utf8";
     my $sv = B::svref_2object(\$_);
     printf "%vX%s (internal: %vX, UTF8=%d)\n", $_, length($_)==1 ? "" : " = $_", $sv->PVX, utf8::is_utf8($_);
     '
     E9 (internal: C3.A9, UTF8=1)
     FFFF (internal: EF.BF.BF, UTF8=1)
     D800 (internal: ED.A0.80, UTF8=1)
     200000 (internal: F8.88.80.80.80, UTF8=1)
     Malformed UTF-8 character: \x80 (unexpected continuation byte 0x80, with no preceding start byte) in printf at -e line 4, <> line 5.
     0 (internal: 80, UTF8=1)
    

On write

  • :encoding(UTF-8)

     $ perl -e'
     use open ":std", ":encoding(UTF-8)";
     print "\x{E9}\n";
     print "\x{FFFF}\n";
     print "\x{D800}\n";
     print "\x{20_0000}\n";
     ' >a
     Unicode non-character U+FFFF is not recommended for open interchange in print at -e line 4.
     Unicode surrogate U+D800 is illegal in UTF-8 at -e line 5.
     Code point 0x200000 is not Unicode, may not be portable in print at -e line 6.
     "\x{ffff}" does not map to utf8.
     "\x{d800}" does not map to utf8.
     "\x{200000}" does not map to utf8.
     $ od -t c a
     0000000 303 251 \n \ x { F F F F } \n \ x { D
     0000020 8 0 0 } \n \ x { 2 0 0 0 0 0 } \n
     0000040
     $ cat a
     é
     \x{FFFF}
     \x{D800}
     \x{200000}
    
  • :encoding(utf8)

     $ perl -e'
     use open ":std", ":encoding(utf8)";
     print "\x{E9}\n";
     print "\x{FFFF}\n";
     print "\x{D800}\n";
     print "\x{20_0000}\n";
     ' >a
     Unicode surrogate U+D800 is illegal in UTF-8 at -e line 4.
     Code point 0x200000 is not Unicode, may not be portable in print at -e line 5.
     $ od -t c a
     0000000 303 251 \n 355 240 200 \n 370 210 200 200 200 \n
     0000015
     $ cat a
     é
     ▒
     ▒
    
  • :utf8

    Same results as :encoding(utf8).

Tested using Perl 5.26.


Encode::encode by default will replace invalid characters with a substitution character. Is that true even if you are passing the looser "utf8" as the encoding?

Perl strings are strings of 32-bit or 64-bit characters depending on the build. utf8 can encode any 72-bit integer. It is therefore capable of encoding all characters it can be asked to encode, so Encode::encode never has anything to replace when the encoding is utf8.

answered Feb 28, 2018 at 23:07