UTF-8 decoder capability and stress test
----------------------------------------

Markus Kuhn - 2015-08-28 - CC BY 4.0

This test file can help you examine how your UTF-8 decoder handles
various types of correct, malformed, or otherwise interesting UTF-8
sequences. This file is not meant to be a conformance test. It does
not prescribe any particular outcome. Therefore, there is no way to
"pass" or "fail" this test file, even though the text does suggest a
preferable decoder behaviour in some places. Its aim is, instead, to
help you think about, and test, the behaviour of your UTF-8 decoder on
a systematic collection of unusual inputs. Experience so far suggests
that most first-time authors of UTF-8 decoders find at least one
serious problem in their decoder using this file.

The test lines below cover boundary conditions, malformed UTF-8
sequences, as well as correctly encoded UTF-8 sequences of Unicode code
points that should never occur in a correct UTF-8 file.

According to ISO 10646-1:2000, sections D.7 and 2.3c, a device
receiving UTF-8 shall interpret a "malformed sequence in the same way
that it interprets a character that is outside the adopted subset" and
"characters that are not within the adopted subset shall be indicated
to the user" by a receiving device. One commonly used approach in
UTF-8 decoders is to replace any malformed UTF-8 sequence by a
replacement character (U+FFFD), which looks a bit like an inverted
question mark, or a similar symbol. It might be a good idea to
visually distinguish a malformed UTF-8 sequence from a correctly
encoded Unicode character that is just not available in the current
font but otherwise fully legal, even though ISO 10646-1 doesn't
mandate this. In any case, just ignoring malformed sequences or
unavailable characters does not conform to ISO 10646, will make
debugging more difficult, and can lead to user confusion.
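As a rough illustration of the difference between signalling and ignoring a
malformed sequence, the following sketch runs one sample through the three
standard error policies of Python's built-in UTF-8 codec. The names SAMPLE
and show_decoding are assumptions made for this example and are not part of
the test file.

```python
# Minimal sketch: three ways a decoder can react to malformed UTF-8.
# SAMPLE and show_decoding are illustrative names, not part of the test file.

SAMPLE = b'"\x80"'  # a lone continuation byte 0x80 between quotation marks


def show_decoding(data: bytes) -> dict:
    """Decode `data` with the three standard error policies of bytes.decode."""
    results = {}
    # Replace each malformed span with U+FFFD, so the user can see it.
    results["replace"] = data.decode("utf-8", errors="replace")
    # Silently drop malformed bytes (the behaviour the text warns against).
    results["ignore"] = data.decode("utf-8", errors="ignore")
    # Refuse the input altogether and report where decoding failed.
    try:
        results["strict"] = data.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        results["strict"] = f"error at byte {exc.start}: {exc.reason}"
    return results


if __name__ == "__main__":
    for policy, text in show_decoding(SAMPLE).items():
        print(f"{policy:8} {text!r}")
```

With the "ignore" policy the two quotation marks end up adjacent and the
reader gets no hint that a byte was dropped, which is the debugging hazard
described above.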
Please check whether a malformed UTF-8 sequence is (1) represented at
all, (2) represented by exactly one single replacement character (or
equivalent signal), and (3) the following quotation mark after an
illegal UTF-8 sequence is correctly displayed, i.e. proper
resynchronization takes place immediately after any malformed
sequence. This file says "THE END" in the last line, so if you don't
see that, your decoder crashed somehow before, which should always be
cause for concern.

All lines in this file are exactly 79 characters long (plus the line
feed). In addition, all lines end with "|", except for the two test
lines 2.1.1 and 2.2.1, which contain non-printable ASCII controls
U+0000 and U+007F. If you display this file with a fixed-width font,
these "|" characters should all line up in column 79 (right margin).
This allows you to test quickly whether your UTF-8 decoder finds the
correct number of characters in every line, that is, whether each
malformed sequence is replaced by a single replacement character.

Note that, as an alternative to the notion of malformed sequence used
here, it is also a perfectly acceptable (and in some situations even
preferable) solution to represent each individual byte of a malformed
sequence with a replacement character. If you follow this strategy in
your decoder, then please ignore the "|" column.
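The two replacement strategies and the column check described above can be
sketched in Python as follows. This is only an illustration under stated
assumptions: the handler name "replace-per-byte" and the function names are
made up for this example, and CPython's built-in "replace" handler delimits
malformed spans by its own rules, which may not match this file's notion of
a malformed sequence in every test line.

```python
import codecs

REPLACEMENT = "\ufffd"


def _one_per_byte(err):
    """Error handler: emit one U+FFFD for every byte of the malformed span."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    return REPLACEMENT * (err.end - err.start), err.end


# "replace-per-byte" is a hypothetical handler name chosen for this sketch.
codecs.register_error("replace-per-byte", _one_per_byte)


def decode_per_sequence(data: bytes) -> str:
    """One U+FFFD per malformed span, as delimited by the built-in codec."""
    return data.decode("utf-8", errors="replace")


def decode_per_byte(data: bytes) -> str:
    """One U+FFFD per byte of each malformed span."""
    return data.decode("utf-8", errors="replace-per-byte")


def pipe_in_column_79(line: str) -> bool:
    """True if the decoded line is 79 characters long and ends with '|'."""
    return len(line) == 79 and line.endswith("|")


if __name__ == "__main__":
    truncated = b'"\xef\xbf"'  # 3-byte sequence with its last byte missing
    print(repr(decode_per_sequence(truncated)))
    print(repr(decode_per_byte(truncated)))
```

In both cases the closing quotation mark survives, which is the
resynchronization property from point (3). Only the first strategy keeps
the "|" column aligned.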
Here come the tests:                                                          |
                                                                              |
1  Some correct UTF-8 text                                                    |
                                                                              |
You should see the Greek word 'kosme':       "κόσμε"                          |
                                                                              |
2  Boundary condition test cases                                              |
                                                                              |
2.1  First possible sequence of a certain length                              |
                                                                              |
2.1.1  1 byte  (U-00000000):        ""
2.1.2  2 bytes (U-00000080):        ""                                       |
2.1.3  3 bytes (U-00000800):        "ࠀ"                                       |
2.1.4  4 bytes (U-00010000):        "𐀀"                                       |
2.1.5  5 bytes (U-00200000):        "�"                                       |
2.1.6  6 bytes (U-04000000):        "�"                                       |
                                                                              |
2.2  Last possible sequence of a certain length                               |
                                                                              |
2.2.1  1 byte  (U-0000007F):        ""
2.2.2  2 bytes (U-000007FF):        "߿"                                       |
2.2.3  3 bytes (U-0000FFFF):        "�"                                       |
2.2.4  4 bytes (U-001FFFFF):        "�"                                       |
2.2.5  5 bytes (U-03FFFFFF):        "�"                                       |
2.2.6  6 bytes (U-7FFFFFFF):        "�"                                       |
                                                                              |
2.3  Other boundary conditions                                                |
                                                                              |
2.3.1  U-0000D7FF = ed 9f bf = "퟿"                                            |
2.3.2  U-0000E000 = ee 80 80 = ""                                            |
2.3.3  U-0000FFFD = ef bf bd = "�"                                            |
2.3.4  U-0010FFFF = f4 8f bf bf = "�"                                         |
2.3.5  U-00110000 = f4 90 80 80 = "�"                                         |
                                                                              |
3  Malformed sequences                                                        |
                                                                              |
3.1  Unexpected continuation bytes                                            |
                                                                              |
Each unexpected continuation byte should be separately signalled as a         |
malformed sequence of its own.                                                |
                                                                              |
3.1.1  First continuation byte 0x80: "�"                                      |
3.1.2  Last  continuation byte 0xbf: "�"                                      |
                                                                              |
3.1.3  2 continuation bytes: "��"                                             |
3.1.4  3 continuation bytes: "���"                                            |
3.1.5  4 continuation bytes: "����"                                           |
3.1.6  5 continuation bytes: "�����"                                          |
3.1.7  6 continuation bytes: "������"                                         |
3.1.8  7 continuation bytes: "�������"                                        |
                                                                              |
3.1.9  Sequence of all 64 possible continuation bytes (0x80-0xbf):            |
                                                                              |
   "����������������                                                          |
    ����������������                                                          |
    ����������������                                                          |
    ����������������"                                                         |
                                                                              |
3.2  Lonely start characters                                                  |
                                                                              |
3.2.1  All 32 first bytes of 2-byte sequences (0xc0-0xdf),                    |
       each followed by a space character:                                    |
                                                                              |
   "� � � � � � � � � � � � � � � �                                           |
    � � � � � � � � � � � � � � � � "                                         |
                                                                              |
3.2.2  All 16 first bytes of 3-byte sequences (0xe0-0xef),                    |
       each followed by a space character:                                    |
                                                                              |
   "� � � � � � � � � � � � � � � � "                                         |
                                                                              |
3.2.3  All 8 first bytes of 4-byte sequences (0xf0-0xf7),                     |
       each followed by a space character:                                    |
                                                                              |
   "� � � � � � � � "                                                         |
                                                                              |
3.2.4  All 4 first bytes of 5-byte sequences (0xf8-0xfb),                     |
       each followed by a space character:                                    |
                                                                              |
   "� � � � "                                                                 |
                                                                              |
3.2.5  All 2 first bytes of 6-byte sequences (0xfc-0xfd),                     |
       each followed by a space character:                                    |
                                                                              |
   "� � "                                                                     |
                                                                              |
3.3  Sequences with last continuation byte missing                            |
                                                                              |
All bytes of an incomplete sequence should be signalled as a single           |
malformed sequence, i.e., you should see only a single replacement            |
character in each of the next 10 tests. (Characters as in section 2)          |
                                                                              |
3.3.1  2-byte sequence with last byte missing (U+0000):     "�"               |
3.3.2  3-byte sequence with last byte missing (U+0000):     "�"               |
3.3.3  4-byte sequence with last byte missing (U+0000):     "�"               |
3.3.4  5-byte sequence with last byte missing (U+0000):     "�"               |
3.3.5  6-byte sequence with last byte missing (U+0000):     "�"               |
3.3.6  2-byte sequence with last byte missing (U-000007FF): "�"               |
3.3.7  3-byte sequence with last byte missing (U-0000FFFF): "�"               |
3.3.8  4-byte sequence with last byte missing (U-001FFFFF): "�"               |
3.3.9  5-byte sequence with last byte missing (U-03FFFFFF): "�"               |
3.3.10 6-byte sequence with last byte missing (U-7FFFFFFF): "�"               |
                                                                              |
3.4  Concatenation of incomplete sequences                                    |
                                                                              |
All the 10 sequences of 3.3 concatenated, you should see 10 malformed         |
sequences being signalled:                                                    |
                                                                              |
   "����������"                                                               |
                                                                              |
3.5  Impossible bytes                                                         |
                                                                              |
The following two bytes cannot appear in a correct UTF-8 string:              |
                                                                              |
3.5.1  fe = "�"                                                               |
3.5.2  ff = "�"                                                               |
3.5.3  fe fe ff ff = "����"                                                   |
                                                                              |
4  Overlong sequences                                                         |
                                                                              |
The following sequences are not malformed according to the letter of          |
the Unicode 2.0 standard. However, they are longer than necessary and         |
a correct UTF-8 encoder is not allowed to produce them. A "safe UTF-8         |
decoder" should reject them just like malformed sequences for two             |
reasons: (1) It helps to debug applications if overlong sequences are         |
not treated as valid representations of characters, because this helps        |
to spot problems more quickly. (2) Overlong sequences provide                 |
alternative representations of characters, that could maliciously be          |
used to bypass filters that check only for ASCII characters. For              |
instance, a 2-byte encoded line feed (LF) would not be caught by a            |
line counter that counts only 0x0a bytes, but it would still be               |
processed as a line feed by an unsafe UTF-8 decoder later in the              |
pipeline. From a security point of view, ASCII compatibility of UTF-8         |
sequences also means that ASCII characters are *only* allowed to be           |
represented by ASCII bytes in the range 0x00-0x7f. To ensure this             |
aspect of ASCII compatibility, use only "safe UTF-8 decoders" that            |
reject overlong UTF-8 sequences for which a shorter encoding exists.          |
                                                                              |
4.1  Examples of an overlong ASCII character                                  |
                                                                              |
With a safe UTF-8 decoder, all of the following five overlong                 |
representations of the ASCII character slash ("/") should be rejected         |
like a malformed UTF-8 sequence, for instance by substituting it with         |
a replacement character. If you see a slash below, you do not have a          |
safe UTF-8 decoder!                                                           |
                                                                              |
4.1.1  U+002F = c0 af             = "�"                                       |
4.1.2  U+002F = e0 80 af          = "�"                                       |
4.1.3  U+002F = f0 80 80 af       = "�"                                       |
4.1.4  U+002F = f8 80 80 80 af    = "�"                                       |
4.1.5  U+002F = fc 80 80 80 80 af = "�"                                       |
                                                                              |
4.2  Maximum overlong sequences                                               |
                                                                              |
Below you see the highest Unicode value that still results in an              |
overlong sequence if represented with the given number of bytes. This         |
is a boundary test for safe UTF-8 decoders. All five characters should        |
be rejected like malformed UTF-8 sequences.                                   |
                                                                              |
4.2.1  U-0000007F = c1 bf             = "�"                                   |
4.2.2  U-000007FF = e0 9f bf          = "�"                                   |
4.2.3  U-0000FFFF = f0 8f bf bf       = "�"                                   |
4.2.4  U-001FFFFF = f8 87 bf bf bf    = "�"                                   |
4.2.5  U-03FFFFFF = fc 83 bf bf bf bf = "�"                                   |
                                                                              |
4.3  Overlong representation of the NUL character                             |
                                                                              |
The following five sequences should also be rejected like malformed           |
UTF-8 sequences and should not be treated like the ASCII NUL                  |
character.                                                                    |
                                                                              |
4.3.1  U+0000 = c0 80             = "�"                                       |
4.3.2  U+0000 = e0 80 80          = "�"                                       |
4.3.3  U+0000 = f0 80 80 80       = "�"                                       |
4.3.4  U+0000 = f8 80 80 80 80    = "�"                                       |
4.3.5  U+0000 = fc 80 80 80 80 80 = "�"                                       |
                                                                              |
5  Illegal code positions                                                     |
                                                                              |
The following UTF-8 sequences should be rejected like malformed               |
sequences, because they never represent valid ISO 10646 characters and        |
a UTF-8 decoder that accepts them might introduce security problems           |
comparable to overlong UTF-8 sequences.                                       |
                                                                              |
5.1  Single UTF-16 surrogates                                                 |
                                                                              |
5.1.1  U+D800 = ed a0 80 = "�"                                                |
5.1.2  U+DB7F = ed ad bf = "�"                                                |
5.1.3  U+DB80 = ed ae 80 = "�"                                                |
5.1.4  U+DBFF = ed af bf = "�"                                                |
5.1.5  U+DC00 = ed b0 80 = "�"                                                |
5.1.6  U+DF80 = ed be 80 = "�"                                                |
5.1.7  U+DFFF = ed bf bf = "�"                                                |
                                                                              |
5.2  Paired UTF-16 surrogates                                                 |
                                                                              |
5.2.1  U+D800 U+DC00 = ed a0 80 ed b0 80 = "��"                               |
5.2.2  U+D800 U+DFFF = ed a0 80 ed bf bf = "��"                               |
5.2.3  U+DB7F U+DC00 = ed ad bf ed b0 80 = "��"                               |
5.2.4  U+DB7F U+DFFF = ed ad bf ed bf bf = "��"                               |
5.2.5  U+DB80 U+DC00 = ed ae 80 ed b0 80 = "��"                               |
5.2.6  U+DB80 U+DFFF = ed ae 80 ed bf bf = "��"                               |
5.2.7  U+DBFF U+DC00 = ed af bf ed b0 80 = "��"                               |
5.2.8  U+DBFF U+DFFF = ed af bf ed bf bf = "��"                               |
                                                                              |
5.3  Noncharacter code positions                                              |
                                                                              |
The following "noncharacters" are "reserved for internal use" by              |
applications, and according to older versions of the Unicode Standard         |
"should never be interchanged". Unicode Corrigendum #9 dropped the            |
latter restriction. Nevertheless, their presence in incoming UTF-8 data       |
can remain a potential security risk, depending on what use is made of        |
these codes subsequently. Examples of such internal use:                      |
                                                                              |
 - Some file APIs with 16-bit characters may use the integer value -1         |
   = U+FFFF to signal an end-of-file (EOF) or error condition.                |
                                                                              |
 - In some UTF-16 receivers, code point U+FFFE might trigger a                |
   byte-swap operation (to convert between UTF-16LE and UTF-16BE).            |
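The rejection rules from sections 4 and 5 up to this point can be condensed
into a small check that a decoder might apply once it has assembled a value
from a sequence of a given length. This is a minimal sketch, not a complete
decoder: the names MIN_CODE_POINT and reject_reason are assumptions made for
this example, and whether noncharacters are rejected at all is a policy
choice rather than a requirement.

```python
# Smallest value that legitimately needs a sequence of each length
# (the values listed in section 2.1 of this file).
MIN_CODE_POINT = {1: 0x0, 2: 0x80, 3: 0x800, 4: 0x10000,
                  5: 0x200000, 6: 0x4000000}


def reject_reason(code_point: int, length: int):
    """Return why a decoded value should be rejected, or None if acceptable.

    `length` is the number of bytes of the sequence the value came from.
    """
    if code_point < MIN_CODE_POINT[length]:
        return "overlong"          # a shorter encoding exists
    if code_point > 0x10FFFF:
        return "out of range"      # beyond the last Unicode code point
    if 0xD800 <= code_point <= 0xDFFF:
        return "surrogate"         # UTF-16 surrogate, never valid in UTF-8
    if 0xFDD0 <= code_point <= 0xFDEF or (code_point & 0xFFFE) == 0xFFFE:
        return "noncharacter"      # U+FDD0..U+FDEF, U+nFFFE, U+nFFFF
    return None


if __name__ == "__main__":
    print(reject_reason(0x2F, 2))    # slash encoded as c0 af
    print(reject_reason(0xD800, 3))  # ed a0 80
    print(reject_reason(0x20AC, 3))  # an ordinary 3-byte character
```

Checking the overlong condition first means that a value such as U+0000
arriving in a 2-byte sequence is reported as overlong rather than accepted
as NUL.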
                                                                              |
With such internal use of noncharacters, it may be desirable and safer        |
to block those code points in UTF-8 decoders, as they should never            |
occur legitimately in incoming UTF-8 data, and could trigger unsafe           |
behaviour in subsequent processing.                                           |
                                                                              |
Particularly problematic noncharacters in 16-bit applications:                |
                                                                              |
5.3.1  U+FFFE = ef bf be = "�"                                                |
5.3.2  U+FFFF = ef bf bf = "�"                                                |
                                                                              |
Other noncharacters:                                                          |
                                                                              |
5.3.3  U+FDD0 .. U+FDEF = "����������������████████████████"|
                                                                              |
5.3.4  U+nFFFE U+nFFFF (for n = 1..10)                                        |
                                                                              |
       "����������������                                                      |
        ����������������"                                                     |
                                                                              |
THE END                                                                       |
</div>