A test data set for auto-testing a UTF-8 string validator

Question 1

I wrote a UTF-8 string validator function.

The function takes a buffer of bytes and its expected length in UTF-8 characters, and validates that the buffer consists of exactly the given number of valid UTF-8 characters.

If the buffer is too short or too long, or if it contains invalid UTF-8 characters, validation fails.
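To make the contract concrete, here is a minimal sketch in Python of a validator with that interface. The name `validate_utf8` and its parameters are assumptions for illustration, not the original function, and it leans on Python's strict built-in decoder rather than a hand-written state machine.

```python
def validate_utf8(buf: bytes, expected_chars: int) -> bool:
    """Return True if buf is well-formed UTF-8 holding exactly expected_chars code points.

    Hypothetical stand-in for the validator described above.
    """
    try:
        text = buf.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Ill-formed input: stray byte, overlong form, surrogate, truncated sequence, ...
        return False
    # len() of a Python str counts code points, taken here as "characters".
    return len(text) == expected_chars


print(validate_utf8(b"AB\xc3\x87", 3))  # well-formed, right count
print(validate_utf8(b"AB\xc3\x87", 4))  # well-formed, wrong count
print(validate_utf8(b"AB\xc3", 3))      # truncated 2-byte sequence
```

A function like this can also serve as a reference oracle: feed the same buffers to the validator under test and compare the verdicts.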

Now I want to write auto-tests for my validator.

Is there a data set that I can reuse?

I've found this file: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt, but it does not seem to suit my purposes well; as I understand it, it is meant more for visualization tests.

Any clues?

Question 2

The link you mention at www.cl.cam.ac.uk, which I already know, is pretty old and does not conform to the current Unicode recommendations for error handling and use of the substitution character, as described in Unicode §3.9 (see the maximal subparts of ill-formed sequences and their substitution with U+FFFD). As a test file, it is therefore not well suited to testing a fully conformant UTF-8 decoder. It's OK, but it does not follow some of the Unicode recommendations.
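As an illustration of that recommendation, here is a rough sketch of a decoder that emits one U+FFFD per maximal subpart of an ill-formed sequence. The function names are made up for this example, and the byte ranges are transcribed from the table of well-formed UTF-8 byte sequences in Unicode chapter 3, so treat it as a sketch to check against the standard rather than a reference implementation.

```python
REPLACEMENT = "\ufffd"


def _sequence_spec(lead: int):
    """Return (length, lowest, highest allowed 2nd byte) for a lead byte, or None."""
    if 0xC2 <= lead <= 0xDF:
        return 2, 0x80, 0xBF
    if lead == 0xE0:
        return 3, 0xA0, 0xBF   # excludes overlong 3-byte forms
    if lead == 0xED:
        return 3, 0x80, 0x9F   # excludes surrogates U+D800..U+DFFF
    if 0xE1 <= lead <= 0xEF:
        return 3, 0x80, 0xBF
    if lead == 0xF0:
        return 4, 0x90, 0xBF   # excludes overlong 4-byte forms
    if 0xF1 <= lead <= 0xF3:
        return 4, 0x80, 0xBF
    if lead == 0xF4:
        return 4, 0x80, 0x8F   # excludes code points above U+10FFFF
    return None                # 80..C1 and F5..FF never start a sequence


def decode_with_replacement(data: bytes) -> str:
    out = []
    i, n = 0, len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:
            out.append(chr(lead))
            i += 1
            continue
        spec = _sequence_spec(lead)
        if spec is None:
            out.append(REPLACEMENT)
            i += 1
            continue
        length, lo, hi = spec
        cp = lead & (0xFF >> (length + 1))
        j = i + 1
        for k in range(1, length):
            low, high = (lo, hi) if k == 1 else (0x80, 0xBF)
            if j >= n or not (low <= data[j] <= high):
                # Bytes i..j-1 form the maximal subpart: one U+FFFD for all of them.
                out.append(REPLACEMENT)
                break
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        else:
            out.append(chr(cp))
        i = j
    return "".join(out)
```

With this sketch, the input `61 F1 80 80 E1 80 C2 62 80 63 80 BF 64` decodes to `a`, three U+FFFD, `b`, one U+FFFD, `c`, two U+FFFD, `d`. The expected count of U+FFFD per ill-formed input makes a stricter test oracle than a plain pass/fail verdict.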

Question 3

- Valid UTF-8 data, to see that it passes
  - Strings containing characters needing 1 code unit, 2, 3, and 4! (Don't just test "ABC" or "café")
- Clearly invalid data, say some ISO-8859-1 string (that isn't also valid UTF-8)
- A string containing overlong forms (a 1-byte character encoded as 2 bytes, for example). These should not pass as UTF-8
- A string containing code points above U+10FFFF
- Everything listed here: http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
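Overlong forms and out-of-range code points cannot be produced by a normal encoder, so test data for them has to be built by hand. One way to do that is a deliberately unchecked encoder that packs any value into the UTF-8 bit layout; `raw_encode` below is a hypothetical helper for generating such data, not part of any library.

```python
def raw_encode(code_point: int, length: int) -> bytes:
    """Pack code_point into `length` bytes (2 to 4) using the UTF-8 bit layout.

    No validity checks beyond "does it fit": the result may be overlong,
    a surrogate, or above U+10FFFF. That is the point of this helper.
    """
    payload_bits = {2: 11, 3: 16, 4: 21}[length]
    if not 0 <= code_point < (1 << payload_bits):
        raise ValueError("code point does not fit in the requested length")
    lead_marker = {2: 0xC0, 3: 0xE0, 4: 0xF0}[length]
    tail = []
    for _ in range(length - 1):
        tail.append(0x80 | (code_point & 0x3F))  # low 6 bits go in a continuation byte
        code_point >>= 6
    return bytes([lead_marker | code_point] + tail[::-1])


print(raw_encode(0x43, 2).hex(" "))      # "C" as an overlong 2-byte form
print(raw_encode(0x2F, 3).hex(" "))      # "/" as an overlong 3-byte form
print(raw_encode(0x110000, 4).hex(" "))  # first code point above U+10FFFF
```

Each of these outputs should be rejected by a conformant validator, while something like `raw_encode(0xC7, 2)` gives the ordinary encoding of Ç and should pass.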

Depending on how good your code is:

Catching a UTF-8 string that encodes anything from U+D800 to U+DFFF (surrogate code points, which should never be present in a UTF-8 string)

Those test cases:

Should pass: "ABC" 41 42 43
Should pass: "ABÇ" 41 42 c3 87
Should pass: "ABḈ" 41 42 e1 b8 88
Should pass: "AB𝜍" (U+1D70D) 41 42 f0 9d 9c 8d
Should fail: Bad data 80 81 82 83
Should fail: Bad data C2 C3
Should fail: Overlong C0 43
Should fail: encodes U+140000 F5 80 80 80
Should fail: encodes U+110000 F4 90 80 80
Should fail: encodes U+D800 ED A0 80

(I've only sorta checked these, so double-triple check me if you get unexpected results.)
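The cases above translate fairly directly into a table-driven test. The sketch below is one way to lay it out in Python; `run_cases` and `reference_validator` are illustrative names, and the reference simply uses Python's strict decoder as a stand-in for the validator under test.

```python
# (label, bytes as hex, expected verdict), transcribed from the list above.
CASES = [
    ("ABC",                             "41 42 43",          True),
    ("AB + U+00C7 (2 bytes)",           "41 42 c3 87",       True),
    ("AB + U+1E08 (3 bytes)",           "41 42 e1 b8 88",    True),
    ("AB + U+1D70D (4 bytes)",          "41 42 f0 9d 9c 8d", True),
    ("stray continuation bytes",        "80 81 82 83",       False),
    ("lead bytes with no continuation", "C2 C3",             False),
    ("C0 lead byte (overlong)",         "C0 43",             False),
    ("encodes U+140000",                "F5 80 80 80",       False),
    ("encodes U+110000",                "F4 90 80 80",       False),
    ("encodes surrogate U+D800",        "ED A0 80",          False),
]


def run_cases(validator):
    """Return the labels of all cases where validator(buf) disagrees with the table."""
    return [
        label
        for label, hex_bytes, expected in CASES
        if validator(bytes.fromhex(hex_bytes)) != expected
    ]


def reference_validator(buf: bytes) -> bool:
    """Stand-in for the validator under test, built on Python's strict decoder."""
    try:
        buf.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(run_cases(reference_validator))  # an empty list means every case agreed
```

Returning the list of disagreeing labels, rather than stopping at the first failure, makes it easier to see which category of ill-formed input a validator misses.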

Question 4

I ended up loading UTF-8-test.txt line by line and comparing the result with the expected outcome (which I hardcoded in a map of line number -> ok/fail).

This works, but I would also like to get some cases for incomplete UTF-8 characters, buffer overruns, etc. So, if you know of an existing test suite (even one that is not reusable but can serve as inspiration), please post a link here.
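For the incomplete-character cases, one option is to derive them mechanically from valid strings by cutting the encoded bytes in the middle of a multi-byte character. The helper below is a hypothetical sketch of that idea in Python; every buffer it yields ends with a truncated sequence and should fail validation.

```python
def truncated_variants(text: str):
    """Yield every prefix of text's UTF-8 encoding that ends inside a character.

    Assumes text is a valid string (no lone surrogates), so encoding succeeds.
    """
    encoded = text.encode("utf-8")
    # Byte offsets at which a complete character ends.
    boundaries = set()
    pos = 0
    for ch in text:
        pos += len(ch.encode("utf-8"))
        boundaries.add(pos)
    for cut in range(1, len(encoded)):
        if cut not in boundaries:
            yield encoded[:cut]


for buf in truncated_variants("AB\u1e08"):
    print(buf.hex(" "))
```

For the buffer-overrun side, a native test harness could copy each variant into an exactly sized allocation so that memory-checking tools can flag any read past the end; that part is outside this sketch.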

Thanatos · Accepted Answer · 2011-04-02 23:26:06Z

