I wrote a UTF-8 string validator function.
The function takes a buffer of bytes and its length in UTF-8 characters, and validates that the buffer consists of exactly the given number of valid UTF-8 characters.
If the buffer is too short or too long, or if it contains invalid UTF-8 characters, validation fails.
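For illustration, the contract has roughly this shape (the names and the explicit byte-length parameter are just for the sketch, not my actual code):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative signature only: returns true iff `buf` (buflen bytes)
 * consists of exactly `nchars` well-formed UTF-8 characters -- no bytes
 * left over, no truncated character at the end, and no ill-formed
 * sequence anywhere. */
bool utf8_validate_exact(const uint8_t *buf, size_t buflen, size_t nchars);
```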
Now I want to write automated tests for my validator.
Is there an existing data set that I can reuse?
I've found this file: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt, but it does not seem to suit my purposes well; as I understand it, it is aimed more at visualization tests.
Any clues?
The link you mention at www.cl.cam.ac.uk, which I already know, is pretty old and does not conform to the current Unicode recommendations for error handling and use of the substitution character, as described in Unicode §3.9 (see ill-formed sequence "maximal subparts" and their substitution with U+FFFD). Hence, as a test file, it is not well suited to testing a fully conformant UTF-8 decoder. It's OK, but it does not follow some of the Unicode recommendations. – Hibou57 (Apr 5, 2013)
2 Answers
- Valid UTF-8 data, to see that it passes
- Strings containing characters that need 1, 2, 3, and 4 code units (don't just test "ABC" or "café")
- Clearly invalid data, say some ISO-8859-1 string (that isn't also valid UTF-8)
- A string containing overlong forms (a 1-byte character encoded as 2 bytes, for example); these should not pass as UTF-8 (see the range-check sketch after this list)
- A string containing code points above U+10FFFF
- Everything listed here: http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
Depending on how good your code is:
- Catching a UTF-8 string that encodes anything from U+D800 to U+DFFF (surrogate pairs, which should never be present in a UTF-8 string)
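The checks behind the last three kinds of bad input (overlong forms, code points above U+10FFFF, surrogates) come down to a few code-point range tests after decoding. A rough sketch, not tied to any particular decoder:

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch only: `cp` and `nbytes` are assumed to come from an ordinary
 * UTF-8 decode step (so the per-length maximums are already implicit
 * in how many bytes were consumed). */
static bool utf8_codepoint_ok(uint32_t cp, int nbytes)
{
    /* Overlong check: smallest code point each sequence length may encode. */
    static const uint32_t min_for_len[5] = { 0, 0x00, 0x80, 0x800, 0x10000 };

    if (nbytes < 1 || nbytes > 4 || cp < min_for_len[nbytes])
        return false;                    /* overlong form (or bad length) */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;                    /* UTF-16 surrogate, never valid in UTF-8 */
    if (cp > 0x10FFFF)
        return false;                    /* beyond the Unicode code space */
    return true;
}
```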
Some concrete test cases:
Should pass: "ABC"  41 42 43
Should pass: "ABÇ"  41 42 C3 87
Should pass: "ABḈ"  41 42 E1 B8 88
Should pass: "AB𝜍" (U+1D70D, a four-byte character)  41 42 F0 9D 9C 8D
Should fail: bad data  80 81 82 83
Should fail: bad data  C2 C3
Should fail: overlong  C0 43
Should fail: encodes U+140000 (above U+10FFFF)  F5 80 80 80
Should fail: encodes U+110000 (above U+10FFFF)  F4 90 80 80
Should fail: encodes U+D800 (a surrogate)  ED A0 80
(I've only sorta checked these, so double-triple check me if you get unexpected results.)
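One way to wire these up is a small table-driven test. A sketch in C, assuming the hypothetical `utf8_validate_exact` signature from the question (adjust it to your actual API):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* The validator under test; hypothetical signature, provided elsewhere. */
bool utf8_validate_exact(const uint8_t *buf, size_t buflen, size_t nchars);

struct tcase {
    const char *name;
    uint8_t bytes[8];
    size_t len;         /* byte length */
    size_t nchars;      /* character count passed to the validator */
    bool should_pass;
};

static const struct tcase cases[] = {
    { "ABC",              {0x41,0x42,0x43},                3, 3, true  },
    { "AB U+00C7",        {0x41,0x42,0xC3,0x87},           4, 3, true  },
    { "AB U+1E08",        {0x41,0x42,0xE1,0xB8,0x88},      5, 3, true  },
    { "AB U+1D70D",       {0x41,0x42,0xF0,0x9D,0x9C,0x8D}, 6, 3, true  },
    { "bad data",         {0x80,0x81,0x82,0x83},           4, 4, false },
    { "bad data",         {0xC2,0xC3},                     2, 2, false },
    { "overlong",         {0xC0,0x43},                     2, 2, false },
    { "encodes U+140000", {0xF5,0x80,0x80,0x80},           4, 1, false },
    { "encodes U+110000", {0xF4,0x90,0x80,0x80},           4, 1, false },
    { "encodes U+D800",   {0xED,0xA0,0x80},                3, 1, false },
};

int main(void)
{
    int failures = 0;
    for (size_t i = 0; i < sizeof cases / sizeof cases[0]; i++) {
        /* For the "should fail" rows the character count is nominal:
         * the bytes are ill-formed, so validation must fail regardless. */
        bool got = utf8_validate_exact(cases[i].bytes, cases[i].len, cases[i].nchars);
        if (got != cases[i].should_pass) {
            printf("FAIL: %s\n", cases[i].name);
            failures++;
        }
    }
    return failures ? 1 : 0;
}
```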
I ended up loading UTF-8-test.txt line by line and comparing the result with the expected outcome, which I hardcoded in a map of line number -> ok/fail.
This works, but I would also like some cases for incomplete UTF-8 characters, buffer overruns, etc. So if you know of an existing test suite (even one that is not directly reusable but could serve as inspiration), please post a link here.
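For the record, the harness has roughly this shape (a sketch only; `utf8_is_valid` stands in for my validator, and the expected-result lookup below is a placeholder, not the real hardcoded map):

```c
#define _POSIX_C_SOURCE 200809L
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* The validator under test; hypothetical signature, provided elsewhere. */
bool utf8_is_valid(const unsigned char *buf, size_t len);

/* Placeholder for the hardcoded line -> ok/fail map; the real table is
 * filled in by hand from UTF-8-test.txt. */
static bool expected_ok(long line)
{
    (void)line;
    return true;
}

int main(void)
{
    FILE *f = fopen("UTF-8-test.txt", "rb");
    if (!f) { perror("UTF-8-test.txt"); return 2; }

    char *buf = NULL;
    size_t cap = 0;
    ssize_t n;
    long line = 0;
    int failures = 0;

    /* getline (POSIX) reports the byte count, so lines containing
     * embedded NUL bytes are measured correctly. */
    while ((n = getline(&buf, &cap, f)) != -1) {
        line++;
        size_t len = (size_t)n;
        if (len > 0 && buf[len - 1] == '\n')   /* strip the trailing newline */
            len--;
        if (utf8_is_valid((const unsigned char *)buf, len) != expected_ok(line)) {
            fprintf(stderr, "line %ld: unexpected result\n", line);
            failures++;
        }
    }
    free(buf);
    fclose(f);
    return failures ? 1 : 0;
}
```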