4

I wrote a UTF-8 string validator function.

The function takes a buffer of bytes and its length in UTF-8 characters, and validates that the buffer consists of exactly the given number of valid UTF-8 characters.

If the buffer is too short or too long, or if it contains invalid UTF-8 characters, validation fails.
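To make the contract concrete, here is a minimal reference sketch in Python. It is not my actual implementation: the name `validate_utf8` is made up for illustration, and it assumes "characters" means Unicode code points. It leans on Python's strict UTF-8 codec, so it could also serve as an oracle to compare a hand-written validator against.

```python
def validate_utf8(buf: bytes, expected_chars: int) -> bool:
    """Return True if buf is well-formed UTF-8 holding exactly expected_chars code points.

    Hypothetical reference helper; "characters" is assumed to mean code points.
    """
    try:
        # The strict codec rejects overlong forms, surrogates, and values above U+10FFFF.
        text = buf.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    # A decoded str counts code points, so len() gives the character count directly.
    return len(text) == expected_chars
```

With this sketch, `validate_utf8(b"ABC", 3)` passes, while `validate_utf8(b"ABC", 2)` fails because the declared count does not match the buffer.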

Now I want to write automated tests for my validator.

Is there a data-set that I can reuse?

I've found this file: http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt, but it does not seem to suit my purposes well; as I understand it, it is meant more for visualization tests.

Any clues?

asked Apr 1, 2011 at 21:58
1
  • 2
    The link you mention at www.cl.cam.ac.uk, which I already know, is pretty old and does not conform to the current Unicode recommendations for error handling and use of the substitution character, as described in Unicode §3.9 (see the maximal subparts of ill-formed sequences and their replacement with U+FFFD). As a test file, it is therefore not well suited to testing a fully conformant UTF-8 decoder. It's OK, but it does not follow some of the Unicode recommendations. Commented Apr 5, 2013 at 19:12

2 Answers

2
  • Valid UTF-8 data, to see that it passes
    • Strings containing characters needing 1 code unit, 2, 3, and 4! (Don't just test "ABC" or "café")
  • Clearly invalid data, say some ISO-8859-1 string (that isn't also valid UTF-8)
  • A string containing overlong forms (A 1-byte character encoded as 2, for example.) These should not pass as UTF-8
  • A string containing code points above U+10FFFF
  • Everything listed here: http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences

Depending on how good your code is:

  • Catching a UTF-8 string that encodes anything from U+D800 to U+DFFF (the surrogate code points, which should never be present in a UTF-8 string)
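Overlong forms, surrogates, and out-of-range values are awkward to type by hand, so it can help to generate them. The following is only a sketch: `encode_forced` is a hypothetical helper (not from any library) that packs a code point into a chosen number of bytes using the UTF-8 bit layout, deliberately without any validity checks.

```python
def encode_forced(cp: int, nbytes: int) -> bytes:
    """Pack cp into exactly nbytes (2 to 4) using the UTF-8 bit layout, with no validity checks."""
    if nbytes not in (2, 3, 4):
        raise ValueError("nbytes must be 2, 3, or 4")
    lead_marker = {2: 0xC0, 3: 0xE0, 4: 0xF0}[nbytes]
    lead_mask = {2: 0x1F, 3: 0x0F, 4: 0x07}[nbytes]
    tail = []
    for _ in range(nbytes - 1):
        # Each continuation byte carries the next 6 low bits, tagged with 10xxxxxx.
        tail.append(0x80 | (cp & 0x3F))
        cp >>= 6
    # Whatever bits remain go into the lead byte.
    return bytes([lead_marker | (cp & lead_mask)] + tail[::-1])


# Illustrative negative cases for the categories in the list above.
bad_cases = {
    "overlong 'C' in 2 bytes": encode_forced(0x43, 2),
    "overlong '/' in 3 bytes": encode_forced(0x2F, 3),
    "surrogate U+D800": encode_forced(0xD800, 3),
    "above U+10FFFF": encode_forced(0x110000, 4),
}

for label, raw in bad_cases.items():
    print(f"{label}: {raw.hex(' ')}")
```

Every sequence in `bad_cases` should be rejected by a conforming validator. The same helper produces valid encodings when the byte count is the shortest one that fits, which makes it easy to pair each bad case with a good neighbour.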

Concrete test cases for the categories above:

Should pass: "ABC"              41 42 43
Should pass: "ABÇ"              41 42 c3 87
Should pass: "ABḈ"              41 42 e1 b8 88
Should pass: "AB𝜍"              41 42 f0 9d 9c 8d
Should fail: Bad data           80 81 82 83
Should fail: Bad data           C2 C3
Should fail: Overlong           C1 83
Should fail: encodes U+140000   F5 80 80 80
Should fail: encodes U+110000   F4 90 80 80
Should fail: encodes U+D800     ED A0 80

(I've only sorta checked these, so double-triple check me if you get unexpected results.)
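As a sketch of how these cases could be wired into an automated test, here is a small table-driven harness in Python. The names `CASES`, `run_cases`, and `stand_in_validator` are hypothetical, the declared character counts are assumptions (the question's validator takes one), and the stand-in uses Python's strict codec only so the example runs; swap in a call to the validator under test. The table also carries two extra rows that are not in the list above: a true two-byte overlong encoding of "C" (C1 83) and a length mismatch.

```python
CASES = [
    # (label, bytes as hex, declared character count, expected result)
    ("ASCII", "41 42 43", 3, True),
    ("2-byte character", "41 42 c3 87", 3, True),
    ("3-byte character", "41 42 e1 b8 88", 3, True),
    ("4-byte character", "41 42 f0 9d 9c 8d", 3, True),
    ("stray continuation bytes", "80 81 82 83", 4, False),
    ("lead bytes without continuations", "C2 C3", 2, False),
    ("invalid lead byte C0", "C0 43", 2, False),
    ("overlong 'C' (added case)", "C1 83", 1, False),
    ("encodes U+140000", "F5 80 80 80", 1, False),
    ("encodes U+110000", "F4 90 80 80", 1, False),
    ("encodes U+D800", "ED A0 80", 1, False),
    ("declared length too short (added case)", "41 42 43", 2, False),
]


def run_cases(validate):
    """Return the labels of cases where validate(buf, nchars) disagrees with the table."""
    failures = []
    for label, hex_bytes, nchars, expected in CASES:
        buf = bytes.fromhex(hex_bytes)
        if bool(validate(buf, nchars)) != expected:
            failures.append(label)
    return failures


def stand_in_validator(buf, nchars):
    # Placeholder so the sketch runs; replace with the validator under test.
    try:
        return len(buf.decode("utf-8")) == nchars
    except UnicodeDecodeError:
        return False


print(run_cases(stand_in_validator))  # an empty list means every case agreed
```

Returning the list of disagreeing labels, rather than stopping at the first mismatch, makes it easier to see which whole category a validator gets wrong.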

answered Apr 2, 2011 at 23:26
0

I ended up loading UTF-8-test.txt line by line and comparing each result with the expected one (which I hardcoded in a map of line number -> ok/fail).

This works, but I would also like to get some cases for incomplete UTF-8 characters, buffer overruns, etc. So, if you know of an existing test suite (even one that is not reusable but can serve as inspiration), please post a link here.
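In the meantime, one way to get incomplete-character cases without an external suite is to cut a known-good string at every byte offset. This is only a sketch: `truncations` is a hypothetical helper, and the sample string is an arbitrary mix of 1-, 2-, 3-, and 4-byte characters.

```python
def truncations(sample: str):
    """Yield (prefix_bytes, should_pass) for every proper, non-empty prefix of sample's UTF-8 form."""
    data = sample.encode("utf-8")
    # Collect the byte offsets at which a whole character ends.
    boundaries = set()
    pos = 0
    for ch in sample:
        pos += len(ch.encode("utf-8"))
        boundaries.add(pos)
    for cut in range(1, len(data)):
        # A cut on a character boundary leaves valid UTF-8; any other cut leaves
        # an incomplete trailing character that a validator should reject.
        yield data[:cut], cut in boundaries


# "A", U+00C7, U+1E08, U+1D70D: one character of each encoded length.
for prefix, should_pass in truncations("A\u00c7\u1e08\U0001d70d"):
    print(prefix.hex(" "), "pass" if should_pass else "fail")
```

Feeding each prefix to the validator together with a character count also exercises the length check: a prefix that is valid on its own should still fail when the declared count is that of the full string.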

answered Apr 2, 2011 at 23:03
