I want to encode Unicode code points to UTF-8 manually. I wrote the following C# code and tested it with some cases I know, but I would like to know whether it is correct for all inputs. I know that Unicode code points are undefined beyond 0x10FFFF, but I don't care about that; therefore the output of my method may be more than 4 bytes.
private byte[] CodePointToUtf8(int codepoint)
{
    if (codepoint < 0x80) {
        return new byte[] {
            (byte)(codepoint)
        };
    } else if (codepoint < 0x800) {
        return new byte[] {
            (byte)(0xC0 | (codepoint << 21 >> 27)),
            (byte)(0x80 | (codepoint << 26 >> 26))
        };
    } else if (codepoint < 0x10000) {
        return new byte[] {
            (byte)(0xE0 | (codepoint << 16 >> 28)),
            (byte)(0x80 | (codepoint << 20 >> 26)),
            (byte)(0x80 | (codepoint << 26 >> 26))
        };
    } else if (codepoint < 0x200000) {
        return new byte[] {
            (byte)(0xF0 | (codepoint << 11 >> 29)),
            (byte)(0x80 | (codepoint << 14 >> 26)),
            (byte)(0x80 | (codepoint << 20 >> 26)),
            (byte)(0x80 | (codepoint << 26 >> 26))
        };
    } else if (codepoint < 0x4000000) {
        return new byte[] {
            (byte)(0xF8 | (codepoint << 6 >> 30)),
            (byte)(0x80 | (codepoint << 8 >> 26)),
            (byte)(0x80 | (codepoint << 14 >> 26)),
            (byte)(0x80 | (codepoint << 20 >> 26)),
            (byte)(0x80 | (codepoint << 26 >> 26))
        };
    } else {
        return new byte[] {
            (byte)(0xFC | (codepoint << 1 >> 31)),
            (byte)(0x80 | (codepoint << 2 >> 26)),
            (byte)(0x80 | (codepoint << 8 >> 26)),
            (byte)(0x80 | (codepoint << 14 >> 26)),
            (byte)(0x80 | (codepoint << 20 >> 26)),
            (byte)(0x80 | (codepoint << 26 >> 26))
        };
    }
}
Bonus question: Is there a built-in way to do that?
- Did you test with emojis? What you implemented is UTF-8-1993, but the current version is UTF-8-2003 (see Wikipedia), and the latter version encodes emojis differently. – Roland Illig, Dec 11, 2016 at 13:08
- I tested some emojis (e.g. U+1F600) and it worked for them. If I understand this correctly, UTF-8-2003 is included in UTF-8-1993, since 0x10FFFF (last code point in 2003) < 0x1FFFFF (last 4-byte code point in 1993). But maybe I got this completely wrong? – Eric, Dec 11, 2016 at 13:19
- Now that I think of it, I'm the one who is completely wrong. Sorry for the confusion. – Roland Illig, Dec 11, 2016 at 20:01
2 Answers
Yes, the code is correct for all valid code points. I was at first confused by the double shifts, since I had never seen them before, but they do their job. Other authors typically use a single >> followed by a bit mask, e.g. (codepoint >> 12) & 0x3F, to skip the 12 bits to the right and take the next 6 bits. That way the numbers are smaller and can be verified more easily. Plus, all the 10xxxxxx continuation bytes use the same bitmask.
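For comparison, here is a sketch of how the three-byte branch is usually written in that style (illustrative only, not the OP's exact expressions; the lead byte keeps 4 payload bits, each continuation byte 6):

    return new byte[] {
        (byte)(0xE0 | ((codepoint >> 12) & 0x0F)), // 1110xxxx: top 4 bits
        (byte)(0x80 | ((codepoint >> 6) & 0x3F)),  // 10xxxxxx: middle 6 bits
        (byte)(0x80 | (codepoint & 0x3F))          // 10xxxxxx: low 6 bits
    };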
Your code omits some validity checks:

- codepoint could be < 0
- codepoint could be between 0xD800 and 0xDFFF

Other than these, it is perfect.
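A minimal sketch of those two guards at the top of the method (the exception types and messages are my choice, not something from the original code):

    if (codepoint < 0)
        throw new ArgumentOutOfRangeException(nameof(codepoint), "Code point must be non-negative.");
    if (codepoint >= 0xD800 && codepoint <= 0xDFFF)
        throw new ArgumentException("UTF-16 surrogate code points are not valid in UTF-8.", nameof(codepoint));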
I know for sure that this conversion is built into C#, I just don't know where. Try loading a file into a string using the UTF-8 encoding; during that load, the built-in conversion code gets called.
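One way to reach it without going through a file (my sketch; char.ConvertFromUtf32 validates the code point and expands values above 0xFFFF into a surrogate pair before Encoding.UTF8 produces the bytes):

    // e.g. codepoint = 0x1F600 yields F0 9F 98 80
    byte[] utf8 = System.Text.Encoding.UTF8.GetBytes(char.ConvertFromUtf32(codepoint));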
- A few moments ago I found out that, at least in Unity3D where this code was supposed to run, the right shift >> is the arithmetic right shift, which means that a negative number shifted right is padded with 1s instead of 0s. Because of this I'm no longer sure my code is correct. – Eric, Dec 12, 2016 at 20:14
- To fix this, you should declare codepoint as unsigned int. This also frees you from checking whether codepoint < 0. – Roland Illig, Dec 12, 2016 at 23:17
- Oh, and since the codepoint < 0x80 branch doesn't contain any shifts, your code is fine right now. Just by coincidence, but nevertheless. – Roland Illig, Dec 12, 2016 at 23:18
UTF-8 Verification in .NET
Bonus question: Is there a built-in way to do that?
There is a built-in way to encode a Unicode code point to UTF-8. I have checked some of the results against the UTF-8 specification (2003), and I believe this method complies with it. Another interesting link is the UTF8Encoding reference source, to see how this encoding works.
private byte[] CodePointToUtf8_BuiltIn(int codepoint)
{
    // Note: the (char) cast only covers code points that fit in a single
    // UTF-16 code unit; anything above 0xFFFF would need a surrogate pair.
    return new UTF8Encoding(true).GetBytes(new[] { (char)codepoint });
}
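For instance (an illustrative call, not taken from the original post):

    var euro = CodePointToUtf8_BuiltIn(0x20AC); // { 0xE2, 0x82, 0xAC }, the UTF-8 encoding of '€'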
If we loop through the code points and filter out surrogates, we get some discrepancies between your algorithm and the built-in one.
internal const char HIGH_SURROGATE_START = '\ud800';
internal const char HIGH_SURROGATE_END = '\udbff';
internal const char LOW_SURROGATE_START = '\udc00';
internal const char LOW_SURROGATE_END = '\udfff';
for (int i = 0; i <= 0x10FFFF; i++)
{
    if (i >= HIGH_SURROGATE_START && i <= HIGH_SURROGATE_END) continue;
    if (i >= LOW_SURROGATE_START && i <= LOW_SURROGATE_END) continue;

    var op = CodePointToUtf8(i);
    var net = CodePointToUtf8_BuiltIn(i);
    CollectionAssert.AreEqual(net, op);
}
Here's a way to display the differences (inside the loop, for the code points where the byte arrays differ):
var builder = new StringBuilder();
builder.AppendLine("0x" + i.ToString("X4"));
builder.AppendLine(string.Join(" - ", op.Select(x => Convert.ToString(x, 2).PadLeft(8, '0'))));
builder.AppendLine(string.Join(" - ", net.Select(x => Convert.ToString(x, 2).PadLeft(8, '0'))));
var text = builder.ToString();
And some of the differences (first line: your method, second line: the built-in one):
0x00A0
11000010 - 11100000
11000010 - 10100000
0x0400
11110000 - 10000000
11010000 - 10000000
0x0720
11111100 - 11100000
11011100 - 10100000
..
Could you explain the differences?