I want to encode Unicode code points to UTF-8 manually. I wrote the following C# code and tested it with some cases I know, but I would like to know whether it is correct for all inputs. I know that Unicode code points are undefined beyond 0x10FFFF, but I don't care about that; therefore the output of my method may be more than 4 bytes.
private byte[] CodePointToUtf8(int codepoint)
{
    if (codepoint < 0x80) {
        return new byte[] {
            (byte)(codepoint)
        };
    } else if (codepoint < 0x800) {
        return new byte[] {
            (byte)(0xC0 | (codepoint << 21 >> 27)),
            (byte)(0x80 | (codepoint << 26 >> 26))
        };
    } else if (codepoint < 0x10000) {
        return new byte[] {
            (byte)(0xE0 | (codepoint << 16 >> 28)),
            (byte)(0x80 | (codepoint << 20 >> 26)),
            (byte)(0x80 | (codepoint << 26 >> 26))
        };
    } else if (codepoint < 0x200000) {
        return new byte[] {
            (byte)(0xF0 | (codepoint << 11 >> 29)),
            (byte)(0x80 | (codepoint << 14 >> 26)),
            (byte)(0x80 | (codepoint << 20 >> 26)),
            (byte)(0x80 | (codepoint << 26 >> 26))
        };
    } else if (codepoint < 0x4000000) {
        return new byte[] {
            (byte)(0xF8 | (codepoint << 6 >> 30)),
            (byte)(0x80 | (codepoint << 8 >> 26)),
            (byte)(0x80 | (codepoint << 14 >> 26)),
            (byte)(0x80 | (codepoint << 20 >> 26)),
            (byte)(0x80 | (codepoint << 26 >> 26))
        };
    } else {
        return new byte[] {
            (byte)(0xFC | (codepoint << 1 >> 31)),
            (byte)(0x80 | (codepoint << 2 >> 26)),
            (byte)(0x80 | (codepoint << 8 >> 26)),
            (byte)(0x80 | (codepoint << 14 >> 26)),
            (byte)(0x80 | (codepoint << 20 >> 26)),
            (byte)(0x80 | (codepoint << 26 >> 26))
        };
    }
}
Bonus question: Is there a built-in way to do that?
- Did you test with emojis? What you implemented is UTF-8-1993, but the current version is UTF-8-2003 (see Wikipedia), and the latter version encodes emojis differently. – Roland Illig, Dec 11, 2016 at 13:08
- I tested some emojis (e.g. U+1F600) and it worked for them. If I understand this correctly, UTF-8-2003 is included in UTF-8-1993, since 0x10FFFF (last code point in 2003) < 0x1FFFFF (last 4-byte code point in 1993). But maybe I got this completely wrong? – Eric, Dec 11, 2016 at 13:19
- Now that I think of it, I'm the one who is completely wrong. Sorry for the confusion. – Roland Illig, Dec 11, 2016 at 20:01
2 Answers
Yes, the code is correct for all valid code points. I was at first confused by the double shifts, since I had never seen them before, but they do their job. Other authors typically use a single >> followed by a bit mask, e.g. (codepoint >> 12) & 0x3F, to skip the 12 bits to the right and take the next 6 bits. That way the numbers are smaller and can be verified more easily. Plus, all the 10xxxxxx continuation bytes use the same bitmask.
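For comparison, here is a sketch of how the three-byte branch is usually written in that style (illustrative only, not the OP's exact expressions; the lead byte keeps 4 payload bits, each continuation byte 6):

    return new byte[] {
        (byte)(0xE0 | ((codepoint >> 12) & 0x0F)), // 1110xxxx: top 4 bits
        (byte)(0x80 | ((codepoint >> 6) & 0x3F)),  // 10xxxxxx: middle 6 bits
        (byte)(0x80 | (codepoint & 0x3F))          // 10xxxxxx: low 6 bits
    };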
Your code omits some validity checks:

- codepoint could be < 0
- codepoint could be between 0xD800 and 0xDFFF

Other than these, it is perfect.
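A minimal sketch of those two guards at the top of the method (the exception types and messages are my choice, not something from the original code):

    if (codepoint < 0)
        throw new ArgumentOutOfRangeException(nameof(codepoint), "Code point must be non-negative.");
    if (codepoint >= 0xD800 && codepoint <= 0xDFFF)
        throw new ArgumentException("UTF-16 surrogate code points are not valid in UTF-8.", nameof(codepoint));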
I know for sure that this conversion is built into C#, I just don't know where. Try loading a file into a string using the UTF-8 encoding; during that load, the built-in conversion code gets called.
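One way to reach it without going through a file (my sketch; char.ConvertFromUtf32 validates the code point and expands values above 0xFFFF into a surrogate pair before Encoding.UTF8 produces the bytes):

    // e.g. codepoint = 0x1F600 yields F0 9F 98 80
    byte[] utf8 = System.Text.Encoding.UTF8.GetBytes(char.ConvertFromUtf32(codepoint));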
- A few moments ago I found out that, at least in Unity3D where this code was supposed to run, the right shift >> is the arithmetic right shift, which means that a negative number shifted right is padded with 1s instead of 0s. Because of this I'm no longer sure my code is correct. – Eric, Dec 12, 2016 at 20:14
- To fix this, you should declare codepoint as unsigned int. This also frees you from checking whether codepoint < 0. – Roland Illig, Dec 12, 2016 at 23:17
- Oh, and since the codepoint < 0x80 branch doesn't contain any shifts, your code is fine right now. Just by coincidence, but nevertheless. – Roland Illig, Dec 12, 2016 at 23:18
UTF-8 Verification in .NET
Bonus question: Is there a built-in way to do that?
There is a built-in way to encode a Unicode code point to UTF-8. I have checked some of the results against the UTF-8 specification (2003), and I believe this method complies with it. Another interesting link is the UTF8Encoding reference source, to see how this encoding works.
private byte[] CodePointToUtf8_BuiltIn(int codepoint)
{
    // Note: the (char) cast only covers code points that fit in a single
    // UTF-16 code unit; anything above 0xFFFF would need a surrogate pair.
    return new UTF8Encoding(true).GetBytes(new[] { (char)codepoint });
}
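For instance (an illustrative call, not taken from the original post):

    var euro = CodePointToUtf8_BuiltIn(0x20AC); // { 0xE2, 0x82, 0xAC }, the UTF-8 encoding of '€'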
If we loop through the code points and filter out surrogates, we get some discrepancies between your algorithm and the built-in one.
internal const char HIGH_SURROGATE_START = '\ud800';
internal const char HIGH_SURROGATE_END = '\udbff';
internal const char LOW_SURROGATE_START = '\udc00';
internal const char LOW_SURROGATE_END = '\udfff';
for (int i = 0; i <= 0x10FFFF; i++)
{
    if (i >= HIGH_SURROGATE_START && i <= HIGH_SURROGATE_END) continue;
    if (i >= LOW_SURROGATE_START && i <= LOW_SURROGATE_END) continue;

    var op = CodePointToUtf8(i);
    var net = CodePointToUtf8_BuiltIn(i);
    CollectionAssert.AreEqual(net, op);
}
Here's a way to display the differences (inside the loop, for the code points where the byte arrays differ):
var builder = new StringBuilder();
builder.AppendLine("0x" + i.ToString("X4"));
builder.AppendLine(string.Join(" - ", op.Select(x => Convert.ToString(x, 2).PadLeft(8, '0'))));
builder.AppendLine(string.Join(" - ", net.Select(x => Convert.ToString(x, 2).PadLeft(8, '0'))));
var text = builder.ToString();
And some of the differences (first line: your method, second line: the built-in one):
0x00A0
11000010 - 11100000
11000010 - 10100000
0x0400
11110000 - 10000000
11010000 - 10000000
0x0720
11111100 - 11100000
11011100 - 10100000
..
Could you explain the differences?