4
\$\begingroup\$

I want to encode Unicode code points to UTF-8 manually. I wrote the following C# code and tested it with some cases I know, but I would like to know whether it is correct for all inputs. I know that Unicode code points are undefined beyond 0x10FFFF, but I don't care about that, so the output of my method might be more than 4 bytes.

private byte[] CodePointToUtf8 (int codepoint)
{
 if (codepoint < 0x80) {
 return new byte[]{ 
 (byte)(codepoint) 
 };
 } else if (codepoint < 0x800) { 
 return new byte[]{ 
 (byte)(0xC0 | (codepoint << 21 >> 27)), 
 (byte)(0x80 | (codepoint << 26 >> 26))
 };
 } else if (codepoint < 0x10000) {
 return new byte[] {
 (byte)(0xE0 | (codepoint << 16 >> 28)),
 (byte)(0x80 | (codepoint << 20 >> 26)) ,
 (byte)(0x80 | (codepoint << 26 >> 26))
 };
 } else if (codepoint < 0x200000) {
 return new byte[] {
 (byte)(0xF0 | (codepoint << 11 >> 29)),
 (byte)(0x80 | (codepoint << 14 >> 26)),
 (byte)(0x80 | (codepoint << 20 >> 26)) ,
 (byte)(0x80 | (codepoint << 26 >> 26))
 };
 } else if (codepoint < 0x4000000) {
 return new byte[] {
 (byte)(0xF8 | (codepoint << 6 >> 30)),
 (byte)(0x80 | (codepoint << 8 >> 26)),
 (byte)(0x80 | (codepoint << 14 >> 26)),
 (byte)(0x80 | (codepoint << 20 >> 26)) ,
 (byte)(0x80 | (codepoint << 26 >> 26))
 };
 } else {
 return new byte[] {
 (byte)(0xFC | (codepoint << 1 >> 31)),
 (byte)(0x80 | (codepoint << 2 >> 26)),
 (byte)(0x80 | (codepoint << 8 >> 26)),
 (byte)(0x80 | (codepoint << 14 >> 26)),
 (byte)(0x80 | (codepoint << 20 >> 26)) ,
 (byte)(0x80 | (codepoint << 26 >> 26))
 };
 }
}

Bonus question: Is there a built-in way to do that?

t3chb0t
asked Dec 11, 2016 at 13:02
\$\endgroup\$
3
  • 1
    \$\begingroup\$ Did you test with emojis? What you implemented is UTF-8-1993, but the current version is UTF-8-2003 (see Wikipedia), and the latter version encodes emojis differently. \$\endgroup\$ Commented Dec 11, 2016 at 13:08
  • 1
    \$\begingroup\$ I tested some emojis (e.g. U+1F600) and it worked for them. If I understand this correctly, UTF-8-2003 is included in UTF-8-1993, since 0x10FFFF (last code point in 2003) < 0x1FFFFF (last 4-byte code point in 1993). But maybe I got this completely wrong? \$\endgroup\$ Commented Dec 11, 2016 at 13:19
  • 1
    \$\begingroup\$ Now that I think of it, I'm the one who is completely wrong. Sorry for the confusion. \$\endgroup\$ Commented Dec 11, 2016 at 20:01

2 Answers

3
\$\begingroup\$

Yes, the code is correct for all valid code points. I was confused by the double shifts at first, since I had never seen them before, but they do their job well. Other authors typically use a single >> followed by a bit mask, e.g. (codepoint >> 12) & 0x3F to skip the 12 bits to the right and take the next 6 bits. That way the numbers are smaller and easier to verify. Plus, all the 10xxxxxx continuation bytes use the same bit mask.
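As a minimal sketch of the single-shift-plus-mask style, limited to the 1 to 4 byte forms and without any validity checks (the names `Utf8Sketch` and `Encode` are hypothetical, not from the question):

```csharp
public static class Utf8Sketch
{
    // Hypothetical helper: one right shift plus a 6-bit mask per continuation byte.
    // Assumes 0 <= codepoint <= 0x10FFFF; surrogates are not rejected here.
    public static byte[] Encode(int codepoint)
    {
        if (codepoint < 0x80)
        {
            return new[] { (byte)codepoint };
        }
        if (codepoint < 0x800)
        {
            return new[]
            {
                (byte)(0xC0 | (codepoint >> 6)),            // top 5 bits
                (byte)(0x80 | (codepoint & 0x3F))           // low 6 bits
            };
        }
        if (codepoint < 0x10000)
        {
            return new[]
            {
                (byte)(0xE0 | (codepoint >> 12)),           // top 4 bits
                (byte)(0x80 | ((codepoint >> 6) & 0x3F)),   // middle 6 bits
                (byte)(0x80 | (codepoint & 0x3F))           // low 6 bits
            };
        }
        return new[]
        {
            (byte)(0xF0 | (codepoint >> 18)),               // top 3 bits
            (byte)(0x80 | ((codepoint >> 12) & 0x3F)),
            (byte)(0x80 | ((codepoint >> 6) & 0x3F)),
            (byte)(0x80 | (codepoint & 0x3F))
        };
    }
}
```

Because every continuation byte is masked with 0x3F, no bits above the low six can leak into it, regardless of how the shift treats the sign bit.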

Your code omits some validity checks:

  • codepoint could be < 0
  • codepoint could be between 0xD800 and 0xDFFF

Other than these, it is perfect.
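The two missing checks could be sketched as a small guard called before encoding. The names `CodePointChecks` and `IsScalarValue` are hypothetical, and the upper bound of 0x10FFFF is an extra assumption that the question deliberately chose not to enforce:

```csharp
public static class CodePointChecks
{
    // Hypothetical guard: true only for values that UTF-8 is allowed to encode.
    public static bool IsScalarValue(int codepoint)
    {
        // Negative values are not code points at all.
        if (codepoint < 0) return false;

        // Assumption: enforce the Unicode upper limit. Drop this line to
        // keep the question's behavior for larger values.
        if (codepoint > 0x10FFFF) return false;

        // 0xD800..0xDFFF are UTF-16 surrogates and must not be encoded on their own.
        if (codepoint >= 0xD800 && codepoint <= 0xDFFF) return false;

        return true;
    }
}
```

A caller might throw an ArgumentOutOfRangeException when the guard returns false, or substitute U+FFFD, depending on how strict the encoder should be.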

I know for sure that this conversion is built into C#; I just don't know where. Try loading a file into a string using the UTF-8 encoding. During that loading, the built-in conversion code gets called.

answered Dec 11, 2016 at 20:25
\$\endgroup\$
3
  • \$\begingroup\$ A few moments ago I found out that, at least in Unity3D, where this code is supposed to run, the right shift >> is an arithmetic right shift, which means that a negative number shifted right is padded with 1s instead of 0s. Because of this I'm no longer sure whether my code is correct. \$\endgroup\$ Commented Dec 12, 2016 at 20:14
  • 1
    \$\begingroup\$ To fix this, you should declare codepoint as unsigned int. This also frees you from checking whether codepoint < 0. \$\endgroup\$ Commented Dec 12, 2016 at 23:17
  • \$\begingroup\$ Oh, and since the codepoint < 0x80 branch doesn't contain any shifts, your code is fine right now. Just by coincidence, but nevertheless. \$\endgroup\$ Commented Dec 12, 2016 at 23:18
1
\$\begingroup\$

UTF-8 Verification .NET

Bonus question: Is there a built-in way to do that?

There is a built-in way to encode a Unicode code point to UTF-8. I have checked some of the results against the UTF-8 Specification 2003, and I believe this method complies with it. Another interesting link is the UTF8Encoding Reference Source, which shows how this encoding works.

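private byte[] CodePointToUtf8_BuiltIn(int codepoint)
{
 // char.ConvertFromUtf32 yields a surrogate pair above 0xFFFF; a plain (char) cast would truncate.
 // It throws ArgumentOutOfRangeException for surrogates and for values beyond 0x10FFFF.
 return new UTF8Encoding(true).GetBytes(char.ConvertFromUtf32(codepoint));
}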

If we loop through the code points and filter out surrogates, we get some discrepancies between your algorithm and the built-in one.

internal const char HIGH_SURROGATE_START = '\ud800';
internal const char HIGH_SURROGATE_END = '\udbff';
internal const char LOW_SURROGATE_START = '\udc00';
internal const char LOW_SURROGATE_END = '\udfff'; 
for (int i = 0; i <= 0x10FFFF; i++)
{
 if (i >= HIGH_SURROGATE_START && i <= HIGH_SURROGATE_END) continue;
 if (i >= LOW_SURROGATE_START && i <= LOW_SURROGATE_END) continue;
 var op = CodePointToUtf8(i);
 var net = CodePointToUtf8_BuiltIn(i);
 CollectionAssert.AreEqual(net, op);
}

Here's a way to display the differences:

var builder = new StringBuilder();
builder.AppendLine("0x" + i.ToString("X4"));
builder.AppendLine(string.Join(" - ", op.Select(x => Convert.ToString(x, 2).PadLeft(8, '0'))));
builder.AppendLine(string.Join(" - ", net.Select(x => Convert.ToString(x, 2).PadLeft(8, '0'))));
var text = builder.ToString();

And some of the differences (code point, then your output, then the built-in output):

0x00A0
11000010 - 11100000
11000010 - 10100000
0x0400
11110000 - 10000000
11010000 - 10000000
0x0720
11111100 - 11100000
11011100 - 10100000
..

Could you explain the differences?
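A likely explanation, in line with the comment thread under the first answer, is that >> on a signed int is an arithmetic shift in C#: whenever the left shift moves a 1 into bit 31, the following right shift fills the top bits with 1s instead of 0s. The sketch below reproduces the second byte for 0x00A0 with both int and uint (the names `ShiftDemo`, `LowSixSigned` and `LowSixUnsigned` are hypothetical):

```csharp
public static class ShiftDemo
{
    // Same expression as in the question, on a signed int.
    // If bit 5 of the code point is set, << 26 moves it into the sign bit,
    // and >> 26 then sign-extends, so bits 6 and 7 of the result are set too.
    public static byte LowSixSigned(int codepoint)
    {
        return unchecked((byte)(0x80 | (codepoint << 26 >> 26)));
    }

    // On uint, >> is a logical shift and fills with 0s,
    // so only the low six bits survive.
    public static byte LowSixUnsigned(uint codepoint)
    {
        return unchecked((byte)(0x80u | (codepoint << 26 >> 26)));
    }
}
```

For 0xA0, `LowSixSigned` gives 0xE0 (11100000), which matches the second byte in the listing, while `LowSixUnsigned` gives 0xA0 (10100000), which matches the built-in result. The same effect on the leading byte would account for 0x0400 and 0x0720.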

answered Jul 14, 2019 at 11:30
\$\endgroup\$
