\$\begingroup\$

In my C# code, I need to generate a string (from a longer string) that, when UTF-8 encoded, is no longer than a given maximum length in bytes.

 public static string OfMaxBytes(this string str, int maxByteLength)
 {
     return str.Aggregate("", (s, c) =>
     {
         if (Encoding.UTF8.GetByteCount(s + c) <= maxByteLength)
         {
             s += c;
         }
         return s;
     });
 }

Usage looks like:

 var shortName = longName.OfMaxBytes(32);

Does this look like a correct and sensible implementation?

asked Jun 24, 2014 at 9:41
\$\endgroup\$

1 Answer

\$\begingroup\$

It depends what you mean by correct. Consider this program:

const string Input = "a\u0304\u0308bc\u0327";
var bytes = Encoding.UTF8.GetByteCount(Input);
Console.WriteLine("{0} ({1} bytes in UTF-8)", Input, bytes);
for (var i = 0; i <= bytes; i++)
{
    var result = Input.OfMaxBytes(i);
    Console.WriteLine("{0} \"{1}\" {2}", i, result, Input.StartsWith(result, StringComparison.Ordinal));
}

Here is what your solution gives:

ā̈bç (9 bytes in UTF-8)
0 "" True
1 "a" True
2 "ab" False
3 "ā" True
4 "āb" False
5 "ā̈" True
6 "ā̈b" True
7 "ā̈bc" True
8 "ā̈bc" True
9 "ā̈bç" True

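Granted, you might not come across such input very often, but I don't think that's the result you want. At limits 2 and 4, a character that doesn't fit is skipped while a later, smaller one is still appended, so the result isn't even a prefix of the input. At limits 1, 3, 7 and 8, a base character is cut off from some or all of its combining marks.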
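To see where the safe cut points are, it can help to list the input's text elements (a base character plus its combining marks) together with their UTF-8 sizes. This is only a minimal sketch: the class `TextElementDemo` and the helper `TextElementSizes` are hypothetical names made up for illustration, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class TextElementDemo
{
    // Hypothetical helper: returns the UTF-8 byte count of each text element
    // (base character plus any combining marks), in input order.
    public static List<int> TextElementSizes(string input)
    {
        var sizes = new List<int>();
        var enumerator = StringInfo.GetTextElementEnumerator(input);
        while (enumerator.MoveNext())
        {
            sizes.Add(Encoding.UTF8.GetByteCount(enumerator.GetTextElement()));
        }
        return sizes;
    }
}
```

For the input above, `TextElementDemo.TextElementSizes("a\u0304\u0308bc\u0327")` reports three elements of 5, 1 and 3 bytes, even though the string contains six chars. A truncation that respects these elements can only end after 0, 5, 6 or 9 bytes.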

Two other points:

  • You are building up a lot of intermediate strings. When you catch yourself doing that, see if you can use a StringBuilder instead.
  • You are iterating through the entire string, even after the byte limit has been reached, and re-encoding the whole accumulated string on every step.

Here is what I would suggest:

// Requires: using System.Globalization; using System.Text;
public static string OfMaxBytes(this string input, int maxBytes)
{
    if (maxBytes == 0 || string.IsNullOrEmpty(input))
    {
        return string.Empty;
    }
    var encoding = Encoding.UTF8;
    // Fast path: the whole string already fits.
    if (encoding.GetByteCount(input) <= maxBytes)
    {
        return input;
    }
    var sb = new StringBuilder();
    var bytes = 0;
    // Walk text elements (base character plus combining marks), not chars.
    var enumerator = StringInfo.GetTextElementEnumerator(input);
    while (enumerator.MoveNext())
    {
        var textElement = enumerator.GetTextElement();
        bytes += encoding.GetByteCount(textElement);
        if (bytes <= maxBytes)
        {
            sb.Append(textElement);
        }
        else
        {
            // Stop at the first element that does not fit.
            break;
        }
    }
    return sb.ToString();
}

Which gives this output:

ā̈bç (9 bytes in UTF-8)
0 "" True
1 "" True
2 "" True
3 "" True
4 "" True
5 "ā̈" True
6 "ā̈b" True
7 "ā̈b" True
8 "ā̈b" True
9 "ā̈bç" True
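Text elements also keep surrogate pairs together, which matters for characters outside the Basic Multilingual Plane. Here is a small sketch of how the three counts differ for a single emoji. The class `SurrogateDemo` and its members are made-up names for illustration only.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class SurrogateDemo
{
    // U+1F600 lies outside the BMP, so .NET stores it as two chars (a surrogate pair).
    public const string Emoji = "\U0001F600";

    // Number of UTF-16 code units (what iterating over chars sees).
    public static int CharCount => Emoji.Length;

    // Number of text elements (what the enumerator in OfMaxBytes sees).
    public static int TextElementCount => new StringInfo(Emoji).LengthInTextElements;

    // Number of bytes once encoded as UTF-8.
    public static int Utf8Bytes => Encoding.UTF8.GetByteCount(Emoji);
}
```

Here `CharCount` is 2, `TextElementCount` is 1 and `Utf8Bytes` is 4. A char-by-char loop could therefore cut the pair in half, whereas a text-element loop either keeps the whole character or drops it.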
answered Jun 24, 2014 at 10:49
\$\endgroup\$
  • Thanks. I've taken your solution over mine. I naively assumed a .NET char and UTF-8 grapheme were the same thing. Commented Jun 24, 2014 at 11:29
  • That's an understandable assumption. There's more info on MSDN. Commented Jun 24, 2014 at 11:35
