Method to return a string of max length (in bytes vs. characters)

Question 1

In my C# code, I need to generate a string (from a longer string) that, when UTF-8 encoded, is no longer than a given maximum length in bytes.

    public static string OfMaxBytes(this string str, int maxByteLength)
    {
        return str.Aggregate("", (s, c) =>
        {
            if (Encoding.UTF8.GetByteCount(s + c) <= maxByteLength)
            {
                s += c;
            }
            return s;
        });
    }

Usage looks like:

    var shortName = longName.OfMaxBytes(32);

Does this look like a correct and sensible implementation?

Question 2

It depends on what you mean by correct. Consider this program:

const string Input = "a\u0304\u0308bc\u0327";
var bytes = Encoding.UTF8.GetByteCount(Input);
Console.WriteLine("{0} ({1} bytes in UTF-8)", Input, bytes);
for (var i = 0; i <= bytes; i++)
{
    var result = Input.OfMaxBytes(i);
    Console.WriteLine("{0} \"{1}\" {2}", i, result, Input.StartsWith(result, StringComparison.Ordinal));
}

Here is what your solution gives:

ā̈bç (9 bytes in UTF-8)
0 "" True
1 "a" True
2 "ab" False
3 "ā" True
4 "āb" False
5 "ā̈" True
6 "ā̈b" True
7 "ā̈bc" True
8 "ā̈bc" True
9 "ā̈bç" True

Granted, you might not come across such an input very often, but I don't think that's the result you want.
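
To see where a result like "ab" comes from, it can help to replay the per-char decisions. The following is only a sketch: `AggregateTrace` and `Trace` are names I made up for illustration, and the loop is meant to mirror the lambda in your `Aggregate` call rather than replace it.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class AggregateTrace
{
    // Hypothetical helper: applies the same "append the char if it still fits"
    // rule as the question's OfMaxBytes and records each decision.
    public static List<string> Trace(string input, int maxByteLength)
    {
        var log = new List<string>();
        var kept = "";
        foreach (var c in input)
        {
            var fits = Encoding.UTF8.GetByteCount(kept + c) <= maxByteLength;
            if (fits)
            {
                kept += c;
            }
            // Print the UTF-16 code unit in U+XXXX form next to the decision.
            log.Add(string.Format("U+{0:X4} {1}", (int)c, fits ? "kept" : "skipped"));
        }
        return log;
    }
}
```

For the input above and a limit of 2, this logs U+0061 kept, U+0304 skipped, U+0308 skipped, U+0062 kept, U+0063 skipped, U+0327 skipped. The two combining marks are dropped because each needs 2 bytes, but the loop carries on and the 1-byte "b" still fits, so the result is no longer a prefix of the input.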

Two other points:

You are building up a lot of intermediate strings. When you catch yourself doing that, see if you can use a StringBuilder instead.
You are iterating through the entire string, regardless of the value of maxByteLength.

Here is what I would suggest:

// Requires: using System.Globalization; using System.Text;
public static string OfMaxBytes(this string input, int maxBytes)
{
    if (maxBytes <= 0 || string.IsNullOrEmpty(input))
    {
        return string.Empty;
    }
    var encoding = Encoding.UTF8;
    // Fast path: the whole string already fits.
    if (encoding.GetByteCount(input) <= maxBytes)
    {
        return input;
    }
    var sb = new StringBuilder();
    var bytes = 0;
    // Walk text elements (a base char plus its combining marks), not chars.
    var enumerator = StringInfo.GetTextElementEnumerator(input);
    while (enumerator.MoveNext())
    {
        var textElement = enumerator.GetTextElement();
        bytes += encoding.GetByteCount(textElement);
        if (bytes <= maxBytes)
        {
            sb.Append(textElement);
        }
        else
        {
            // Stop at the first element that does not fit, so the result is a prefix.
            break;
        }
    }
    return sb.ToString();
}

Which gives this output:

ā̈bç (9 bytes in UTF-8)
0 "" True
1 "" True
2 "" True
3 "" True
4 "" True
5 "ā̈" True
6 "ā̈b" True
7 "ā̈b" True
8 "ā̈b" True
9 "ā̈bç" True
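
Since the result is always a prefix of the input, one possible variation is to skip the StringBuilder and find the cut position instead. This is a sketch under my own assumptions, not a tested drop-in: `StringByteLimitSketch` and `PrefixOfMaxBytes` are hypothetical names, and it relies on `StringInfo.ParseCombiningCharacters` to return the start index of each text element.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class StringByteLimitSketch
{
    // Hypothetical variant with the same contract as OfMaxBytes:
    // the longest prefix of whole text elements that fits in maxBytes.
    public static string PrefixOfMaxBytes(string input, int maxBytes)
    {
        if (maxBytes <= 0 || string.IsNullOrEmpty(input))
        {
            return string.Empty;
        }
        var encoding = Encoding.UTF8;
        var chars = input.ToCharArray();
        // Start index (in chars) of every text element in the input.
        int[] starts = StringInfo.ParseCombiningCharacters(input);
        var bytes = 0;
        for (var i = 0; i < starts.Length; i++)
        {
            var start = starts[i];
            var end = i + 1 < starts.Length ? starts[i + 1] : input.Length;
            bytes += encoding.GetByteCount(chars, start, end - start);
            if (bytes > maxBytes)
            {
                // This element does not fit: cut just before it.
                return input.Substring(0, start);
            }
        }
        return input;
    }
}
```

Whether this is actually faster would need measuring; the main difference is that it builds the result with a single Substring call.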

Question 3

Thanks. I've taken your solution over mine. I naively assumed a .NET char and a UTF-8 grapheme were the same thing.

Question 4

That's an understandable assumption. There's more info on MSDN.
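
As a quick illustration of the difference, the three "lengths" of the test string can be put side by side. This is just a sketch; `LengthComparison` and `Describe` are names invented for the example.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class LengthComparison
{
    // Hypothetical helper: reports a string's length in chars (UTF-16 code units),
    // in text elements as seen by StringInfo, and in UTF-8 bytes.
    public static string Describe(string s)
    {
        return string.Format(
            "{0} chars, {1} text elements, {2} UTF-8 bytes",
            s.Length,
            new StringInfo(s).LengthInTextElements,
            Encoding.UTF8.GetByteCount(s));
    }
}
```

For "a\u0304\u0308bc\u0327" this gives "6 chars, 3 text elements, 9 UTF-8 bytes": each combining mark is its own char, but it belongs to the text element of the letter before it.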

mjolka · Accepted Answer · 2014-06-24 10:49:10Z
