In my C# code, I need to generate a string (from a longer string) which, when UTF-8 encoded, is no longer than a given maximum length in bytes.
public static string OfMaxBytes(this string str, int maxByteLength)
{
    return str.Aggregate("", (s, c) =>
    {
        if (Encoding.UTF8.GetByteCount(s + c) <= maxByteLength)
        {
            s += c;
        }
        return s;
    });
}
Usage looks like:
var shortName = longName.OfMaxBytes(32);
Does this look like a correct and sensible implementation?
1 Answer
It depends what you mean by correct. Consider this program:
const string Input = "a\u0304\u0308bc\u0327";
var bytes = Encoding.UTF8.GetByteCount(Input);
Console.WriteLine("{0} ({1} bytes in UTF-8)", Input, bytes);
for (var i = 0; i <= bytes; i++)
{
    var result = Input.OfMaxBytes(i);
    Console.WriteLine("{0} \"{1}\" {2}", i, result, Input.StartsWith(result, StringComparison.Ordinal));
}
Here is what your solution gives:
ā̈bç (9 bytes in UTF-8)
0 "" True
1 "a" True
2 "ab" False
3 "ā" True
4 "āb" False
5 "ā̈" True
6 "ā̈b" True
7 "ā̈bc" True
8 "ā̈bc" True
9 "ā̈bç" True
Granted, you might not come across such an input very often, but I don't think that's the result you want.
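To see where those results come from, it may help to look at how the test string breaks down. The sketch below is only an illustration (the class and method names are made up for it): it lists each text element of the input with its length in chars and in UTF-8 bytes. The original lambda tests one char at a time and keeps going after a char fails to fit, which is how a later "b" can be appended after skipped combining marks ("ab" at 2 bytes), or a "c" can be kept without its cedilla.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class TextElementDemo
{
    // Hypothetical helper: returns (chars, UTF-8 bytes) for each text element.
    public static List<(int Chars, int Bytes)> TextElementSizes(string input)
    {
        var sizes = new List<(int Chars, int Bytes)>();
        var enumerator = StringInfo.GetTextElementEnumerator(input);
        while (enumerator.MoveNext())
        {
            // A text element is a base character plus any combining marks.
            var element = enumerator.GetTextElement();
            sizes.Add((element.Length, Encoding.UTF8.GetByteCount(element)));
        }
        return sizes;
    }

    public static void Main()
    {
        const string input = "a\u0304\u0308bc\u0327";
        Console.WriteLine("{0} chars in total", input.Length);
        foreach (var (chars, bytes) in TextElementSizes(input))
        {
            Console.WriteLine("{0} char(s), {1} byte(s)", chars, bytes);
        }
    }
}
```

For this input there are 6 chars but only 3 text elements, of 5, 1 and 3 bytes, so the only prefixes that keep every element whole are 0, 5, 6 and 9 bytes long.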
Two other points:
- You are building up a lot of intermediate strings. When you catch yourself doing that, see if you can use a StringBuilder instead.
- You are iterating through the entire string, regardless of the value of maxBytes.
Here is what I would suggest:
// Requires: using System.Globalization; using System.Text;
public static string OfMaxBytes(this string input, int maxBytes)
{
    if (maxBytes == 0 || string.IsNullOrEmpty(input))
    {
        return string.Empty;
    }
    var encoding = Encoding.UTF8;
    // Fast path: the whole string already fits.
    if (encoding.GetByteCount(input) <= maxBytes)
    {
        return input;
    }
    var sb = new StringBuilder();
    var bytes = 0;
    // Walk text elements (a base character plus its combining marks), not individual chars.
    var enumerator = StringInfo.GetTextElementEnumerator(input);
    while (enumerator.MoveNext())
    {
        var textElement = enumerator.GetTextElement();
        bytes += encoding.GetByteCount(textElement);
        if (bytes <= maxBytes)
        {
            sb.Append(textElement);
        }
        else
        {
            // Stop at the first text element that does not fit.
            break;
        }
    }
    return sb.ToString();
}
Which gives this output:
ā̈bç (9 bytes in UTF-8)
0 "" True
1 "" True
2 "" True
3 "" True
4 "" True
5 "ā̈" True
6 "ā̈b" True
7 "ā̈b" True
8 "ā̈b" True
9 "ā̈bç" True
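As a possible variation rather than a replacement, the same idea can be sketched without a StringBuilder by finding the cut position and taking a single Substring. This is a rough sketch under some assumptions: the name OfMaxBytesBySubstring is made up for illustration, and the span overload of GetByteCount used here assumes .NET Core 2.1 or later. StringInfo.ParseCombiningCharacters returns the starting char index of each text element.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class StringByteLimitSketch
{
    // Hypothetical variant of OfMaxBytes; not part of the answer's code.
    public static string OfMaxBytesBySubstring(this string input, int maxBytes)
    {
        if (maxBytes <= 0 || string.IsNullOrEmpty(input))
        {
            return string.Empty;
        }

        // Start index (in chars) of every text element in the input.
        int[] starts = StringInfo.ParseCombiningCharacters(input);
        var bytes = 0;
        var cut = 0; // number of chars to keep

        for (var i = 0; i < starts.Length; i++)
        {
            // A text element ends where the next one starts (or at the end of the string).
            var end = i + 1 < starts.Length ? starts[i + 1] : input.Length;
            bytes += Encoding.UTF8.GetByteCount(input.AsSpan(starts[i], end - starts[i]));
            if (bytes > maxBytes)
            {
                break; // this element does not fit, so stop before it
            }
            cut = end;
        }

        return input.Substring(0, cut);
    }
}
```

Called on the test string above, it should produce the same prefixes as the StringBuilder version, for example `"a\u0304\u0308bc\u0327".OfMaxBytesBySubstring(7)` keeps the first two text elements and drops the "c" with its cedilla.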
Thanks. I've taken your solution over mine. I naively assumed a .NET char and a UTF-8 grapheme were the same thing. – Darragh, Jun 24, 2014 at 11:29