
2 of 12

added 97 characters in body

edited Nov 26, 2015 at 16:18


The algorithm can't really be improved (or at least I see no reason to, apart from a minor micro-optimization around the List<string> usage, and keeping the list makes the code much more readable) and the code is pretty clear. However:

You're returning string[] but consumers may not need an array. List<T>.ToArray() can't simply return the list's internal array (its capacity is usually larger than its count), so it makes another (relatively expensive) copy. I'd simply return IEnumerable<string>: if the consumer needs an array it can call the LINQ ToArray() extension method (which may check for the underlying implementation); if it doesn't need an array then you've saved an Array.Copy().
You're partitioning over Char, but String is UTF-16 encoded, so you may produce broken strings in at least two cases:

The Unicode code point for a character is encoded as two UTF-16 code units (a surrogate pair, as for the Han character 𠀑, which lies outside the Basic Multilingual Plane). The two code units may end up in two different slices, and both resulting strings will be invalid.
You're dealing with a character made of two or more separate Unicode code points. This is more common than you may think (and not just when dealing with Korean text): think, for example, of Unicode combining characters such as U+0300 COMBINING GRAVE ACCENT, used to build à.
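Both failure modes are easy to reproduce. The following is only a minimal sketch: SplitPitfalls, TakeChars and Demo are hypothetical names standing in for any code that slices by Char, and U+20000 is used as a representative code point outside the Basic Multilingual Plane.

```csharp
using System;
using System.Globalization;

public static class SplitPitfalls
{
    // Hypothetical stand-in for Char-based slicing:
    // keeps the first `count` UTF-16 code units of `value`.
    public static string TakeChars(string value, int count)
    {
        return value.Substring(0, count);
    }

    public static void Demo()
    {
        // Case 1: one code point, two UTF-16 code units (a surrogate pair).
        string han = "\U00020000"; // a CJK Extension B ideograph
        Console.WriteLine(han.Length); // 2
        // Cutting after one Char leaves a lone high surrogate.
        Console.WriteLine(char.IsHighSurrogate(TakeChars(han, 1)[0])); // True

        // Case 2: one visible character, two code points.
        string accented = "a\u0300"; // 'a' + COMBINING GRAVE ACCENT
        Console.WriteLine(accented.Length); // 2
        Console.WriteLine(new StringInfo(accented).LengthInTextElements); // 1
        // Cutting after one Char silently drops the accent.
        Console.WriteLine(TakeChars(accented, 1) == "a"); // True
    }
}
```

In the first case the slice is not even valid UTF-16; in the second it is valid but no longer the text the user typed.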

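To address both issues I'd work on text elements rather than on Char values. As shorthand (TextElementEnumerator is an IEnumerator, not an IEnumerable, so ToArray() here stands for a small helper you'd have to write yourself):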

string[] characters = StringInfo.GetTextElementEnumerator(value).ToArray();

I'd then perform the subsequent processing over the characters array instead of over value. Note that this may produce strings of different Length (measured in Char units) but with the same number of characters.
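Put together, the idea could be sketched roughly as follows. This is not the code under review: TextElementChunker, ToTextElements and Chunk are hypothetical names, and the sketch assumes the goal is to cut a string into pieces of at most `size` characters. It walks the TextElementEnumerator by hand and returns IEnumerable<string>, as suggested above.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class TextElementChunker
{
    // Hypothetical helper: materializes the text elements of `value`.
    public static List<string> ToTextElements(string value)
    {
        var elements = new List<string>();
        TextElementEnumerator enumerator = StringInfo.GetTextElementEnumerator(value);
        while (enumerator.MoveNext())
        {
            elements.Add(enumerator.GetTextElement());
        }
        return elements;
    }

    // Hypothetical chunking method: yields pieces of at most `size` text elements.
    // Arguments are validated eagerly; the pieces themselves are produced lazily.
    public static IEnumerable<string> Chunk(string value, int size)
    {
        if (value == null) throw new ArgumentNullException(nameof(value));
        if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
        return ChunkIterator(value, size);
    }

    private static IEnumerable<string> ChunkIterator(string value, int size)
    {
        List<string> elements = ToTextElements(value);
        var builder = new StringBuilder();
        for (int i = 0; i < elements.Count; i++)
        {
            builder.Append(elements[i]);
            bool chunkIsFull = (i + 1) % size == 0;
            bool isLastElement = i == elements.Count - 1;
            if (chunkIsFull || isLastElement)
            {
                yield return builder.ToString();
                builder.Clear();
            }
        }
    }
}
```

For example, Chunk("a\u0300bc", 2) yields "a\u0300b" (three Char values, two text elements) and then "c". A consumer that really needs an array can still call ToArray() on the result.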

If you're processing US ASCII text then you can simply ignore these issues.

answered Nov 26, 2015 at 16:11

Adriano Repetti

