Motivation
I came across an interesting question on SO, determine-if-string-has-all-unique-characters, and thought about providing an extension method that removes duplicate characters from a string given some kind of normalization.
Description
The goals:
- Remove any duplicate characters (keep the first occurrence) and return the updated string.
- The method should be able to handle diacritics and extended Unicode characters.
- The method should allow the consumer to control the normalization used to find duplicates (for instance, case sensitivity).
Questions
- Looking for general feedback on C# conventions
- Performance feedback
- Am I reinventing the wheel?
Code
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class StringExtension
{
    public static string RemoveDuplicateChars(
        this string text, Func<string, string> normalizer = null)
    {
        var output = new StringBuilder();
        var entropy = new HashSet<string>();
        // Enumerate text elements (grapheme clusters) rather than chars,
        // so surrogate pairs and combining sequences stay intact.
        var iterator = StringInfo.GetTextElementEnumerator(text);
        if (normalizer == null)
        {
            normalizer = x => x.Normalize();
        }
        while (iterator.MoveNext())
        {
            var character = iterator.GetTextElement();
            // The normalized form is only used for duplicate detection;
            // the original element is what gets appended to the output.
            if (entropy.Add(normalizer(character)))
            {
                output.Append(character);
            }
        }
        return output.ToString();
    }
}
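A minimal usage sketch (the inputs here are illustrative, not taken from the tests below):

    // Default normalizer: canonical normalization only.
    Console.WriteLine("aabbcc".RemoveDuplicateChars());                        // "abc"
    // Custom normalizer: fold case when looking for duplicates.
    Console.WriteLine("AaBb".RemoveDuplicateChars(x => x.ToLowerInvariant())); // "AB"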
Unit Tests
Let's test a string that contains variations on the letter A, including the Angstrom sign Å. The Angstrom sign has Unicode codepoint U+212B, but can also be constructed as the letter A with the combining diacritic U+030A. Both represent the same character.
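A quick way to see that equivalence (a minimal sketch; string.Normalize() defaults to normalization form C):

    // Both encodings normalize to the precomposed letter Å (U+00C5).
    Console.WriteLine("\u212B".Normalize() == "A\u030A".Normalize()); // True
    Console.WriteLine("A\u030A".Normalize() == "\u00C5");             // True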
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class Fixtures
{
    [TestMethod]
    public void Fixture()
    {
        // ÅÅAaA -> ÅAa
        Assert.AreEqual("ÅAa", "\u212BA\u030AAaA"
            .RemoveDuplicateChars());

        // ÅÅAaA -> ÅA
        // Note that ToLowerInvariant is used to normalize characters
        // when searching for duplicates; it does not mean the output gets
        // transformed to lower case.
        Assert.AreEqual("ÅA", "\u212BA\u030AAaA"
            .RemoveDuplicateChars(x => x.Normalize().ToLowerInvariant()));
    }
}
1 Answer
There is not much to say other than the usual missing argument null check.
It is valid to write the following:
string test = null;
test.RemoveDuplicateChars();
and RemoveDuplicateChars will be called with null for the this argument text. Therefore you'll have to test for null:
public static string RemoveDuplicateChars(
    this string text, Func<string, string> normalizer = null)
{
    if (text == null)
        return text;
    ...
or throw an exception...
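A sketch of the throwing variant, using the conventional ArgumentNullException guard:

    public static string RemoveDuplicateChars(
        this string text, Func<string, string> normalizer = null)
    {
        if (text == null)
            throw new ArgumentNullException(nameof(text));
        ...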
The default initialization of normalizer could be a little less verbose:
normalizer = normalizer ?? ((x) => x.Normalize());
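Or, assuming C# 8 or later is available, the null-coalescing assignment operator is terser still:

    normalizer ??= x => x.Normalize();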
A minor detail about Angstrom and Å: the letter Å is also represented by \u00C5, which your code interprets as equal to the Angstrom sign, yet MS Word treats them as different when using its Find function.
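That difference is easy to reproduce; a minimal sketch (Word presumably compares raw code units, i.e. ordinally, rather than normalizing first):

    // Different code units, so an ordinal comparison sees two distinct strings...
    Console.WriteLine(string.Equals("\u00C5", "\u212B", StringComparison.Ordinal)); // False
    // ...but after canonical normalization they are identical.
    Console.WriteLine("\u00C5".Normalize() == "\u212B".Normalize());                // True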
- dfhwze (Sep 24, 2019): Your last point is particularly interesting. Both \u00C5 and \u212B are the Angstrom symbol. The lambda that you rewrote in a more compact way handles normalisation. This means that both these characters represent the same glyph, so normalized they are exactly the same: docs.microsoft.com/en-us/dotnet/api/…. I bet MS Word does not normalize, or normalizes using a different algorithm.
- user73941 (Sep 24, 2019): @dfhwze: My point about Angstrom was just to communicate the difference because I observed it, not to tell what is most correct, and I don't think that Word is the most reliable witness of truth :)
- dfhwze (Sep 24, 2019): That's the thing, it's hard to tell what is correct, because it depends on the rules against which equivalence is checked. That's why I found your last paragraph spot on. Finding duplicates in Unicode characters is context-bound, maybe even subjective to some point.