Motivation
I came across an interesting question on SO, determine-if-string-has-all-unique-characters, and thought about providing an extension method that removes duplicate characters from a string given some kind of normalization.
Description
The goals:
- Remove any duplicate characters (keep the first occurrence) and return the updated string.
- The method should be able to handle diacritics and extended Unicode characters.
- The method should allow the consumer to control the normalization used to find duplicates (for instance, case sensitivity).
Questions
- Looking for general feedback on C# conventions
- Performance feedback
- Am I reinventing the wheel?
Code
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text;

public static class StringExtension
{
    public static string RemoveDuplicateChars(
        this string text, Func<string, string> normalizer = null)
    {
        var output = new StringBuilder();
        var entropy = new HashSet<string>();
        // Enumerate text elements (grapheme clusters) rather than chars,
        // so surrogate pairs and combining sequences stay intact.
        var iterator = StringInfo.GetTextElementEnumerator(text);
        if (normalizer == null)
        {
            normalizer = x => x.Normalize();
        }
        while (iterator.MoveNext())
        {
            var character = iterator.GetTextElement();
            // The normalized form is only used for duplicate detection;
            // the original element is what gets appended to the output.
            if (entropy.Add(normalizer(character)))
            {
                output.Append(character);
            }
        }
        return output.ToString();
    }
}
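A minimal usage sketch (the inputs here are illustrative, not taken from the tests below):

    // Default normalizer: canonical normalization only.
    Console.WriteLine("aabbcc".RemoveDuplicateChars());                        // "abc"
    // Custom normalizer: fold case when looking for duplicates.
    Console.WriteLine("AaBb".RemoveDuplicateChars(x => x.ToLowerInvariant())); // "AB"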
Unit Tests
Let's test a string that contains variations on the letter A, including the Angstrom sign Å. The Angstrom sign has Unicode codepoint U+212B, but can also be constructed as the letter A with the combining diacritic U+030A. Both represent the same character.
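A quick way to see that equivalence (a minimal sketch; string.Normalize() defaults to normalization form C):

    // Both encodings normalize to the precomposed letter Å (U+00C5).
    Console.WriteLine("\u212B".Normalize() == "A\u030A".Normalize()); // True
    Console.WriteLine("A\u030A".Normalize() == "\u00C5");             // True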
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class Fixtures
{
    [TestMethod]
    public void Fixture()
    {
        // ÅÅAaA -> ÅAa
        Assert.AreEqual("ÅAa", "\u212BA\u030AAaA"
            .RemoveDuplicateChars());

        // ÅÅAaA -> ÅA
        // Note that ToLowerInvariant is used to normalize characters
        // when searching for duplicates; it does not mean the output gets
        // transformed to lower case.
        Assert.AreEqual("ÅA", "\u212BA\u030AAaA"
            .RemoveDuplicateChars(x => x.Normalize().ToLowerInvariant()));
    }
}
1 Answer
There is not much to say other than the usual missing argument null check.
It is valid to write the following:
string test = null;
test.RemoveDuplicateChars();
and RemoveDuplicateChars will be called with null for the this argument text. Therefore you'll have to test for null:
public static string RemoveDuplicateChars(
    this string text, Func<string, string> normalizer = null)
{
    if (text == null)
        return text;
    ...
or throw an exception...
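A sketch of the throwing variant, using the conventional ArgumentNullException guard:

    public static string RemoveDuplicateChars(
        this string text, Func<string, string> normalizer = null)
    {
        if (text == null)
            throw new ArgumentNullException(nameof(text));
        ...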
The default initialization of normalizer could be a little less verbose:
normalizer = normalizer ?? ((x) => x.Normalize());
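Or, assuming C# 8 or later is available, the null-coalescing assignment operator is terser still:

    normalizer ??= x => x.Normalize();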
A minor detail about Angstrom and Å: the letter Å is also represented by \u00C5, which your code interprets as equal to the Angstrom sign, yet MS Word treats them as different when using its Find function.
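That difference is easy to reproduce; a minimal sketch (Word presumably compares raw code units, i.e. ordinally, rather than normalizing first):

    // Different code units, so an ordinal comparison sees two distinct strings...
    Console.WriteLine(string.Equals("\u00C5", "\u212B", StringComparison.Ordinal)); // False
    // ...but after canonical normalization they are identical.
    Console.WriteLine("\u00C5".Normalize() == "\u212B".Normalize());                // True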
- dfhwze (Sep 24, 2019): Your last point is particularly interesting. Both \u00C5 and \u212B are the Angstrom symbol. The lambda that you rewrote in a more compact way handles normalisation. This means that both these characters represent the same glyph, so normalized they are exactly the same: docs.microsoft.com/en-us/dotnet/api/…. I bet MS Word does not normalize, or normalizes using a different algorithm.
- user73941 (Sep 24, 2019): @dfhwze: My point about Angstrom was just to communicate the difference because I observed it, not to tell what is most correct, and I don't think that Word is the most reliable witness of truth :)
- dfhwze (Sep 24, 2019): That's the thing, it's hard to tell what is correct, because it depends on the rules against which equivalence is checked. That's why I found your last paragraph spot on. Finding duplicates in Unicode characters is context-bound, maybe even subjective to some point.