Remove duplicate chars from String

Question 1

I have retackled this problem with the help I received on my earlier question, Remove duplications from a Java String, but this time using a LinkedHashSet. Previously I was using a HashSet, and the characters in the answer came out in the wrong order.

From this implementation, my run time should be \$O(n)\$, correct? Does anyone see any room for improvement, or any mistakes, in my code?

public static void main(String[] args){
    String test = "Banana";
    LinkedHashSet knownChars = new LinkedHashSet();
    StringBuilder noDups = new StringBuilder();
    for(Character c : test.toCharArray()){
        if(!knownChars.contains(c)){
            knownChars.add(c);
            noDups.append(c);
        }
    }
    System.out.println("No duplicate string is: " + noDups);
}

Question 2

This approach works, but why are you using a LinkedHashSet? Just use a HashSet and it should work fine as well.

Question 3

I was previously using a HashSet, but the results were printed out of order. For example, given the string "Banana", the result would be "aBn": still correct, but not in order.
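The difference described here only shows up when the result is read back by iterating the set itself. A minimal sketch of that situation follows; the class and method names (`SetOrderDemo`, `fill`, `join`) are made up for illustration and do not come from the original code.

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

public class SetOrderDemo {
    // Adds every char of the input to the given set and returns the set.
    public static Set<Character> fill(Set<Character> set, String input) {
        for (char c : input.toCharArray()) {
            set.add(c);
        }
        return set;
    }

    // Builds a String by walking the set in its own iteration order.
    public static String join(Set<Character> set) {
        StringBuilder sb = new StringBuilder(set.size());
        for (char c : set) {
            sb.append(c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // HashSet makes no promise about iteration order, so this line may
        // print the characters in an order other than first appearance.
        System.out.println(join(fill(new HashSet<Character>(), "Banana")));
        // LinkedHashSet iterates in insertion order.
        System.out.println(join(fill(new LinkedHashSet<Character>(), "Banana"))); // prints "Ban"
    }
}
```

If the output is instead built in a separate StringBuilder while the input is scanned, the set's iteration order never matters.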

Question 4

You can just say O(n); there is no need to say BigO every time.

Question 5

You should mention that your code is based on this answer to your previous question.

Question 6

The below points are in no particular order:

Extraction into Methods

The utility and the test should be split into separate methods, i.e., you should have a method String removeDuplicates(String) containing the logic for removing duplicates, with main only calling it and printing the result.

Manual Boxing

Declaring the loop variable as Character boxes every char by hand, which is unnecessary. Just use char and let javac box it where a Character is needed (autoboxing of primitives is supported since JDK 1.5).

Why `LinkedHashSet`?

A HashSet can be used just fine here: the order of the resulting deduplicated String is determined by the order in which characters are appended to the StringBuilder, not by the HashSet. (Believe me, I checked. You can check too: http://ideone.com/K9Ku2p)
Note that the add method of a Set returns a boolean: true if the set did not already contain the element and it has been added successfully, or false if the element was already present in the Set. Exploiting this makes the call to Set.contains(...) redundant; see the example code.
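To make the return value of add concrete, here is a small sketch that records what add reports for each character of the test string. The class `AddReturnValueDemo` and its `trace` method are hypothetical names used only for this illustration.

```java
import java.util.HashSet;
import java.util.Set;

public class AddReturnValueDemo {
    // Returns one "char:result" entry per input character, where result is
    // the boolean returned by Set.add for that character.
    public static String trace(String input) {
        Set<Character> seen = new HashSet<>();
        StringBuilder log = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (log.length() > 0) {
                log.append(' ');
            }
            log.append(c).append(':').append(seen.add(c));
        }
        return log.toString();
    }

    public static void main(String[] args) {
        // prints "B:true a:true n:true a:false n:false a:false"
        System.out.println(trace("Banana"));
    }
}
```

Each character yields true only on its first appearance, which is exactly the condition under which it should be appended to the output.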

Generics

Use generics. Don't use raw collections: they bypass compile-time type checking, so an element of the wrong type can be added without the compiler complaining. In your case you might not see the immediate benefit, but it is good practice when scaling to larger programs. Here, using generics is as simple as changing LinkedHashSet knownChars = new LinkedHashSet(); to LinkedHashSet<Character> knownChars = new LinkedHashSet<>(); (this needs JDK 1.7+ for the diamond type inference; on JDK 1.5 and 1.6 it has to be LinkedHashSet<Character> knownChars = new LinkedHashSet<Character>();).
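As a rough illustration of what the raw type lets through, the sketch below adds a String to a raw set that was meant to hold characters. `RawTypeDemo` and its methods are invented names for this example only.

```java
import java.util.HashSet;
import java.util.Set;

public class RawTypeDemo {
    // With a raw Set the compiler accepts any element type (with warnings).
    @SuppressWarnings({"rawtypes", "unchecked"})
    public static int rawSetSize() {
        Set raw = new HashSet();
        raw.add('a');  // autoboxed to Character
        raw.add("a");  // a String slips in unnoticed; it is not equal to 'a'
        return raw.size();
    }

    // With Set<Character> the same mistake is rejected at compile time.
    public static int typedSetSize() {
        Set<Character> typed = new HashSet<>();
        typed.add('a');
        // typed.add("a");  // does not compile: String is not a Character
        return typed.size();
    }

    public static void main(String[] args) {
        System.out.println(rawSetSize());   // prints 2
        System.out.println(typedSetSize()); // prints 1
    }
}
```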

Space-time tradeoffs

To minimize the number of reallocations of the underlying buffers of the StringBuilder and the HashSet, initialize them with an initial capacity equal to the largest size they could reach, which is the length of the input String. Use the constructors which take an int capacity parameter. See the example code for details.
There is a gotcha involving the load factor of the HashSet (a parameter which decides how full a HashSet may get before it is resized): with the default load factor, a HashSet created with the capacity from the previous point may still be resized before it is full. To avoid this, also set the load factor to 1.0f, using the new HashSet(int capacity, float loadFactor) constructor overload.
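The arithmetic behind the load factor can be sketched as follows. The helper `capacityFor` is hypothetical (it is not part of the JDK); it relies on the documented rule that a hash table is resized once the number of entries exceeds the product of the load factor and the current capacity, and 0.75 is the documented default load factor.

```java
import java.util.HashSet;
import java.util.Set;

public class CapacityDemo {
    // Hypothetical helper: smallest capacity whose resize threshold
    // (capacity * loadFactor) is at least expectedElements.
    public static int capacityFor(int expectedElements, float loadFactor) {
        return (int) Math.ceil(expectedElements / (double) loadFactor);
    }

    public static void main(String[] args) {
        int n = "Banana".length(); // at most n distinct characters

        // Option 1: keep the default load factor and ask for more capacity.
        Set<Character> roomy = new HashSet<>(capacityFor(n, 0.75f));

        // Option 2: ask for exactly n and raise the load factor to 1.0f.
        Set<Character> tight = new HashSet<>(n, 1.0f);

        System.out.println(capacityFor(n, 0.75f)); // prints 8
        System.out.println(capacityFor(n, 1.0f));  // prints 6
        System.out.println(roomy.isEmpty() && tight.isEmpty()); // prints true
    }
}
```

Either option is meant to avoid a resize for up to n elements; the second trades a fuller table (more collisions) for less memory.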

Miscellaneous

Type to interfaces, e.g., use Set<Character> knownChars = new LinkedHashSet<>(); instead of LinkedHashSet<Character> knownChars = new LinkedHashSet<>();. This makes your code more resilient to refactoring in general: you can switch to a different Set implementation at any time by changing one word instead of two.
Qualify your method parameters with final if you are not going to reassign them. Granted, this is redundant as far as the caller is concerned: any reassignment of input inside removeDuplicates will not affect test in main, and since String is immutable the caller's copy cannot be altered either way. It's a good practice anyway.
Better output messages - see the example code for an example.
Better variable naming - it's already quite good, but try to use full words. See the example code.
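The point about final parameters can be illustrated with a short sketch. The class `FinalParameterDemo` and its methods are hypothetical names; the behaviour shown is the standard Java rule that arguments are passed by value, so rebinding a parameter is invisible to the caller.

```java
public class FinalParameterDemo {
    // Without final, the parameter can be rebound inside the method.
    public static String reassign(String input) {
        input = input + "!"; // rebinds the local parameter only
        return input;
    }

    // With final, accidental rebinding is a compile-time error.
    public static String noReassign(final String input) {
        // input = input + "!"; // does not compile: input is final
        return input + "!";
    }

    // Returns the caller's variable after the callee reassigned its parameter.
    public static String callerViewAfterReassign() {
        String test = "Banana";
        reassign(test);
        return test;
    }

    public static void main(String[] args) {
        System.out.println(callerViewAfterReassign()); // prints "Banana"
        System.out.println(noReassign("Banana"));      // prints "Banana!"
    }
}
```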

Example Code (Ideone):

import java.util.Set;
import java.util.HashSet;

// Store in a file `StringUtilities.java`
public class StringUtilities
{
    public static void main(String[] args)
    {
        String test = "Banana";
        System.out.println("Test string \"" + test + "\" with duplicates removed is: \"" + removeDuplicates(test) + "\"");
    }

    // Returns input with every character after its first occurrence removed.
    public static String removeDuplicates(final String input)
    {
        // Sized for the worst case (all characters distinct); load factor 1.0f
        // keeps the set from resizing before it holds input.length() elements.
        Set<Character> knownCharacters = new HashSet<>(input.length(), 1.0f);
        StringBuilder noDuplicates = new StringBuilder(input.length());
        for (char character : input.toCharArray())
        {
            // add returns true only the first time a character is seen
            if (knownCharacters.add(character))
            {
                noDuplicates.append(character);
            }
        }
        return noDuplicates.toString();
    }
}

Question 7

@mdfst13, Thanks for catching that terminology issue there between functions and methods, guess I've been doing too much Scala recently. I normally consider static methods not accessing state to be functions in Java, but I agree that using methods everywhere is more consistent. Also, by that redundancy comment I meant that since String is immutable, the caller's copy of whatever was passed to input wouldn't be altered either way - I'll edit to make it clearer.

Question 8

The maximal capacity of the set should be larger (length * 4 / 3) if resizing should be avoided as the parameter represents the internal array size, not the resize threshold. The if (!knownCharacters.contains(character)) is redundant, if (!knownCharacters.add(character)) noDuplicates.append(character); would be sufficient.

Question 9

@Nevay, about the 2nd point, I thought so too, but for whatever reason that way returns wrong results - check it yourself ("ana" with just add, "Ban" with add and contains).

Question 10

Sorry, typo in my last comment, it should be if (knownCharacters.add(character)) ....
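For reference, the two variants discussed in these comments can be compared side by side. This is only a sketch; `NegationDemo` and its method names are made up for the comparison.

```java
import java.util.HashSet;
import java.util.Set;

public class NegationDemo {
    // Variant from the comment containing the typo: appends only when add
    // returns false, i.e. when the character is a repeat.
    public static String withNegatedAdd(String input) {
        Set<Character> known = new HashSet<>();
        StringBuilder out = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (!known.add(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Corrected variant: appends only when add returns true, i.e. on the
    // first occurrence of the character.
    public static String withAdd(String input) {
        Set<Character> known = new HashSet<>();
        StringBuilder out = new StringBuilder();
        for (char c : input.toCharArray()) {
            if (known.add(c)) {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(withNegatedAdd("Banana")); // prints "ana"
        System.out.println(withAdd("Banana"));        // prints "Ban"
    }
}
```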

Question 11

@TamoghnaChowdhury thank you for your insightful response. I did not notice how much I could improve this code. Two of the eye-openers were the generics and specifying the size of the StringBuilder.

Tamoghna Chowdhury · Accepted Answer · 2017-07-01 09:02:05Z

