What is an efficient algorithm to removing all duplicates in a string?
For example : aaaabbbccdbdbcd
Required result: abcd
-
Do you need to maintain / impose order on the result?Rob Fonseca-Ensor– Rob Fonseca-Ensor2010年02月18日 07:13:52 +00:00Commented Feb 18, 2010 at 7:13
19 Answers 19
You use a hashtable to store currently discovered keys (access O(1)) and then loop through the array. If a character is in the hashtable, discard it. If it isn't add it to the hashtable and a result string.
Overall: O(n) time (and space).
The naive solution is to search for the character is the result string as you process each one. That O(n2).
9 Comments
This closely related to the question: Detecting repetition with infinite input.
The hashtable approach may not be optimal depending on your input. Hashtables have a certain amount of overhead (buckets, entry objects). It is huge overhead compared to the actual stored char. (If you target environment is Java it is even worse as the HashMap is of type Map<Character,?>
.) The worse case runtime for a Hashtable access is O(n) due to collisions.
You need only 8kb too represent all 2-byte unicode characters in a plain BitSet. This may be optimized if your input character set is more restricted or by using a compressed BitSets (as long as you have a sparse BitSet). The runtime performance will be favorable for a BitSet it is O(1).
2 Comments
BitSet
to implement your own HashTable
as a proof that the HashTable
is a perfectly viable solution.In Python
>>> ''.join(set("aaaabbbccdbdbcd"))
'acbd'
If the order needs to be preserved
>>> q="aaaabbbccdbdbcd" # this one is not
>>> ''.join(sorted(set(q),key=q.index)) # so efficient
'abcd'
or
>>> S=set()
>>> res=""
>>> for c in "aaaabbbccdbdbcd":
... if c not in S:
... res+=c
... S.add(c)
...
>>> res
'abcd'
or
>>> S=set()
>>> L=[]
>>> for c in "aaaabbbccdbdbcd":
... if c not in S:
... L.append(c)
... S.add(c)
...
>>> ''.join(L)
'abcd'
In python3.1
>>> from collections import OrderedDict
>>> ''.join(list(OrderedDict((c,0) for c in "aaaabbbccdbdbcd").keys()))
'abcd'
2 Comments
Keep an array of 256 "seen" booleans, one for each possible character. Stream your string. If you haven't seen the character before, output it and set the "seen" flag for that character.
PHP algorythm - O(n):
function remove_duplicate_chars($str) {
if (2 > $len = strlen($str)) {
return $str;
}
$flags = array_fill(0,256,false);
$flags[ord($str[0])]=true;
$j = 1;
for ($i=1; $i<$len; $i++) {
$ord = ord($str[$i]);
if (!$flags[$ord]) {
$str[$j] = $str[$i];
$j++;
$flags[$ord] = true;
}
}
if ($j<$i) { //if duplicates removed
$str = substr($str,0,$j);
}
return $str;
}
echo remove_duplicate_chars('aaaabbbccdbdbcd'); // result: 'abcd'
Comments
You Can Do this in O(n) only if you are using HashTable. Code is given below Please Note- It is assumed that number of possible characters in input string are 256
void removeDuplicates(char *str)
{
int len = strlen(str); //Gets the length of the String
int count[256] = {0}; //initializes all elements as zero
int i;
for(i=0;i<len;i++)
{
count[str[i]]++;
if(count[str[i]] == 1)
printf("%c",str[i]);
}
}
Comments
#include <iostream>
#include<string>
using namespace std;
#define MAX_SIZE 256
int main()
{
bool arr[MAX_SIZE] = {false};
string s;
cin>>s;
int k = 0;
for(int i = 0; i < s.length(); i++)
{
while(arr[s[i]] == true && i < s.length())
{
i++;
}
if(i < s.length())
{
s[k] = s[i];
arr[s[k]] = true;
k++;
}
}
s.resize(k);
cout << s<< endl;
return 0;
}
4 Comments
arr[s[i]] == true
mean?arr[s[i]]
is what I'm interested in - do I understand correctly that you use char
(which is an int
) to index the array? string newString = new string("aaaaabbbbccccdddddd".ToCharArray().Distinct().ToArray());
or
char[] characters = "aaaabbbccddd".ToCharArray();
string result = string.Empty ;
foreach (char c in characters)
{
if (result.IndexOf(c) < 0)
result += c.ToString();
}
3 Comments
In C++, you'd probably use an std::set
:
std::string input("aaaabbbccddd");
std::set<char> unique_chars(input.begin(), input.end());
In theory you could use std::unordered_set
instead of std::set
, which should give O(N) expected overall complexity (though O(N2) worst case), where this one is O(N lg M) (where N=number of total characters, M=number of unique characters). Unless you have long strings with a lot of unique characters, this version will probably be faster though.
Comments
You can sort the string and then remove the duplicate characters.
#include <iostream>
#include <algorithm>
#include <string>
int main()
{
std::string s = "aaaabbbccdbdbcd";
std::sort(s.begin(), s.end());
s.erase(std::unique(s.begin(), s.end()), s.end());
std::cout << s << std::endl;
}
Comments
This sounds like a perfect use for automata.
Comments
C++ - O(n) time, O(1) space, and the output is sorted.
std::string characters = "aaaabbbccddd";
std::vector<bool> seen(std::numeric_limits<char>::max()-std::numeric_limits<char>::min());
for(std::string::iterator it = characters.begin(), endIt = characters.end(); it != endIt; ++it) {
seen[(*it)-std::numeric_limits<char>::min()] = true;
}
characters = "";
for(char ch = std::numeric_limits<char>::min(); ch != std::numeric_limits<char>::max(); ++ch) {
if( seen[ch-std::numeric_limits<char>::min()] ) {
characters += ch;
}
}
in C this is how i did it: O(n) in time since we only have one for loop.
void remDup(char *str)
{
int flags[256] = { 0 };
for(int i=0; i<(int)strlen(str); i++) {
if( flags[str[i]] == 0 )
printf("%c", str[i]);
flags[str[i]] = 1;
}
}
Comments
import java.util.HashSet;
public class RemoveDup {
public static String Duplicate()
{
HashSet h = new HashSet();
String value = new String("aaaabbbccdbdbcd");
String finalString = new String();
int stringLength = value.length();
for (int i=0;i<=stringLength-1;i++)
{
if(h.add(value.charAt(i)))
{
finalString = finalString + (value.charAt(i));
}
}
return finalString;
}
public static void main(String[] args) {
System.out.println(Duplicate());
}
}
Comments
get a list of first 26 prime numbers.. Now you can map each character (a,b,c,d etc) to each prime number.. (alphabetically say a=2, b=3, c=5 etc.. or depending upon relative abundance of the characters like most frequently used letter with lower prime say e=2, r=3, a=5 etc)...store that mapping in an integer array int prime[26]..
iterate through all the characters of the string
i=0;
int product = 1;
while(char[i] != null){
if(product % prime[i] == 0)
the character is already present delete it
else
product = product*prime[i];
}
this algorithm will work in O(n) time.. with O(1) space requirement It will work well when number of distinct character are less in the string... other wise product will exceed "int" range and we have to handle that case properly
Comments
int main()
{
std::string s = "aaacabbbccdbdbcd";
std::set<char> set1;
set1.insert(s.begin(), s.end());
for(set<char>::iterator it = set1.begin(); it!= set1.end(); ++it)
std::cout << *it;
return 0;
}
std::set takes O(log n) to insert
Comments
O(n) solution:
#include<stdio.h>
#include<string.h>
#include<stdlib.h>
void removeDuplicates(char *);
void removeDuplicates(char *inp)
{
int i=0, j=0, FLAG=0, repeat=0;
while(inp[i]!='0円')
{
if(FLAG==1)
{
inp[i-repeat]=inp[i];
}
if(j==(j | 1<<(inp[i]-'0円')))
{
repeat++;
FLAG=1;
}
j= j | 1<<(inp[i]-'0円');
i++;
}
inp[i-repeat]='0円';
}
int main()
{
char inp[100] = "aaAABCCDdefgccc";
//char inp[100] = "ccccc";
//char inp[100] = "0円";
//char *inp = (char *)malloc(sizeof(char)*100);
printf (" INPUT STRING : %s\n", inp);
removeDuplicates(inp);
printf (" OUTPUT STRING : %s:\n", inp);
return 1;
}
Comments
Perhaps the use of built in Python functions are more efficient that those "self made". Like this:
=====================
NOTE: maintain order
CODE
string = "aaabbbccc"
product = reduce((lambda x,y: x if (y in x) else x+y), string)
print product
OUTPUT
abc
=========================
NOTE: order neglected
CODE
string = "aaabssabcdsdwa"
str_uniq = ''.join(set(string))
print str_uniq
OUTPUT
acbdsw
Comments
# by using python
def cleantext(word):
if(len(word)==1):
return word
if word[0]==word[1]:
return cleantext(word[1:])
return word[0]+ cleantext(word[1:])
print(cleantext(word))