I have piece of code in C which removes any duplicate characters in a string. But I am doing it in two loops and would like it to optimize it to one loop.
void removeDuplicates(char* ptr)
{
int end = 1;
int length = strlen(ptr);
int current;
int i;
current = 1;
for(end=1; end<length; end++)
{
for(i=0; i<end; i++)
{
if(ptr[end] == ptr[i])
break;
}
if(i == end)
{
ptr[current] = ptr[i];
current++;
}
}
ptr[current] = 0;
}
int main(int argc, char** argv)
{
char str[256] = {0,};
gets(str);
removeDuplicates(str);
printf("%s\n", str);
return 1;
}
3 Answers 3
Remove Duplicates: (Re-Write Ant's code so it is readable).
#include <stdio.h>
void removeDuplicates(char* ptr);
void removeDuplicates(char* ptr)
{
char seen[UCHAR_MAX+1] = {0};
//
// Smallest data type is a char (unless you want to bit
// twiddle, and that has been shown to be not worth it
// see the complaints about std::vector<bool> in C++).
// 256 bytes is not much.
//
// By using char as they type it prevents all the complex
// multiplication/division an bit twiddling that Ant was doing.
unsigned char* source = (unsigned char*)ptr;
unsigned char* dest = (unsigned char*)ptr;
unsigned char next;
do
{
next = *source; // Get next character.
if (!seen[next]) // Only enter the `if block` if we have not seen it.
{
seen[next] = 1; // Mark it as seen
*dest = next; // Move it to destination.
// Note: source will change faster than dest when we
// start seeing dupes, this acts as the copy back
// over the duplicates.
++dest;
}
++source;
}
while (next != '0円'); // Once we have copied the null terminator quit.
}
-
\$\begingroup\$ That's about 100x more readable, but about 100x less fun... ;) If you're making assumptions about the size of a char you should be able to get away with
char seen[127]
. ASCII is only 7-bit. \$\endgroup\$Ant– Ant2011年10月18日 15:35:18 +00:00Commented Oct 18, 2011 at 15:35 -
\$\begingroup\$ No. Its 50x less fun. Still lots of fun left. \$\endgroup\$Loki Astari– Loki Astari2011年10月18日 15:52:20 +00:00Commented Oct 18, 2011 at 15:52
I believe this should turn out to be faster and more efficient:
#include <stdio.h>
void removeDuplicates(char* ptr);
void removeDuplicates(char* ptr) {
int seen[(sizeof(char) << 8) / (sizeof(int) * 8)] = {0,};
char* source = ptr;
char* dest = ptr;
while (*source != '0円') {
int destIndex = (*source) / (sizeof(int) * 8);
int destBit = 1 << (*source) % (sizeof(int) * 8);
if (!(seen[destIndex] & destBit)) {
*dest = *source;
++dest;
seen[destIndex] |= destBit;
}
++source;
}
*dest = '0円';
}
int main (int argc, const char * argv[])
{
char str[256] = {0,};
gets(str);
removeDuplicates(str);
printf("%s\n", str);
return 0;
}
It uses an array of ints as an array of booleans, each of which indicates whether a character has been seen. It eliminates the strlen()
call (which iterates over the entire string) and replaces it with a NULL check. It uses two pointers into the string, source
and dest
, which track the current read location (iterating over the original string) and the current write location (the destination for in-place copying of the next non-duplicate character).
-
\$\begingroup\$ Sorry you deserve the credit here. \$\endgroup\$Loki Astari– Loki Astari2014年12月04日 05:00:25 +00:00Commented Dec 4, 2014 at 5:00
If you're optimizing for time, you could waste some memory and use an array of length sizeof(char)
-- let's call it seen
-- initialized with zeros.
In the loop you'd a few things:
- if the
seen
count is greater than zero, increment anoffset
variable (the value of this offset is equal to how many duplicates of all characters have been found so far) - copy the current character to a new position
offset
characters behind, if it was not previously seen -- meaning, ifseen[current_char]
is zero (possibly only if offset is non-zero, otherwise "from" and "to" position are the same) - Increment
seen[current_char]
at the end of the loop
Note that current_char
is really the value of the character, not the index of the character in the input string.
gets()
even in test code. Forget it exists; usefgets()
instead. \$\endgroup\$