Removing any duplicate characters in a string

Question 1

I have piece of code in C which removes any duplicate characters in a string. But I am doing it in two loops and would like it to optimize it to one loop.

void removeDuplicates(char* ptr)
{
 int end = 1;
 int length = strlen(ptr);
 int current;
 int i;
 current = 1;
 for(end=1; end<length; end++)
 {
 for(i=0; i<end; i++)
 {
 if(ptr[end] == ptr[i])
 break;
 }
 if(i == end)
 {
 ptr[current] = ptr[i];
 current++;
 }
 }
 ptr[current] = 0;
}
int main(int argc, char** argv)
{
 char str[256] = {0,};
 gets(str);
 removeDuplicates(str);
 printf("%s\n", str);
 return 1;
}

Question 2

Don't use gets() even in test code. Forget it exists; use fgets() instead.

Question 3

Remove Duplicates: (Re-Write Ant's code so it is readable).

#include <stdio.h>
void removeDuplicates(char* ptr);
void removeDuplicates(char* ptr) 
{
 char seen[UCHAR_MAX+1] = {0};
 //
 // Smallest data type is a char (unless you want to bit
 // twiddle, and that has been shown to be not worth it
 // see the complaints about std::vector<bool> in C++).
 // 256 bytes is not much.
 // 
 // By using char as they type it prevents all the complex
 // multiplication/division an bit twiddling that Ant was doing.
 unsigned char* source = (unsigned char*)ptr;
 unsigned char* dest = (unsigned char*)ptr;
 unsigned char next;
 do
 {
 next = *source; // Get next character.
 if (!seen[next]) // Only enter the `if block` if we have not seen it.
 {
 seen[next] = 1; // Mark it as seen
 *dest = next; // Move it to destination.
 // Note: source will change faster than dest when we
 // start seeing dupes, this acts as the copy back
 // over the duplicates.
 ++dest;
 }
 ++source;
 }
 while (next != '0円'); // Once we have copied the null terminator quit.
}

Question 4

That's about 100x more readable, but about 100x less fun... ;) If you're making assumptions about the size of a char you should be able to get away with char seen[127]. ASCII is only 7-bit.

Question 5

No. Its 50x less fun. Still lots of fun left.

Question 6

I believe this should turn out to be faster and more efficient:

#include <stdio.h>
void removeDuplicates(char* ptr);
void removeDuplicates(char* ptr) {
 int seen[(sizeof(char) << 8) / (sizeof(int) * 8)] = {0,};
 char* source = ptr;
 char* dest = ptr;
 while (*source != '0円') {
 int destIndex = (*source) / (sizeof(int) * 8);
 int destBit = 1 << (*source) % (sizeof(int) * 8);
 if (!(seen[destIndex] & destBit)) {
 *dest = *source;
 ++dest;
 seen[destIndex] |= destBit;
 }
 ++source;
 }
 *dest = '0円';
}
int main (int argc, const char * argv[])
{
 char str[256] = {0,};
 gets(str);
 removeDuplicates(str);
 printf("%s\n", str);
 return 0;
}

It uses an array of ints as an array of booleans, each of which indicates whether a character has been seen. It eliminates the strlen() call (which iterates over the entire string) and replaces it with a NULL check. It uses two pointers into the string, source and dest, which track the current read location (iterating over the original string) and the current write location (the destination for in-place copying of the next non-duplicate character).

Question 7

Sorry you deserve the credit here.

Question 8

If you're optimizing for time, you could waste some memory and use an array of length sizeof(char) -- let's call it seen -- initialized with zeros.

In the loop you'd a few things:

if the seen count is greater than zero, increment an offset variable (the value of this offset is equal to how many duplicates of all characters have been found so far)
copy the current character to a new position offset characters behind, if it was not previously seen -- meaning, if seen[current_char] is zero (possibly only if offset is non-zero, otherwise "from" and "to" position are the same)
Increment seen[current_char] at the end of the loop

Note that current_char is really the value of the character, not the index of the character in the input string.

Loki Astari Loki Astari 97.7k5 gold badges126 silver badges341 bronze badges · Answer 1 · 2011-10-18 15:08:03Z

Remove Duplicates: (Re-Write Ant's code so it is readable).

#include <stdio.h>
void removeDuplicates(char* ptr);
void removeDuplicates(char* ptr) 
{
 char seen[UCHAR_MAX+1] = {0};
 //
 // Smallest data type is a char (unless you want to bit
 // twiddle, and that has been shown to be not worth it
 // see the complaints about std::vector<bool> in C++).
 // 256 bytes is not much.
 // 
 // By using char as they type it prevents all the complex
 // multiplication/division an bit twiddling that Ant was doing.
 unsigned char* source = (unsigned char*)ptr;
 unsigned char* dest = (unsigned char*)ptr;
 unsigned char next;
 do
 {
 next = *source; // Get next character.
 if (!seen[next]) // Only enter the `if block` if we have not seen it.
 {
 seen[next] = 1; // Mark it as seen
 *dest = next; // Move it to destination.
 // Note: source will change faster than dest when we
 // start seeing dupes, this acts as the copy back
 // over the duplicates.
 ++dest;
 }
 ++source;
 }
 while (next != '0円'); // Once we have copied the null terminator quit.
}

That's about 100x more readable, but about 100x less fun... ;) If you're making assumptions about the size of a char you should be able to get away with char seen[127]. ASCII is only 7-bit.

Ant Ant 9567 silver badges11 bronze badges · Answer 2 · 2011-10-18 14:23:57Z

I believe this should turn out to be faster and more efficient:

#include <stdio.h>
void removeDuplicates(char* ptr);
void removeDuplicates(char* ptr) {
 int seen[(sizeof(char) << 8) / (sizeof(int) * 8)] = {0,};
 char* source = ptr;
 char* dest = ptr;
 while (*source != '0円') {
 int destIndex = (*source) / (sizeof(int) * 8);
 int destBit = 1 << (*source) % (sizeof(int) * 8);
 if (!(seen[destIndex] & destBit)) {
 *dest = *source;
 ++dest;
 seen[destIndex] |= destBit;
 }
 ++source;
 }
 *dest = '0円';
}
int main (int argc, const char * argv[])
{
 char str[256] = {0,};
 gets(str);
 removeDuplicates(str);
 printf("%s\n", str);
 return 0;
}

It uses an array of ints as an array of booleans, each of which indicates whether a character has been seen. It eliminates the strlen() call (which iterates over the entire string) and replaces it with a NULL check. It uses two pointers into the string, source and dest, which track the current read location (iterating over the original string) and the current write location (the destination for in-place copying of the next non-duplicate character).

\$\begingroup\$ Sorry you deserve the credit here. \$\endgroup\$

Loki Astari
– Loki Astari

2014年12月04日 05:00:25 +00:00
Commented Dec 4, 2014 at 5:00

Joaquim Rendeiro Joaquim Rendeiro 1514 bronze badges · Answer 3 · 2011-10-18 12:53:00Z

If you're optimizing for time, you could waste some memory and use an array of length sizeof(char) -- let's call it seen -- initialized with zeros.

In the loop you'd a few things:

if the seen count is greater than zero, increment an offset variable (the value of this offset is equal to how many duplicates of all characters have been found so far)
copy the current character to a new position offset characters behind, if it was not previously seen -- meaning, if seen[current_char] is zero (possibly only if offset is non-zero, otherwise "from" and "to" position are the same)
Increment seen[current_char] at the end of the loop

Note that current_char is really the value of the character, not the index of the character in the input string.

Stack Exchange Network

Removing any duplicate characters in a string

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Removing any duplicate characters in a string

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions