Testing whether a given string has unique characters

Question 1

Here is my code which tests whether a given string has unique characters. The function is_unique which does a linear search for each character in the string is \$O(n^2)\$ (if I am right). Is there any more efficient way of doing the same with more efficient DS that takes less time (or space)?

I know with some library functions it can be done more easily, such as len(set(strr)) == len(strr)

def is_unique(strr):
 """Without using any library functions"""
 def _contains(string, char):
 for c in string:
 if c == char: return True
 return False
 for index in range(len(strr)):
 if _contains(strr[:index], strr[index]): return False
 return True 
def main():
 for string in ['abc', 'abcc', 'aabc']:
 print is_unique(string)
if __name__ == '__main__':
 main()

Edit 1 (Adds constraints)

Don't change the original string and memory used should be \$O(1)\$ hence sorting the string may not be the valid solution.

Question 2

You can do it in linear time via frequency counting. Make an array that is indexable by characters and initialise it with 0. Run over the string, incrementing the frequency for each character until that results in a value greater than 1 or the string is exhausted. Done.

EXTENDED EXPLANATION

Here is the algorithm expressed in pseudo code, which makes it easy to assess its complexity:

var seen: array [char] of boolean
seen = false
for each c in string_to_be_tested
 if seen[c]
 return false // string is not free duplicates
 seen[c] = true
return true // no character occurs more than once

That is the algorithm as such. CodeYogi has already posted a short and succinct pythonic approximation of it (first as a post scriptum, later it was moved into the question body):

len(set(strr)) == len(strr)

This compares the cardinality of the set of characters contained in the string to its length.

The set contains each character at most once, regardless how often it was added. This means that the cardinality of the set can only be equal to the length of the string if each of the characters occurred exactly once. This is not an algorithm as such, it is a functional expression whose algorithmic complexity is more difficult to assess than that of the simpler - if more verbose - imperative pseudo code.

When I wrote the first paragraph of the answer I was thinking along the lines of its most natural/efficient expression, which is in terms of the operation BTS ('bit test and set', or _bittestandset() as a C intrinsic). Here is a rendering of the algorithm in C, for the usual null-terminated strings:

// allocation and zeroing of the bitmap elided - too ugly and not relevant
bts(bitmap, 0); // pretend we've seen the terminator, so that we drop out at the end
while (!bts(bitmap, *s++))
 ;
return s[-1] == '0円'; // arrived at the end - unique, no duplicate characters

However, the BTS operation tends to be available only in assembly languages and C/C++; hence the description of the algorithm with an array of integer counts, which is much more widely available and which allows a compact approximation of the 'test and set' as 'increment and test':

if (++counts[string[i]] > 1)
 return false

That array is a (partial) frequency count, of course. It is possible to do it exactly like this in python, but that would do neither the algorithm nor the language any justice; it would be ugly and inefficient.

Question 3

"Make an array that is indexable by characters and initialise it with 0" I think you mean python dictionaries here ({'a': 0, 'b': 0, ...}) right?

Question 4

Yes and no. The algorithm as such uses a vector of booleans indexed by character to remember which characters have already been seen; a good approximation for this in python is a set of characters. I'll amend my post to make things clearer. Thanks for pointing out the problem!

Question 5

Also, an early return if we know the character encoding can be helpful too. Eg: if (len(strr) > 256) return False

Question 6

Obviously, if the given string is sorted, finding an answer is a matter of a linear search (\$O(n)\$). Sorting a string is \$O(n \log n)\,ドル which surely beats \$O(n^2)\$.

Question 7

No, its not sure if the string is sorted.

Question 8

I think you mean O(nlogn) + O(n) = O(nlogn)

DarthGizka DarthGizka 2,8211 gold badge13 silver badges30 bronze badges · Accepted Answer · 2014-11-18 06:07:37Z

You can do it in linear time via frequency counting. Make an array that is indexable by characters and initialise it with 0. Run over the string, incrementing the frequency for each character until that results in a value greater than 1 or the string is exhausted. Done.

EXTENDED EXPLANATION

Here is the algorithm expressed in pseudo code, which makes it easy to assess its complexity:

var seen: array [char] of boolean
seen = false
for each c in string_to_be_tested
 if seen[c]
 return false // string is not free duplicates
 seen[c] = true
return true // no character occurs more than once

That is the algorithm as such. CodeYogi has already posted a short and succinct pythonic approximation of it (first as a post scriptum, later it was moved into the question body):

len(set(strr)) == len(strr)

This compares the cardinality of the set of characters contained in the string to its length.

The set contains each character at most once, regardless how often it was added. This means that the cardinality of the set can only be equal to the length of the string if each of the characters occurred exactly once. This is not an algorithm as such, it is a functional expression whose algorithmic complexity is more difficult to assess than that of the simpler - if more verbose - imperative pseudo code.

When I wrote the first paragraph of the answer I was thinking along the lines of its most natural/efficient expression, which is in terms of the operation BTS ('bit test and set', or _bittestandset() as a C intrinsic). Here is a rendering of the algorithm in C, for the usual null-terminated strings:

// allocation and zeroing of the bitmap elided - too ugly and not relevant
bts(bitmap, 0); // pretend we've seen the terminator, so that we drop out at the end
while (!bts(bitmap, *s++))
 ;
return s[-1] == '0円'; // arrived at the end - unique, no duplicate characters

However, the BTS operation tends to be available only in assembly languages and C/C++; hence the description of the algorithm with an array of integer counts, which is much more widely available and which allows a compact approximation of the 'test and set' as 'increment and test':

if (++counts[string[i]] > 1)
 return false

That array is a (partial) frequency count, of course. It is possible to do it exactly like this in python, but that would do neither the algorithm nor the language any justice; it would be ugly and inefficient.

"Make an array that is indexable by characters and initialise it with 0" I think you mean python dictionaries here ({'a': 0, 'b': 0, ...}) right?
Yes and no. The algorithm as such uses a vector of booleans indexed by character to remember which characters have already been seen; a good approximation for this in python is a set of characters. I'll amend my post to make things clearer. Thanks for pointing out the problem!
Also, an early return if we know the character encoding can be helpful too. Eg: if (len(strr) > 256) return False

Stack Exchange Network

Testing whether a given string has unique characters

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Testing whether a given string has unique characters

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions