Check whether a string has all unique characters (time efficiency)

Question 1

I don't find this algorithm in Go language here, so I just want to check that it is really most efficient algorithm in Go:

func checkUnique(s *string) bool {
 var a [256]bool
 for _, ascii := range *s {
 if a[ascii] {
 return false
 }
 a[ascii] = true
 }
 return true
}

Question 2

There is no reason what-so-ever to use a *string argument here instead of a simple string. Although Go is pass-by-value the value of a Go string is effectively two int-sized values (an integer string length and a pointer to the start of the string) so passing a string of any size by value is efficient.

Question 3

This is about the most efficient strategy you can get when it comes to uniqueness checking on a string, as far as I know. It can get pretty crazy with memory inefficiency with bigger data types, but it's still O(N) time.

But unless you are specifically dealing with ASCII-only strings, be careful with hardcoding the size of your filter array. Go will try to parse the string into single Unicode code points, also called runes. You can find an example of that here, which explicitly shows how characters can span multiple bytes.

Based on the answer to this SO question, UTF-8 encoding can be anywhere from 1-4 bytes. Using the same strategy, a bool array that handles all 4 possible bytes would be of size

2³² = 4,294,967,296

Yikes, over 4GB of almost entirely unused memory. The most memory-efficient way I can think of approaching that would be a 256-way tree, where you track each byte encountered by creating nodes for each byte in the rune, and check for the rune by checking for existence of each node in order.

Is it still linear time? Well, existence is easy because the depth of the 256-way tree is at most 4, given the maximum number of bytes in a UTF-8 character, so we check for at most 4 things, which is constant. Same kind of logic applies to setting node values when we encounter unique runes. Since these are done per rune, it's still O(N), more memory efficient, and handles all UTF-8 characters.

I've actually never worked with Go before, thanks for inspiring me to research a bit :)

Question 4

You don't need an array of length \2ドル^{32}\,ドル it's enough to have 0x110000 since the maximum Unicode code point is U+10FFFF, which is a little over one million, not 4 billion.

Question 5

Another alternative to using an array would be to use a map[rune]bool. So as you go along insert runes into the map, then check if a value is in the map: _, ok := m[curr], etc. While it may not be guaranteed \$O(1)\$ insert, I assume the Go runtime has heavily optimized maps.

Question 6

Why not use a golang map as mentioned in one of the comments above?

func unique(arr string) bool {
 m := make(map[rune]bool)
 for _, i := range arr {
 _, ok := m[i]
 if ok {
 return false
 }
 m[i] = true
 }
 return true
}

Time complexity would be O(1) for insertion most of the time and in O(N) it would be done.

Question 7

In Go, string character values are Unicode characters encoded in UTF-8. UTF-8 is a variable-length encoding which uses one to four bytes per character.

Rewriting your ASCII code for Unicode,

unique.go:

package main
import (
 "fmt"
 "unicode/utf8"
)
func ascii(s string) (int, bool) {
 if len(s) > utf8.RuneSelf {
 return 0, false
 }
 var ascii [utf8.RuneSelf]bool
 for i := 0; i < len(s); i++ {
 b := int(s[i])
 if b >= utf8.RuneSelf {
 return 0, false
 }
 if ascii[b] {
 return len(s), false
 }
 ascii[b] = true
 }
 return len(s), true
}
func hashmap(s string) bool {
 var chars = make(map[rune]struct{}, len(s)/(utf8.UTFMax-1))
 for _, r := range s {
 if _, ok := chars[r]; ok {
 if r != utf8.RuneError {
 return false
 }
 }
 chars[r] = struct{}{}
 }
 return true
}
func unique(s string) bool {
 if i, u := ascii(s); i >= len(s) {
 return u
 }
 var chars []uint8
 for _, r := range s {
 if int(r) >= 8*len(chars) {
 var t []uint8
 if r >= rune(1<<16) {
 if len(s) <= 1000 {
 return hashmap(s)
 }
 t = make([]uint8, (utf8.MaxRune+1+7)/8)
 } else if r >= rune(1<<8) {
 t = make([]uint8, (1<<16)/8)
 } else {
 t = make([]uint8, (1<<8)/8)
 }
 copy(t, chars)
 chars = t
 }
 char := chars[r/8]
 if char&(1<<uint(r%8)) != 0 {
 if r != utf8.RuneError {
 return false
 }
 }
 char |= (1 << uint(r%8))
 chars[r/8] = char
 }
 return true
}
func main() {
 var r [utf8.MaxRune + 1]rune
 for i := range r {
 r[i] = rune(i)
 }
 s := string(r[:])
 fmt.Println(unique(s))
 r[0] = r[len(r)-1]
 s = string(r[:])
 fmt.Println(unique(s))
}

Playground: https://play.golang.org/p/31i7YLzJDhM

Output:

true
false

A benchmark using the 95 printable ASCII characters.

$ go test unique.go ascii_test.go -bench=. -benchmem
asciiChar: n 95; min U+0020; max U+007E
BenchmarkASCII-8 18052088 66.0 ns/op 0 B/op 0 allocs/op
$

ascii_test.go:

package main
import (
 "fmt"
 "testing"
 "unicode"
)
// ASCII printable characters
// asciiChar: n 95; min U+0020; max U+007E
var asciiChar = func() []rune {
 var ascii []rune
 for r := rune(0); r <= unicode.MaxASCII; r++ {
 if unicode.IsPrint(r) {
 ascii = append(ascii, r)
 }
 }
 fmt.Printf("asciiChar: n %d; min %U; max %U\n", len(ascii), ascii[0], ascii[len(ascii)-1])
 return ascii
}()
var u bool
func BenchmarkASCII(b *testing.B) {
 s := string(asciiChar)
 b.ResetTimer()
 for N := 0; N < b.N; N++ {
 u = unique(s)
 if !u {
 b.Fatal(u)
 }
 }
}

A benchmark (stress test) using the 89,233 Unicode Han script characters. For Version 12.0 of the Unicode Standard (2019) there are a total of 137,929 characters.

$ go test unique.go han_test.go -bench=. -benchmem
hanChar: n 89233; min U+2E80; max U+2FA1D
BenchmarkHan-8 1602 740748 ns/op 147456 B/op 2 allocs/op
$

han_test.go:

package main
import (
 "fmt"
 "testing"
 "unicode"
)
// Unicode Han script
// hanChar: n 89233; min U+2E80; max U+2FA1D
var hanChar = func() []rune {
 var han []rune
 for r := rune(0); r <= unicode.MaxRune; r++ {
 if unicode.In(r, unicode.Han) {
 han = append(han, r)
 }
 }
 fmt.Printf("hanChar: n %d; min %U; max %U\n", len(han), han[0], han[len(han)-1])
 return han
}()
var u bool
func BenchmarkHan(b *testing.B) {
 s := string(hanChar)
 b.ResetTimer()
 for N := 0; N < b.N; N++ {
 u = unique(s)
 if !u {
 b.Fatal(u)
 }
 }
}

Question 8

You didn't review the code but presented an alternative solution, without even explaining anything. That's against the spirit of this site.

Question 9

@RolandIllig: I did review the code. The code is for extended ASCII,. In Go, it should be for Unicode.

TCFP TCFP 3341 silver badge3 bronze badges · Accepted Answer · 2019-07-03 20:10:36Z

This is about the most efficient strategy you can get when it comes to uniqueness checking on a string, as far as I know. It can get pretty crazy with memory inefficiency with bigger data types, but it's still O(N) time.

But unless you are specifically dealing with ASCII-only strings, be careful with hardcoding the size of your filter array. Go will try to parse the string into single Unicode code points, also called runes. You can find an example of that here, which explicitly shows how characters can span multiple bytes.

Based on the answer to this SO question, UTF-8 encoding can be anywhere from 1-4 bytes. Using the same strategy, a bool array that handles all 4 possible bytes would be of size

2³² = 4,294,967,296

Yikes, over 4GB of almost entirely unused memory. The most memory-efficient way I can think of approaching that would be a 256-way tree, where you track each byte encountered by creating nodes for each byte in the rune, and check for the rune by checking for existence of each node in order.

Is it still linear time? Well, existence is easy because the depth of the 256-way tree is at most 4, given the maximum number of bytes in a UTF-8 character, so we check for at most 4 things, which is constant. Same kind of logic applies to setting node values when we encounter unique runes. Since these are done per rune, it's still O(N), more memory efficient, and handles all UTF-8 characters.

I've actually never worked with Go before, thanks for inspiring me to research a bit :)

You don't need an array of length \2ドル^{32}\,ドル it's enough to have 0x110000 since the maximum Unicode code point is U+10FFFF, which is a little over one million, not 4 billion.
Another alternative to using an array would be to use a map[rune]bool. So as you go along insert runes into the map, then check if a value is in the map: _, ok := m[curr], etc. While it may not be guaranteed \$O(1)\$ insert, I assume the Go runtime has heavily optimized maps.

Stack Exchange Network

Check whether a string has all unique characters (time efficiency)

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Check whether a string has all unique characters (time efficiency)

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions