I don't find this algorithm in Go language here, so I just want to check that it is really most efficient algorithm in Go:
func checkUnique(s *string) bool {
var a [256]bool
for _, ascii := range *s {
if a[ascii] {
return false
}
a[ascii] = true
}
return true
}
3 Answers 3
This is about the most efficient strategy you can get when it comes to uniqueness checking on a string, as far as I know. It can get pretty crazy with memory inefficiency with bigger data types, but it's still O(N) time.
But unless you are specifically dealing with ASCII-only strings, be careful with hardcoding the size of your filter array. Go will try to parse the string into single Unicode code points, also called runes. You can find an example of that here, which explicitly shows how characters can span multiple bytes.
Based on the answer to this SO question, UTF-8 encoding can be anywhere from 1-4 bytes. Using the same strategy, a bool array that handles all 4 possible bytes would be of size
232 = 4,294,967,296
Yikes, over 4GB of almost entirely unused memory. The most memory-efficient way I can think of approaching that would be a 256-way tree, where you track each byte encountered by creating nodes for each byte in the rune, and check for the rune by checking for existence of each node in order.
Is it still linear time? Well, existence is easy because the depth of the 256-way tree is at most 4, given the maximum number of bytes in a UTF-8 character, so we check for at most 4 things, which is constant. Same kind of logic applies to setting node values when we encounter unique runes. Since these are done per rune, it's still O(N), more memory efficient, and handles all UTF-8 characters.
I've actually never worked with Go before, thanks for inspiring me to research a bit :)
-
1\$\begingroup\$ You don't need an array of length \2ドル^{32}\,ドル it's enough to have 0x110000 since the maximum Unicode code point is U+10FFFF, which is a little over one million, not 4 billion. \$\endgroup\$Roland Illig– Roland Illig2019年07月05日 05:07:10 +00:00Commented Jul 5, 2019 at 5:07
-
2\$\begingroup\$ Another alternative to using an array would be to use a
map[rune]bool
. So as you go along insert runes into the map, then check if a value is in the map:_, ok := m[curr]
, etc. While it may not be guaranteed \$O(1)\$ insert, I assume the Go runtime has heavily optimized maps. \$\endgroup\$esote– esote2019年07月05日 17:00:58 +00:00Commented Jul 5, 2019 at 17:00
Why not use a golang map as mentioned in one of the comments above?
func unique(arr string) bool {
m := make(map[rune]bool)
for _, i := range arr {
_, ok := m[i]
if ok {
return false
}
m[i] = true
}
return true
}
Time complexity would be O(1) for insertion most of the time and in O(N) it would be done.
In Go, string character values are Unicode characters encoded in UTF-8. UTF-8 is a variable-length encoding which uses one to four bytes per character.
Rewriting your ASCII code for Unicode,
unique.go
:
package main
import (
"fmt"
"unicode/utf8"
)
func ascii(s string) (int, bool) {
if len(s) > utf8.RuneSelf {
return 0, false
}
var ascii [utf8.RuneSelf]bool
for i := 0; i < len(s); i++ {
b := int(s[i])
if b >= utf8.RuneSelf {
return 0, false
}
if ascii[b] {
return len(s), false
}
ascii[b] = true
}
return len(s), true
}
func hashmap(s string) bool {
var chars = make(map[rune]struct{}, len(s)/(utf8.UTFMax-1))
for _, r := range s {
if _, ok := chars[r]; ok {
if r != utf8.RuneError {
return false
}
}
chars[r] = struct{}{}
}
return true
}
func unique(s string) bool {
if i, u := ascii(s); i >= len(s) {
return u
}
var chars []uint8
for _, r := range s {
if int(r) >= 8*len(chars) {
var t []uint8
if r >= rune(1<<16) {
if len(s) <= 1000 {
return hashmap(s)
}
t = make([]uint8, (utf8.MaxRune+1+7)/8)
} else if r >= rune(1<<8) {
t = make([]uint8, (1<<16)/8)
} else {
t = make([]uint8, (1<<8)/8)
}
copy(t, chars)
chars = t
}
char := chars[r/8]
if char&(1<<uint(r%8)) != 0 {
if r != utf8.RuneError {
return false
}
}
char |= (1 << uint(r%8))
chars[r/8] = char
}
return true
}
func main() {
var r [utf8.MaxRune + 1]rune
for i := range r {
r[i] = rune(i)
}
s := string(r[:])
fmt.Println(unique(s))
r[0] = r[len(r)-1]
s = string(r[:])
fmt.Println(unique(s))
}
Playground: https://play.golang.org/p/31i7YLzJDhM
Output:
true
false
A benchmark using the 95 printable ASCII characters.
$ go test unique.go ascii_test.go -bench=. -benchmem
asciiChar: n 95; min U+0020; max U+007E
BenchmarkASCII-8 18052088 66.0 ns/op 0 B/op 0 allocs/op
$
ascii_test.go
:
package main
import (
"fmt"
"testing"
"unicode"
)
// ASCII printable characters
// asciiChar: n 95; min U+0020; max U+007E
var asciiChar = func() []rune {
var ascii []rune
for r := rune(0); r <= unicode.MaxASCII; r++ {
if unicode.IsPrint(r) {
ascii = append(ascii, r)
}
}
fmt.Printf("asciiChar: n %d; min %U; max %U\n", len(ascii), ascii[0], ascii[len(ascii)-1])
return ascii
}()
var u bool
func BenchmarkASCII(b *testing.B) {
s := string(asciiChar)
b.ResetTimer()
for N := 0; N < b.N; N++ {
u = unique(s)
if !u {
b.Fatal(u)
}
}
}
A benchmark (stress test) using the 89,233 Unicode Han script characters. For Version 12.0 of the Unicode Standard (2019) there are a total of 137,929 characters.
$ go test unique.go han_test.go -bench=. -benchmem
hanChar: n 89233; min U+2E80; max U+2FA1D
BenchmarkHan-8 1602 740748 ns/op 147456 B/op 2 allocs/op
$
han_test.go
:
package main
import (
"fmt"
"testing"
"unicode"
)
// Unicode Han script
// hanChar: n 89233; min U+2E80; max U+2FA1D
var hanChar = func() []rune {
var han []rune
for r := rune(0); r <= unicode.MaxRune; r++ {
if unicode.In(r, unicode.Han) {
han = append(han, r)
}
}
fmt.Printf("hanChar: n %d; min %U; max %U\n", len(han), han[0], han[len(han)-1])
return han
}()
var u bool
func BenchmarkHan(b *testing.B) {
s := string(hanChar)
b.ResetTimer()
for N := 0; N < b.N; N++ {
u = unique(s)
if !u {
b.Fatal(u)
}
}
}
-
\$\begingroup\$ You didn't review the code but presented an alternative solution, without even explaining anything. That's against the spirit of this site. \$\endgroup\$Roland Illig– Roland Illig2019年07月05日 05:10:42 +00:00Commented Jul 5, 2019 at 5:10
-
1\$\begingroup\$ @RolandIllig: I did review the code. The code is for extended ASCII,. In Go, it should be for Unicode. \$\endgroup\$peterSO– peterSO2019年07月05日 05:26:14 +00:00Commented Jul 5, 2019 at 5:26
*string
argument here instead of a simplestring
. Although Go is pass-by-value the value of a Gostring
is effectively two int-sized values (an integer string length and a pointer to the start of the string) so passing a string of any size by value is efficient. \$\endgroup\$