5
\$\begingroup\$

I don't find this algorithm in Go language here, so I just want to check that it is really most efficient algorithm in Go:

func checkUnique(s *string) bool {
 var a [256]bool
 for _, ascii := range *s {
 if a[ascii] {
 return false
 }
 a[ascii] = true
 }
 return true
}
dfhwze
14.1k3 gold badges40 silver badges101 bronze badges
asked Jul 3, 2019 at 18:40
\$\endgroup\$
1
  • \$\begingroup\$ There is no reason what-so-ever to use a *string argument here instead of a simple string. Although Go is pass-by-value the value of a Go string is effectively two int-sized values (an integer string length and a pointer to the start of the string) so passing a string of any size by value is efficient. \$\endgroup\$ Commented Jul 7, 2019 at 11:48

3 Answers 3

5
\$\begingroup\$

This is about the most efficient strategy you can get when it comes to uniqueness checking on a string, as far as I know. It can get pretty crazy with memory inefficiency with bigger data types, but it's still O(N) time.

But unless you are specifically dealing with ASCII-only strings, be careful with hardcoding the size of your filter array. Go will try to parse the string into single Unicode code points, also called runes. You can find an example of that here, which explicitly shows how characters can span multiple bytes.

Based on the answer to this SO question, UTF-8 encoding can be anywhere from 1-4 bytes. Using the same strategy, a bool array that handles all 4 possible bytes would be of size

232 = 4,294,967,296

Yikes, over 4GB of almost entirely unused memory. The most memory-efficient way I can think of approaching that would be a 256-way tree, where you track each byte encountered by creating nodes for each byte in the rune, and check for the rune by checking for existence of each node in order.

Is it still linear time? Well, existence is easy because the depth of the 256-way tree is at most 4, given the maximum number of bytes in a UTF-8 character, so we check for at most 4 things, which is constant. Same kind of logic applies to setting node values when we encounter unique runes. Since these are done per rune, it's still O(N), more memory efficient, and handles all UTF-8 characters.

I've actually never worked with Go before, thanks for inspiring me to research a bit :)

answered Jul 3, 2019 at 20:10
\$\endgroup\$
2
  • 1
    \$\begingroup\$ You don't need an array of length \2ドル^{32}\,ドル it's enough to have 0x110000 since the maximum Unicode code point is U+10FFFF, which is a little over one million, not 4 billion. \$\endgroup\$ Commented Jul 5, 2019 at 5:07
  • 2
    \$\begingroup\$ Another alternative to using an array would be to use a map[rune]bool. So as you go along insert runes into the map, then check if a value is in the map: _, ok := m[curr], etc. While it may not be guaranteed \$O(1)\$ insert, I assume the Go runtime has heavily optimized maps. \$\endgroup\$ Commented Jul 5, 2019 at 17:00
2
\$\begingroup\$

Why not use a golang map as mentioned in one of the comments above?

func unique(arr string) bool {
 m := make(map[rune]bool)
 for _, i := range arr {
 _, ok := m[i]
 if ok {
 return false
 }
 m[i] = true
 }
 return true
}

Time complexity would be O(1) for insertion most of the time and in O(N) it would be done.

answered Oct 27, 2020 at 14:37
\$\endgroup\$
0
\$\begingroup\$

In Go, string character values are Unicode characters encoded in UTF-8. UTF-8 is a variable-length encoding which uses one to four bytes per character.

Rewriting your ASCII code for Unicode,

unique.go:

package main
import (
 "fmt"
 "unicode/utf8"
)
func ascii(s string) (int, bool) {
 if len(s) > utf8.RuneSelf {
 return 0, false
 }
 var ascii [utf8.RuneSelf]bool
 for i := 0; i < len(s); i++ {
 b := int(s[i])
 if b >= utf8.RuneSelf {
 return 0, false
 }
 if ascii[b] {
 return len(s), false
 }
 ascii[b] = true
 }
 return len(s), true
}
func hashmap(s string) bool {
 var chars = make(map[rune]struct{}, len(s)/(utf8.UTFMax-1))
 for _, r := range s {
 if _, ok := chars[r]; ok {
 if r != utf8.RuneError {
 return false
 }
 }
 chars[r] = struct{}{}
 }
 return true
}
func unique(s string) bool {
 if i, u := ascii(s); i >= len(s) {
 return u
 }
 var chars []uint8
 for _, r := range s {
 if int(r) >= 8*len(chars) {
 var t []uint8
 if r >= rune(1<<16) {
 if len(s) <= 1000 {
 return hashmap(s)
 }
 t = make([]uint8, (utf8.MaxRune+1+7)/8)
 } else if r >= rune(1<<8) {
 t = make([]uint8, (1<<16)/8)
 } else {
 t = make([]uint8, (1<<8)/8)
 }
 copy(t, chars)
 chars = t
 }
 char := chars[r/8]
 if char&(1<<uint(r%8)) != 0 {
 if r != utf8.RuneError {
 return false
 }
 }
 char |= (1 << uint(r%8))
 chars[r/8] = char
 }
 return true
}
func main() {
 var r [utf8.MaxRune + 1]rune
 for i := range r {
 r[i] = rune(i)
 }
 s := string(r[:])
 fmt.Println(unique(s))
 r[0] = r[len(r)-1]
 s = string(r[:])
 fmt.Println(unique(s))
}

Playground: https://play.golang.org/p/31i7YLzJDhM

Output:

true
false

A benchmark using the 95 printable ASCII characters.

$ go test unique.go ascii_test.go -bench=. -benchmem
asciiChar: n 95; min U+0020; max U+007E
BenchmarkASCII-8 18052088 66.0 ns/op 0 B/op 0 allocs/op
$

ascii_test.go:

package main
import (
 "fmt"
 "testing"
 "unicode"
)
// ASCII printable characters
// asciiChar: n 95; min U+0020; max U+007E
var asciiChar = func() []rune {
 var ascii []rune
 for r := rune(0); r <= unicode.MaxASCII; r++ {
 if unicode.IsPrint(r) {
 ascii = append(ascii, r)
 }
 }
 fmt.Printf("asciiChar: n %d; min %U; max %U\n", len(ascii), ascii[0], ascii[len(ascii)-1])
 return ascii
}()
var u bool
func BenchmarkASCII(b *testing.B) {
 s := string(asciiChar)
 b.ResetTimer()
 for N := 0; N < b.N; N++ {
 u = unique(s)
 if !u {
 b.Fatal(u)
 }
 }
}

A benchmark (stress test) using the 89,233 Unicode Han script characters. For Version 12.0 of the Unicode Standard (2019) there are a total of 137,929 characters.

$ go test unique.go han_test.go -bench=. -benchmem
hanChar: n 89233; min U+2E80; max U+2FA1D
BenchmarkHan-8 1602 740748 ns/op 147456 B/op 2 allocs/op
$

han_test.go:

package main
import (
 "fmt"
 "testing"
 "unicode"
)
// Unicode Han script
// hanChar: n 89233; min U+2E80; max U+2FA1D
var hanChar = func() []rune {
 var han []rune
 for r := rune(0); r <= unicode.MaxRune; r++ {
 if unicode.In(r, unicode.Han) {
 han = append(han, r)
 }
 }
 fmt.Printf("hanChar: n %d; min %U; max %U\n", len(han), han[0], han[len(han)-1])
 return han
}()
var u bool
func BenchmarkHan(b *testing.B) {
 s := string(hanChar)
 b.ResetTimer()
 for N := 0; N < b.N; N++ {
 u = unique(s)
 if !u {
 b.Fatal(u)
 }
 }
}
answered Jul 4, 2019 at 8:25
\$\endgroup\$
2
  • \$\begingroup\$ You didn't review the code but presented an alternative solution, without even explaining anything. That's against the spirit of this site. \$\endgroup\$ Commented Jul 5, 2019 at 5:10
  • 1
    \$\begingroup\$ @RolandIllig: I did review the code. The code is for extended ASCII,. In Go, it should be for Unicode. \$\endgroup\$ Commented Jul 5, 2019 at 5:26

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.