I decided to learn Go, knowing Python. My approach is by solving easy problems like this one:
You are given two strings, A and B. Find if there is a substring that appears in both A and B. Several test cases will be given to you in a single file. The first line of the input will contain a single integer T, the number of test cases.
Then there will be T descriptions of the test cases. Each description contains two lines. The first line contains the string A and the second line contains the string B. For each test case, display YES (in a newline), if there is a common substring. Otherwise, display NO.
The problem is straightforward: check if two sets intersect.
The real problem is that for the biggest testcase (20 strings of length 105 each) my Go code takes 2.6 seconds, which is unacceptable, knowing that my Python code solves the problem in 0.08 seconds. Based on the fact that Go is claimed to be near C speed, I assume that I am doing something super inefficiently.
package main
import "fmt"
func subst(s1 string, s2 string) bool {
m := map[uint8]bool{}
for i := 0; i < len(s1); i++ {
m[s1[i]] = true
}
for i := 0; i < len(s2); i++ {
if m[s2[i]] {
return true
}
}
return false
}
func main() {
var n int
var s1, s2 string
fmt.Scanf("%d", &n)
for i := 0; i < n; i++ {
fmt.Scanf("%s", &s1)
fmt.Scanf("%s", &s2)
if subst(s1, s2) {
fmt.Println("YES")
} else {
fmt.Println("NO")
}
}
}
Python implementation:
def twoStrings(a, b):
if len(set(a) & set(b)):
print 'YES'
else:
print 'NO'
for i in xrange(input()):
twoStrings(raw_input(), raw_input())
Nonetheless, two snippets look really different, they are actually doing similar things. In go
I create a hash-map and put boolean value for each character from the first string (this is how sets are implemented in a lot of languages) and then for each character from the second string I am checking whether it is in a map and if no - exiting. It's obvious in the Python code.
2 Answers 2
This entire answer refers to my rewrite of the asker's code on the Go Playground and is mostly a summary of our discussion on the chat.so Go/Golang room. For reference, I'm reproducing the code below:
package main
import (
"bufio"
"io"
"os"
"strconv"
)
// charSet is a limited bitset to contain all lowercase latin characters (a-z).
type charSet struct {
bits uint32
}
// Set switches a bit to 1 for any given character i that is in the range 'a'
// to 'z'. Characters outside this range have undefined behavior.
func (c *charSet) Set(i uint8) {
c.bits |= 1 << (i - 'a')
}
// Intersects returns whether two charSets intersect (i.e., share one or more
// on bits).
func (c charSet) Intersects(o charSet) bool {
return c.bits&o.bits != 0
}
// set returns a charSet for all bytes in s. Bytes in s must be in the range of
// 'a' to 'z'. Anything outside that range is regarded as undefined behavior.
func set(s []byte) charSet {
var c charSet
for i := 0; i < len(s); i++ {
c.Set(s[i])
}
return c
}
// subst returns whether two strings share any characters. l and r are assumed
// to only contain characters in the range of 'a' to 'z'.
func subst(l, r []byte) bool {
return set(l).Intersects(set(r))
}
func main() {
var n int
r := bufio.NewReader(os.Stdin)
if l, err := r.ReadString('\n'); err == nil && len(l) > 1 {
if n, err = strconv.Atoi(l[:len(l)-1]); err != nil {
panic(err)
}
} else if err != nil {
panic(err)
}
for i := 0; i < n; i++ {
var s1, s2 []byte
var err error
s1, err = r.ReadBytes('\n')
if err == nil {
s2, err = r.ReadBytes('\n')
}
if err != nil && err != io.EOF {
panic(err)
}
if len(s1) > 0 && len(s2) > 0 && subst(s1[:len(s1)-1], s2[:len(s2)-1]) {
io.WriteString(os.Stdout, "YES\n")
} else {
io.WriteString(os.Stdout, "NO\n")
}
}
}
There are two major changes here to get this to the point that it at least beats Python in my own test runs on HackerRank:
Replace fmt.Scanf with just reading and parsing where necessary. The only input that requires actual parsing is the first integer, so there's no need for fmt.Scanf beyond that. Even initially we don't need it, however, because we're really only interested in parsing the first line. As such, I just replaced the entire thing with a buffered reader and read up to a delimiter, assuming that all lines end with
'\n'
(they should for this case). After that, I just run the first line throughstrconv.Atoi
.This chops off a reasonable amount of time in reading because it avoids parsing a format string and then modifying the input read to conform to the format string, as well as possible time spent in reflection depending on the data. I don't think it'll actually hit the reflect package here, but I haven't looked at the fmt.Fscanf implementation, so take that with a grain of salt.
Use a limited bitset (i.e., just one integer) instead of
map[uint8]bool
. This is more obvious as to why it's faster: there's no need to allocate more than four bytes on the stack to use an integer, and because the problem constrains the input strings to lowercase characters in the range of 'a' to 'z', we know it'll fit in the bounds of a 32-bit integer.The
charSet
type andset
function (which is just me being lazy and sort of mimicking the Python code) covers the entire implementation of this. All it does is, given a byte, update the underlying bits integer and test for an intersection using a bitwise and. To get a complete bitset for a string, or byte slice, just iterate over its length and set those bits.This results in pretty simple code, an acceptable interface for working with a bitset that makes changes to the implementation easier, and it's much faster than allocating memory for a map, creating new cells in the map, and lookups in the map. (I think it's likely that some improvement would be seen by giving the map enough initial capacity for the cardinality of its key, but I doubt if it'd be significant — haven't tested it, so that's something to look into.) These are all fast enough for day to day things, but if we're trying to beat the time on some program, data structures are low-hanging fruit.
With those changes, I get, at worst, 0.01s on HackerRank for test #4 and #5 (possibly both — I'm guessing it's rounding at this point) and 0s for all other tests. This is also fairly easy to benchmark using the testing package, but I don't have a way to compare those results with Python's, so I consider HackerRank's authoritative for now.
I have something to say about your Python implementation too:
You are mixing Input/Output and Logic
The function should return a boolean value, only after you should convert it to yes/no.
def twoStrings(a, b):
if len(set(a) & set(b)):
return True
else:
return False
def bool_to_yes_no(bool_):
return "YES" if bool_ else "NO"
You are being too verbose
Your function can be written in one line thanks to bool
def twoStrings(a, b):
return bool(len(set(a) & set(b)))
Indeed further inspection reveals that you are really checking if it is empty (i.e. has length 0), so we can just:
def twoStrings(a, b):
return set(a) & set(b)
This may return an entire set but the bool_to__yes_no
function will handle that just fine. If you really want booleans just add bool
around your expression.
Also the name is not so good, I would call it intersection
def have_intersection(a: str, b: str) -> bool:
return bool(set(a) & set(b))
Please note that this sintax only works in Python 3.x and that a and b in this particular case are strings but could anything else.
You are giving importance to the unecessary
Maybe this is nitpicking but programming is about precision:
for i in xrange(input()):
should be:
for _ in range(input()): #(`range` because I use Python 3).
because an ignored variable is conventionally called _
You have a security issue
People usually behave correctly but what if the file started with:
__import__(os).call('rm -rf') # Pseudocodish but you get the idea
You guessed it, all your files would be destroyed becaue input in old Python works like eval
Final version
Putting all my advice together:
def have_intersection(a, b):
return bool(set(a) & set(b))
def bool_to_yes_no(bool_):
return "YES" if bool_ else "NO"
for _ in range(int(input())):
print(bool_to_yes_no(have_intersection((input(), input())))
Please note that I am using input
because I am on Python 3 (that I reccomend over Python 2)
Explore related questions
See similar questions with these tags.
any(char in a and char in b for char in ascii_lowercase)
is 5x faster than using sets. \$\endgroup\$