I'm writing a function which gets two strings and a number k
as an input, and outputs a string with length k
which is present in both strings. I want to do this using the function hash_sequence()
.
I need my code to be as efficient as possible because my inputs are very large strings.
def hash_sequence(string, k):
dictionary={}
for i in range(len(string)-(k-1)):
triplet = string[i:i+k]
dictionary.setdefault(triplet,[]).append(i)
return dictionary
def intersects(string1, string2, k):
itr=min(len(hash_sequence(string1,k)),len(hash_sequence(string2,k)))
for i in range(itr):
if list(hash_sequence(string1,k))[i] in hash_sequence(string2,k):
return list(hash_sequence(string1,k))[i]
return None
3 Answers 3
Your hash_sequence()
function is quite good. The only thing I would suggest about that function is rename your triplet
variable to something more suitable like sequence
. Currently, I expect triplet
to contain strings of length 3
. A name like sequence
is more general and is a better representation of what will be stored in it.
Your intersects()
function is where the vast improvement can be found. Each iteration of the for loop you are re-running the hash_sequence()
function. If your strings are really long and k
really small (i.e. 1), hash_sequence()
would have to scan through the entirety of both strings each iteration of the loop!
Instead, we only have to hash one of the strings! Before the loop save the returned dict
to a variable (this way the hash_sequence
function is only ran once):
def intersects(string1, string2, k):
dictionary= hash_sequence(string1, k)
Next, instead of iterating over the length of the dicts
we can iterate over the keys in the dict
, then we can check if that key is 'in' the other string:
for key in dictionary:
# Check if the key (string of length k) is in the other string
if key in string_two:
return [key]
Here is my full version of your code:
def hash_sequence(string, length):
dictionary = {}
for i in range(len(string) - (length-1)):
sequence = string[i:i+length]
dictionary.setdefault(sequence, []).append(i)
return dictionary
def intersects(string_one, string_two, length):
dictionary = hash_sequence(string_one, length)
for key in dictionary:
if key in string_two:
return [key]
return None
As a final note, take a look at PEP8, the official Python style conventions. It will help make your code more Pythonic.
-
\$\begingroup\$
if val in string_two:
leads toTypeError: 'in <string>' requires string as left operand, not list
. Did I do something wrong ? \$\endgroup\$SylvainD– SylvainD2014年05月13日 18:11:21 +00:00Commented May 13, 2014 at 18:11 -
\$\begingroup\$ @DarinDouglass I got the same error Josay got. How it can be fixed? \$\endgroup\$user3369309– user33693092014年05月13日 18:24:36 +00:00Commented May 13, 2014 at 18:24
-
\$\begingroup\$ No, the error was mine: You need to check the keys not the values. I will fix the answer. It has been updated. \$\endgroup\$BeetDemGuise– BeetDemGuise2014年05月13日 18:26:28 +00:00Commented May 13, 2014 at 18:26
DarinDouglas's comment is right. However, it forgets to point out that your code actually does not work (or at least it will not always work).
Before going any further, it might be interesting to write a few unit tests. Here's what I have so far (the names suck). One you have this, you can see that something is not quite right if you run the tests (python3 -m unittest intersect.py
) enough times.
class Test(unittest.TestCase):
def test_intersect(self):
self.assertIn( intersects("abc", "bc", 1), ['b', 'c'])
self.assertIn( intersects("abc", "bc", 2), ['bc'])
self.assertIsNone(intersects("abc", "bc", 3))
self.assertIn( intersects("abcd", "bcd", 1), ['b', 'c', 'd'])
self.assertIn( intersects("abcd", "bcd", 2), ['bc', 'cd'])
self.assertIn( intersects("abcd", "bcd", 3), ['bcd'])
self.assertIsNone(intersects("abcd", "bcd", 4))
Now, you should be able to see that something is wrong. Indeed, a dictionnary is an unsorted container so when you convert it to a list, you do not know for sure the order of the elements of the list. This is a big problem in your case because list(hash_sequence(string1,k))[i]
will call the function and build a new list possibly in the different order. For that reason, values might be missed.
It took me a while to spot the issue but once we know it, the fix is easy : let's call the function and build the lists only once.
def intersects(string1, string2, k):
l1 = list(hash_sequence(string1,k))
l2 = list(hash_sequence(string2,k))
itr=min(len(l1),len(l2))
for i in range(itr):
if l1[i] in l2:
return l1[i]
return None
You are not done yet because there is another bug in your case which does not always appear (depending of the order of the element in the list). Indeed, the idea is to check for each element in l1
if it is in l2
. This is NOT what you are doing at the moment. Indeed, if l2
is shorter than l1
, you won't check every element of l1
. Once this is understood, the bug is quite easy to fix.
def intersects(string1, string2, k):
l1 = list(hash_sequence(string1,k))
l2 = list(hash_sequence(string2,k))
for i in range(len(l1)):
if l1[i] in l2:
return l1[i]
return None
This seems to be working better but can be improved. Indeed, the pythonic way to loop over containers is not to used indices.
You can make your code safer, faster, clearer and shorter by writing :
def intersects(string1, string2, k):
l1 = list(hash_sequence(string1,k))
l2 = list(hash_sequence(string2,k))
for e in l1:
if e in l2:
return e
return None
Now, let's have a look at hash_sequence
. First thing I noticed is that it has a quite confusing name as it can make one think that it will be about computing hashes. Also, it returns a dictionnary for no obvious reasons : a set would probably do the trick here :
def get_chunks(string, length):
s=set()
for i in range(len(string)-(length-1)):
chunk = string[i:i+length]
s.add(chunk)
return s
This being done, my unit tests are still passing which gives me some confidence that everything's allright. I can now go back to intersects
and improve it : we don't really need to convert the sets to lists : the in
is faster on sets.
def intersects(string1, string2, k):
s1 = get_chunks(string1,k)
s2 = get_chunks(string2,k)
for e in s1:
if e in s2:
return e
return None
Now, you could make your function a bit more powerful (and a bit slower) by returning all common chunks and not just one. This could be easily done using the buillt-in operations on set (return get_chunks(string1,k) & get_chunks(string2,k)
).
A final thing (a few other things could be said but I am running out of time and it might be too much for the time being) : if you want to pretend you are a cool kid, you can use set comprehension in your get_chunks
function. Then it becomes the concise :
def get_chunks(string, length):
return set(string[i:i+length] for i in range(len(string)-(length-1)))
The whole code becomes :
import unittest
def get_chunks(string, length):
return set(string[i:i+length] for i in range(len(string)-(length-1)))
def intersects(string1, string2, k):
s1 = get_chunks(string1,k)
s2 = get_chunks(string2,k)
for e in s1:
if e in s2:
return e
return None
class Test(unittest.TestCase):
def test_intersect(self):
self.assertIn( intersects("abc", "bc", 1), ['b', 'c'])
self.assertIn( intersects("abc", "bc", 2), ['bc'])
self.assertIsNone(intersects("abc", "bc", 3))
self.assertIn( intersects("abcd", "bcd", 1), ['b', 'c', 'd'])
self.assertIn( intersects("abcd", "bcd", 2), ['bc', 'cd'])
self.assertIn( intersects("abcd", "bcd", 3), ['bcd'])
self.assertIsNone(intersects("abcd", "bcd", 4))
self.assertIn( intersects("bc", "abc", 1), ['b', 'c'])
self.assertIn( intersects("bc", "abc", 2), ['bc'])
self.assertIsNone(intersects("bc", "abc", 3))
self.assertIn( intersects("bcd", "abcd", 1), ['b', 'c', 'd'])
self.assertIn( intersects("bcd", "abcd", 2), ['bc', 'cd'])
self.assertIn( intersects("bcd", "abcd", 3), ['bcd'])
self.assertIsNone(intersects("bcd", "abcd", 4))
-
\$\begingroup\$ Good catch. I saw the hast-to-list conversion originally as well, I just forgot to write it in. :P Subconsciously it was another reason to use the
in
comparision. :D \$\endgroup\$BeetDemGuise– BeetDemGuise2014年05月13日 19:10:35 +00:00Commented May 13, 2014 at 19:10
One easy thing you might do is switch from range
to xrange
-- xrange
is a generator, while range
constructs the entire iterable in memory.