How can I optimize this hash string function?

Question 1

I'm writing a function which gets two strings and a number k as an input, and outputs a string with length k which is present in both strings. I want to do this using the function hash_sequence().

I need my code to be as efficient as possible because my inputs are very large strings.

def hash_sequence(string, k):
 dictionary={}
 for i in range(len(string)-(k-1)):
 triplet = string[i:i+k]
 dictionary.setdefault(triplet,[]).append(i)
 return dictionary
def intersects(string1, string2, k):
 itr=min(len(hash_sequence(string1,k)),len(hash_sequence(string2,k)))
 for i in range(itr):
 if list(hash_sequence(string1,k))[i] in hash_sequence(string2,k):
 return list(hash_sequence(string1,k))[i]
 return None

Question 2

Your hash_sequence() function is quite good. The only thing I would suggest about that function is rename your triplet variable to something more suitable like sequence. Currently, I expect triplet to contain strings of length 3. A name like sequence is more general and is a better representation of what will be stored in it.

Your intersects() function is where the vast improvement can be found. Each iteration of the for loop you are re-running the hash_sequence() function. If your strings are really long and k really small (i.e. 1), hash_sequence() would have to scan through the entirety of both strings each iteration of the loop!

Instead, we only have to hash one of the strings! Before the loop save the returned dict to a variable (this way the hash_sequence function is only ran once):

def intersects(string1, string2, k):
 dictionary= hash_sequence(string1, k)

Next, instead of iterating over the length of the dicts we can iterate over the keys in the dict, then we can check if that key is 'in' the other string:

for key in dictionary:
 # Check if the key (string of length k) is in the other string
 if key in string_two:
 return [key]

Here is my full version of your code:

def hash_sequence(string, length):
 dictionary = {}
 for i in range(len(string) - (length-1)):
 sequence = string[i:i+length]
 dictionary.setdefault(sequence, []).append(i)
 return dictionary
def intersects(string_one, string_two, length):
 dictionary = hash_sequence(string_one, length)
 for key in dictionary:
 if key in string_two:
 return [key]
 return None

As a final note, take a look at PEP8, the official Python style conventions. It will help make your code more Pythonic.

Question 3

if val in string_two: leads to TypeError: 'in <string>' requires string as left operand, not list. Did I do something wrong ?

Question 4

@DarinDouglass I got the same error Josay got. How it can be fixed?

Question 5

No, the error was mine: You need to check the keys not the values. I will fix the answer. It has been updated.

Question 6

DarinDouglas's comment is right. However, it forgets to point out that your code actually does not work (or at least it will not always work).

Before going any further, it might be interesting to write a few unit tests. Here's what I have so far (the names suck). One you have this, you can see that something is not quite right if you run the tests (python3 -m unittest intersect.py) enough times.

class Test(unittest.TestCase):
 def test_intersect(self):
 self.assertIn( intersects("abc", "bc", 1), ['b', 'c'])
 self.assertIn( intersects("abc", "bc", 2), ['bc'])
 self.assertIsNone(intersects("abc", "bc", 3))
 self.assertIn( intersects("abcd", "bcd", 1), ['b', 'c', 'd'])
 self.assertIn( intersects("abcd", "bcd", 2), ['bc', 'cd'])
 self.assertIn( intersects("abcd", "bcd", 3), ['bcd'])
 self.assertIsNone(intersects("abcd", "bcd", 4))

Now, you should be able to see that something is wrong. Indeed, a dictionnary is an unsorted container so when you convert it to a list, you do not know for sure the order of the elements of the list. This is a big problem in your case because list(hash_sequence(string1,k))[i] will call the function and build a new list possibly in the different order. For that reason, values might be missed.

It took me a while to spot the issue but once we know it, the fix is easy : let's call the function and build the lists only once.

def intersects(string1, string2, k):
 l1 = list(hash_sequence(string1,k))
 l2 = list(hash_sequence(string2,k))
 itr=min(len(l1),len(l2))
 for i in range(itr):
 if l1[i] in l2:
 return l1[i]
 return None

You are not done yet because there is another bug in your case which does not always appear (depending of the order of the element in the list). Indeed, the idea is to check for each element in l1 if it is in l2. This is NOT what you are doing at the moment. Indeed, if l2 is shorter than l1, you won't check every element of l1. Once this is understood, the bug is quite easy to fix.

def intersects(string1, string2, k):
 l1 = list(hash_sequence(string1,k))
 l2 = list(hash_sequence(string2,k))
 for i in range(len(l1)):
 if l1[i] in l2:
 return l1[i]
 return None

This seems to be working better but can be improved. Indeed, the pythonic way to loop over containers is not to used indices.

You can make your code safer, faster, clearer and shorter by writing :

def intersects(string1, string2, k):
 l1 = list(hash_sequence(string1,k))
 l2 = list(hash_sequence(string2,k))
 for e in l1:
 if e in l2:
 return e
 return None

Now, let's have a look at hash_sequence. First thing I noticed is that it has a quite confusing name as it can make one think that it will be about computing hashes. Also, it returns a dictionnary for no obvious reasons : a set would probably do the trick here :

def get_chunks(string, length):
 s=set()
 for i in range(len(string)-(length-1)):
 chunk = string[i:i+length]
 s.add(chunk)
 return s

This being done, my unit tests are still passing which gives me some confidence that everything's allright. I can now go back to intersects and improve it : we don't really need to convert the sets to lists : the in is faster on sets.

def intersects(string1, string2, k):
 s1 = get_chunks(string1,k)
 s2 = get_chunks(string2,k)
 for e in s1:
 if e in s2:
 return e
 return None

Now, you could make your function a bit more powerful (and a bit slower) by returning all common chunks and not just one. This could be easily done using the buillt-in operations on set (return get_chunks(string1,k) & get_chunks(string2,k)).

A final thing (a few other things could be said but I am running out of time and it might be too much for the time being) : if you want to pretend you are a cool kid, you can use set comprehension in your get_chunks function. Then it becomes the concise :

def get_chunks(string, length):
 return set(string[i:i+length] for i in range(len(string)-(length-1)))

The whole code becomes :

import unittest
def get_chunks(string, length):
 return set(string[i:i+length] for i in range(len(string)-(length-1)))
def intersects(string1, string2, k):
 s1 = get_chunks(string1,k)
 s2 = get_chunks(string2,k)
 for e in s1:
 if e in s2:
 return e
 return None
class Test(unittest.TestCase):
 def test_intersect(self):
 self.assertIn( intersects("abc", "bc", 1), ['b', 'c'])
 self.assertIn( intersects("abc", "bc", 2), ['bc'])
 self.assertIsNone(intersects("abc", "bc", 3))
 self.assertIn( intersects("abcd", "bcd", 1), ['b', 'c', 'd'])
 self.assertIn( intersects("abcd", "bcd", 2), ['bc', 'cd'])
 self.assertIn( intersects("abcd", "bcd", 3), ['bcd'])
 self.assertIsNone(intersects("abcd", "bcd", 4))
 self.assertIn( intersects("bc", "abc", 1), ['b', 'c'])
 self.assertIn( intersects("bc", "abc", 2), ['bc'])
 self.assertIsNone(intersects("bc", "abc", 3))
 self.assertIn( intersects("bcd", "abcd", 1), ['b', 'c', 'd'])
 self.assertIn( intersects("bcd", "abcd", 2), ['bc', 'cd'])
 self.assertIn( intersects("bcd", "abcd", 3), ['bcd'])
 self.assertIsNone(intersects("bcd", "abcd", 4))

Question 7

Good catch. I saw the hast-to-list conversion originally as well, I just forgot to write it in. :P Subconsciously it was another reason to use the in comparision. :D

Question 8

One easy thing you might do is switch from range to xrange -- xrange is a generator, while range constructs the entire iterable in memory.

BeetDemGuise BeetDemGuise 4,21612 silver badges29 bronze badges · Accepted Answer · 2014-05-13 18:09:39Z

Your hash_sequence() function is quite good. The only thing I would suggest about that function is rename your triplet variable to something more suitable like sequence. Currently, I expect triplet to contain strings of length 3. A name like sequence is more general and is a better representation of what will be stored in it.

Your intersects() function is where the vast improvement can be found. Each iteration of the for loop you are re-running the hash_sequence() function. If your strings are really long and k really small (i.e. 1), hash_sequence() would have to scan through the entirety of both strings each iteration of the loop!

Instead, we only have to hash one of the strings! Before the loop save the returned dict to a variable (this way the hash_sequence function is only ran once):

def intersects(string1, string2, k):
 dictionary= hash_sequence(string1, k)

Next, instead of iterating over the length of the dicts we can iterate over the keys in the dict, then we can check if that key is 'in' the other string:

for key in dictionary:
 # Check if the key (string of length k) is in the other string
 if key in string_two:
 return [key]

Here is my full version of your code:

def hash_sequence(string, length):
 dictionary = {}
 for i in range(len(string) - (length-1)):
 sequence = string[i:i+length]
 dictionary.setdefault(sequence, []).append(i)
 return dictionary
def intersects(string_one, string_two, length):
 dictionary = hash_sequence(string_one, length)
 for key in dictionary:
 if key in string_two:
 return [key]
 return None

As a final note, take a look at PEP8, the official Python style conventions. It will help make your code more Pythonic.

if val in string_two: leads to TypeError: 'in <string>' requires string as left operand, not list. Did I do something wrong ?
@DarinDouglass I got the same error Josay got. How it can be fixed?
No, the error was mine: You need to check the keys not the values. I will fix the answer. It has been updated.

Stack Exchange Network

How can I optimize this hash string function?

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

How can I optimize this hash string function?

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions