This code uses the Rabin-Karp algorithm to compare two text files. It takes a few minutes when both files contain around 1000 words. How can it be improved?
class RollingHash:
    def __init__(self, text, sizeWord):
        self.text = text
        self.hash = 0
        self.sizeWord = sizeWord
        for i in range(0, sizeWord):
            # subtract out the ASCII value of "a" to start the indexing at zero
            self.hash += (ord(self.text[i]) - ord("a") + 1) * (26 ** (sizeWord - i - 1))
        # start index of current window
        self.window_start = 0
        # end index of current window
        self.window_end = sizeWord

    def move_window(self):
        if self.window_end <= len(self.text) - 1:
            self.hash -= (ord(self.text[self.window_start]) - ord("a") + 1) * 26 ** (self.sizeWord - 1)
            self.hash *= 26
            self.hash += ord(self.text[self.window_end]) - ord("a") + 1
            self.window_start += 1
            self.window_end += 1

    def window_text(self):
        return self.text[self.window_start:self.window_end]

def rabin_karp(word, text):
    if word == "" or text == "":
        return None
    if len(word) > len(text):
        return None
    rolling_hash = RollingHash(text, len(word))
    word_hash = RollingHash(word, len(word))
    word_hash.move_window()
    for i in range(len(text) - len(word) + 1):
        if rolling_hash.hash == word_hash.hash:
            if rolling_hash.window_text() == word:
                return i
        rolling_hash.move_window()
    return None
2 Answers
collisions
self.hash += (ord(self.text[i]) - ord("a") + 1) * (26 ** (sizeWord - i - 1))
Let's talk about hash functions. Most of them admit of collisions. That is, we boil a domain of N bits down to a range of M bits (\$M \ll N\$). In which case a proper text matching function would check for false positives by comparing word characters against the proposed text match, as the OP code does here:
    if rolling_hash.window_text() == word:
        return i
"takes a few minutes ..."

"time reduce from 1m51sec to 44sec. Is it possible to modify the code to reduce more ..."
Now, why would someone want to accept the (terrible?) possibility of collisions? Sounds like a stick that will whack you, no?

Well, the carrot is that using a fixed M-size hash lets us compute certain hashing operations in \$O(M)\$ time, that is, in \$O(1)\$ constant time. In contrast, the OP inner loop runs in \$O(N)\$ time, which increases with bigger inputs of N word characters, leading to \$O(N^2)\$ quadratic behavior. That is, we want the possibility of (occasional) collisions, as that lets us turn quadratic into linear behavior.
In many environments we might find that setting the parameter \$M = 64\$ bits is a popular choice, well-aligned with the underlying CPU architecture, while still admitting of collisions only seldom.
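For instance, one way to pin the hash at a fixed 64 bits is to reduce every update modulo \$2^{64}\$. This is a sketch, not the OP's code; the helper name and the masking constant are my own:

```python
# Sketch: keep a base-26 rolling hash confined to 64 bits by masking
# every intermediate result. Collisions become possible but rare.
MASK = (1 << 64) - 1  # keep only the low 64 bits

def update(hash_val, outgoing, incoming, high_power):
    """One rolling-hash step: drop the leading character's contribution,
    shift, and bring in the new character, all within 64 bits."""
    hash_val = (hash_val - outgoing * high_power) & MASK
    hash_val = (hash_val * 26 + incoming) & MASK
    return hash_val
```

Because Python ints are arbitrary precision, the `& MASK` is what actually enforces the 64-bit bound; in C the wraparound would come for free from `uint64_t`.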
BigInt
Common Lisp and other environments offer BigNums, which allocate more bits in order to reflect increasingly larger magnitudes. Unlike IEEE-754 floats, there's no approximation involved. A python int exactly implements a mathematician's notion of \$\mathbb{Z}\$. For any integer you'd care to name, python is happy to store an exact representation of it.
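A quick demonstration of that exactness:

```python
# Python ints are exact at any size; there is no overflow and no rounding.
big = 26 ** 1000                 # roughly 4700 bits, stored exactly
assert (big + 1) - big == 1      # exact arithmetic at any magnitude
assert big % 26 == 0

# Contrast with floats, which run out of integer precision at 2**53:
f = float(2 ** 53)
assert f + 1 - f == 0.0          # the +1 is silently lost to rounding
```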
modulo
The chief difficulty with the ... * (26 ** foo) expression is that it's asking python to allocate roughly an additional five bits per input character. For a word that's not very long, that might be OK. Once you encounter longish words, elapsed time will take a hit.
What you wanted was to use a modulo operation, which limits how many bits the hash consumes. Define a large PRIME constant. On each iteration, back out the start character modulo PRIME, then incorporate the end character modulo PRIME.
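Here is a sketch of how that might look. The PRIME and BASE values and the class name are my own choices, not from the original code:

```python
PRIME = (1 << 61) - 1  # a large Mersenne prime; any sufficiently large prime works
BASE = 26

class ModRollingHash:
    """Rolling hash whose value always stays below PRIME."""

    def __init__(self, text, size_word):
        self.text = text
        self.size_word = size_word
        # BASE ** (size_word - 1) mod PRIME, precomputed once
        self.high_power = pow(BASE, size_word - 1, PRIME)
        self.hash_val = 0
        for ch in text[:size_word]:
            self.hash_val = (self.hash_val * BASE + ord(ch)) % PRIME
        self.window_start = 0
        self.window_end = size_word

    def move_window(self):
        if self.window_end <= len(self.text) - 1:
            # back out the start character, then shift in the end
            # character, keeping everything modulo PRIME
            self.hash_val -= ord(self.text[self.window_start]) * self.high_power
            self.hash_val = (self.hash_val * BASE + ord(self.text[self.window_end])) % PRIME
            self.window_start += 1
            self.window_end += 1
```

Now every hash comparison is a comparison of two machine-word-sized integers, no matter how long the word is.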
97 offset
Subtracting ord("a") is weird.
Suppose that both inputs were in US-ASCII, we didn't assume "26 characters", and no offset was applied. Then your expression suggests we're trying to allocate seven bits per input character, and we shall represent the input exactly. That is, the function is a perfect hash, collision free, and we can recover the exact input text from a given "hash" value.
But we don't verify inputs are "a" .. "z".
I anticipate some inputs might include SPACE, and "A" .. "Z".
So the hash += ... is occasionally going to add a negative quantity, which makes analyzing the whole thing much messier than saying "there's seven bits per input character". Consider ensuring that we shall always add a positive quantity, perhaps by removing that bias of 97.

Consider case-smashing to .lower().

Consider defining a 27-char alphabet which includes SPACE, or perhaps a slightly larger alphabet, and then turning 26 ** ... into 27 ** ... or some larger base.
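A minimal sketch of such a mapping; the function name and the decision to map everything else to zero are assumptions of mine:

```python
# Hypothetical 27-character alphabet: "a".."z" plus SPACE,
# so every emitted code is a small non-negative integer.
BASE = 28  # strictly larger than any code below

def char_code(ch):
    """Map a character to a small non-negative integer."""
    ch = ch.lower()  # case-smash so "A".."Z" behave like "a".."z"
    if ch == " ":
        return 27
    if "a" <= ch <= "z":
        return ord(ch) - ord("a") + 1
    return 0  # remaining punctuation; raise or extend the alphabet as needed
```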
Summary:
We have added all of the numeric complexity of cramming lowercase letters into an integer, with none of the fixed-length benefits offered by hashing. The OP code's inner loop is essentially comparing word against a similarly sized string window, which is significantly more costly than comparing a pair of 64-bit integers.
DRY
This constant expression is repeated several times:
ord("a")
It can be assigned to a constant in the __init__ function:

self.OFFSET = ord("a") - 1

(Note the - 1, not + 1: the original expression computes ord(c) - ord("a") + 1, which equals ord(c) - self.OFFSET.) This simplifies the code, and it also makes it more efficient, since the ord function is only called once instead of every time through the for loop.
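Putting that together, the constructor might look like this (a sketch of the same construction as the original, with the offset hoisted out of the loop and hash renamed per the note below):

```python
class RollingHash:
    def __init__(self, text, size_word):
        # OFFSET is ord("a") - 1, so ord(c) - OFFSET reproduces the
        # original ord(c) - ord("a") + 1, computed once rather than
        # twice per character.
        self.OFFSET = ord("a") - 1
        self.text = text
        self.size_word = size_word
        self.hash_val = 0
        for i in range(size_word):
            self.hash_val += (ord(text[i]) - self.OFFSET) * 26 ** (size_word - i - 1)
        self.window_start = 0
        self.window_end = size_word
```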
In the rabin_karp function, these expressions appear several times:
len(word)
len(text)
Again, set them to variables:
def rabin_karp(word, text):
    if word == "" or text == "":
        return None
    word_len = len(word)
    text_len = len(text)
    if word_len > text_len:
        return None
Simpler
In this line:
for i in range(0, sizeWord):
the default 0 start is not needed:
for i in range(sizeWord):
Naming
The PEP 8 style guide recommends snake_case for function and variable names.
sizeWord would be size_word.

The variable named hash has the same name as a Python built-in function. This can be confusing. To eliminate the confusion, rename the variable as something like hash_val. The first clue is that "hash" has special coloring (syntax highlighting) in the question, as it does when I copy the code into my editor.
Documentation
The PEP 8 style guide recommends adding docstrings for classes and functions.
class RollingHash:
    """
    Briefly describe what rolling hash means and what the class is for
    """

For the rabin_karp function, the docstring should explain what the function does, the types of the inputs, and what it returns.
You could also use type hints for the inputs and return.
The name of the function could be improved. Currently, it is just the name of an algorithm, but it would be better if it reflected the action it is performing.
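For example, a docstring, type hints, and a more action-oriented name might look like this. This is a sketch: the name find_substring, the docstring wording, and the modular-hash body are my own illustration, not a prescribed solution:

```python
from typing import Optional

def find_substring(word: str, text: str) -> Optional[int]:
    """Return the index of the first occurrence of word in text,
    or None if word does not occur (or either input is empty).

    Uses a Rabin-Karp rolling hash, with a modular hash so the
    per-step cost stays constant.
    """
    if not word or not text or len(word) > len(text):
        return None
    base, mod = 256, (1 << 61) - 1
    k = len(word)
    high = pow(base, k - 1, mod)  # base**(k-1) mod `mod`
    w_hash = t_hash = 0
    for i in range(k):
        w_hash = (w_hash * base + ord(word[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    for i in range(len(text) - k + 1):
        # on a hash match, verify characters to rule out a collision
        if t_hash == w_hash and text[i:i + k] == word:
            return i
        if i + k < len(text):
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + k])) % mod
    return None
```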
"sizeWord". Might be a good idea to stick to camel_case. I know I'm a stickler about it, irl.