This code uses the Rabin-Karp algorithm to compare two text files. It takes a few minutes when both files contain around 1000 words. How can it be improved?
class RollingHash:
    def __init__(self, text, sizeWord):
        self.text = text
        self.hash = 0
        self.sizeWord = sizeWord
        for i in range(0, sizeWord):
            # subtract out the ASCII value of "a" to start the indexing at zero
            self.hash += (ord(self.text[i]) - ord("a") + 1) * (26 ** (sizeWord - i - 1))
        # start index of current window
        self.window_start = 0
        # end index of current window
        self.window_end = sizeWord

    def move_window(self):
        if self.window_end <= len(self.text) - 1:
            self.hash -= (ord(self.text[self.window_start]) - ord("a") + 1) * 26 ** (self.sizeWord - 1)
            self.hash *= 26
            self.hash += ord(self.text[self.window_end]) - ord("a") + 1
            self.window_start += 1
            self.window_end += 1

    def window_text(self):
        return self.text[self.window_start:self.window_end]

def rabin_karp(word, text):
    if word == "" or text == "":
        return None
    if len(word) > len(text):
        return None
    rolling_hash = RollingHash(text, len(word))
    word_hash = RollingHash(word, len(word))
    word_hash.move_window()
    for i in range(len(text) - len(word) + 1):
        if rolling_hash.hash == word_hash.hash:
            if rolling_hash.window_text() == word:
                return i
        rolling_hash.move_window()
    return None
2 Answers
collisions
self.hash += (ord(self.text[i]) - ord("a") + 1) * (26 ** (sizeWord - i - 1))
Let's talk about hash functions. Most of them admit of collisions. That is, we boil a domain of N bits down to a range of M bits (\$M \ll N\$). In which case a proper text matching function would check for false positives by comparing word characters against the proposed text match, as the OP code does here:
    if rolling_hash.window_text() == word:
        return i
"takes a few minutes ..."

"time reduce from 1m51sec to 44sec. Is it possible to modify the code to reduce more ..."
Now, why would someone want to accept the (terrible?) possibility of collisions? Sounds like a stick that will whack you, no?

Well, the carrot is that using a fixed M-size hash lets us compute certain hashing operations in \$O(M)\$ time, that is, in \$O(1)\$ constant time. In contrast, the OP inner loop runs in \$O(N)\$ time, which increases with bigger inputs of N word characters, leading to \$O(N^2)\$ quadratic behavior. That is, we want the possibility of (occasional) collisions, as that lets us turn quadratic into linear behavior.
In many environments we might find that setting the parameter \$M = 64\$ bits is a popular choice, well-aligned with the underlying CPU architecture, while still admitting of collisions only seldom.
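For instance, one way to pin the hash at a fixed 64 bits is to reduce every update modulo \$2^{64}\$. This is a sketch, not the OP's code; the helper name and the masking constant are my own:

```python
# Sketch: keep a base-26 rolling hash confined to 64 bits by masking
# every intermediate result. Collisions become possible but rare.
MASK = (1 << 64) - 1  # keep only the low 64 bits

def update(hash_val, outgoing, incoming, high_power):
    """One rolling-hash step: drop the leading character's contribution,
    shift, and bring in the new character, all within 64 bits."""
    hash_val = (hash_val - outgoing * high_power) & MASK
    hash_val = (hash_val * 26 + incoming) & MASK
    return hash_val
```

Because Python ints are arbitrary precision, the `& MASK` is what actually enforces the 64-bit bound; in C the wraparound would come for free from `uint64_t`.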
BigInt
Common Lisp and other environments offer BigNums, which allocate more bits in order to reflect increasingly larger magnitudes. Unlike IEEE-754 floats, there's no approximation involved. A python int exactly implements a mathematician's notion of \$\mathbb{Z}\$. For any integer you'd care to name, python is happy to store an exact representation of it.
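A quick demonstration of that exactness:

```python
# Python ints are exact at any size; there is no overflow and no rounding.
big = 26 ** 1000                 # roughly 4700 bits, stored exactly
assert (big + 1) - big == 1      # exact arithmetic at any magnitude
assert big % 26 == 0

# Contrast with floats, which run out of integer precision at 2**53:
f = float(2 ** 53)
assert f + 1 - f == 0.0          # the +1 is silently lost to rounding
```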
modulo
The chief difficulty with the ... * (26 ** foo) expression is that it's asking python to allocate roughly an additional five bits per input character. For a word that's not very long, that might be OK. Once you encounter longish words, elapsed time will take a hit.
What you wanted was to use a modulo operation, which limits how many bits the hash consumes. Define a large PRIME constant. On each iteration, back out the start character modulo PRIME, then incorporate the end character modulo PRIME.
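Here is a sketch of how that might look. The PRIME and BASE values and the class name are my own choices, not from the original code:

```python
PRIME = (1 << 61) - 1  # a large Mersenne prime; any sufficiently large prime works
BASE = 26

class ModRollingHash:
    """Rolling hash whose value always stays below PRIME."""

    def __init__(self, text, size_word):
        self.text = text
        self.size_word = size_word
        # BASE ** (size_word - 1) mod PRIME, precomputed once
        self.high_power = pow(BASE, size_word - 1, PRIME)
        self.hash_val = 0
        for ch in text[:size_word]:
            self.hash_val = (self.hash_val * BASE + ord(ch)) % PRIME
        self.window_start = 0
        self.window_end = size_word

    def move_window(self):
        if self.window_end <= len(self.text) - 1:
            # back out the start character, then shift in the end
            # character, keeping everything modulo PRIME
            self.hash_val -= ord(self.text[self.window_start]) * self.high_power
            self.hash_val = (self.hash_val * BASE + ord(self.text[self.window_end])) % PRIME
            self.window_start += 1
            self.window_end += 1
```

Now every hash comparison is a comparison of two machine-word-sized integers, no matter how long the word is.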
97 offset
Subtracting ord("a") is weird.
Suppose that both inputs were in US-ASCII, we didn't assume "26 characters", and no offset was applied. Then your expression suggests we're trying to allocate seven bits per input character, and we shall represent the input exactly. That is, the function is a perfect hash, collision free, and we can recover the exact input text from a given "hash" value.
But we don't verify inputs are "a" .. "z".
I anticipate some inputs might include SPACE, and "A" .. "Z".
So the hash += ... is occasionally going to add a negative quantity, which makes analyzing the whole thing much messier than saying "there's seven bits per input character". Consider ensuring that we shall always add a positive quantity, perhaps by removing that bias of 97.

Consider case-smashing to .lower().

Consider defining a 27-char alphabet which includes SPACE, or perhaps a slightly larger alphabet, and then turning 26 ** ... into 27 ** ... or some larger base.
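A minimal sketch of such a mapping; the function name and the decision to map everything else to zero are assumptions of mine:

```python
# Hypothetical 27-character alphabet: "a".."z" plus SPACE,
# so every emitted code is a small non-negative integer.
BASE = 28  # strictly larger than any code below

def char_code(ch):
    """Map a character to a small non-negative integer."""
    ch = ch.lower()  # case-smash so "A".."Z" behave like "a".."z"
    if ch == " ":
        return 27
    if "a" <= ch <= "z":
        return ord(ch) - ord("a") + 1
    return 0  # remaining punctuation; raise or extend the alphabet as needed
```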
Summary:
We have added all of the numeric complexity of cramming lowercase letters into an integer, with none of the fixed-length benefits offered by hashing. The OP code's inner loop is essentially comparing word against a similarly sized string window, which is significantly more costly than comparing a pair of 64-bit integers.
DRY
This constant expression is repeated several times:
ord("a")
It can be assigned to a constant in the __init__ function:

self.OFFSET = ord("a") - 1

(Note the - 1, not + 1: the original expression computes ord(c) - ord("a") + 1, which equals ord(c) - self.OFFSET.) This simplifies the code, and it also makes it more efficient, since the ord function is only called once instead of every time through the for loop.
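Putting that together, the constructor might look like this (a sketch of the same construction as the original, with the offset hoisted out of the loop and hash renamed per the note below):

```python
class RollingHash:
    def __init__(self, text, size_word):
        # OFFSET is ord("a") - 1, so ord(c) - OFFSET reproduces the
        # original ord(c) - ord("a") + 1, computed once rather than
        # twice per character.
        self.OFFSET = ord("a") - 1
        self.text = text
        self.size_word = size_word
        self.hash_val = 0
        for i in range(size_word):
            self.hash_val += (ord(text[i]) - self.OFFSET) * 26 ** (size_word - i - 1)
        self.window_start = 0
        self.window_end = size_word
```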
In the rabin_karp function, these expressions appear several times:
len(word)
len(text)
Again, set them to variables:
def rabin_karp(word, text):
    if word == "" or text == "":
        return None
    word_len = len(word)
    text_len = len(text)
    if word_len > text_len:
        return None
Simpler
In this line:
for i in range(0, sizeWord):
the default 0 start is not needed:
for i in range(sizeWord):
Naming
The PEP 8 style guide recommends snake_case for function and variable names.
sizeWord would be size_word.

The variable named hash has the same name as a Python built-in function. This can be confusing. To eliminate the confusion, rename the variable as something like hash_val. The first clue is that "hash" has special coloring (syntax highlighting) in the question, as it does when I copy the code into my editor.
Documentation
The PEP 8 style guide recommends adding docstrings for classes and functions.
class RollingHash:
    """
    Briefly describe what rolling hash means and what the class is for
    """

For the rabin_karp function, the docstring should explain what the function does, the types of the inputs, and what it returns.
You could also use type hints for the inputs and return.
The name of the function could be improved. Currently, it is just the name of an algorithm, but it would be better if it reflected the action it is performing.
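For example, a docstring, type hints, and a more action-oriented name might look like this. This is a sketch: the name find_substring, the docstring wording, and the modular-hash body are my own illustration, not a prescribed solution:

```python
from typing import Optional

def find_substring(word: str, text: str) -> Optional[int]:
    """Return the index of the first occurrence of word in text,
    or None if word does not occur (or either input is empty).

    Uses a Rabin-Karp rolling hash, with a modular hash so the
    per-step cost stays constant.
    """
    if not word or not text or len(word) > len(text):
        return None
    base, mod = 256, (1 << 61) - 1
    k = len(word)
    high = pow(base, k - 1, mod)  # base**(k-1) mod `mod`
    w_hash = t_hash = 0
    for i in range(k):
        w_hash = (w_hash * base + ord(word[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    for i in range(len(text) - k + 1):
        # on a hash match, verify characters to rule out a collision
        if t_hash == w_hash and text[i:i + k] == word:
            return i
        if i + k < len(text):
            t_hash = ((t_hash - ord(text[i]) * high) * base
                      + ord(text[i + k])) % mod
    return None
```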
"sizeWord". Might be a good idea to stick to camel_case. I know I'm a stickler about it, irl.