3
\$\begingroup\$

I have made some improvement to Rabin Karp algorithm when the source and pattern only contain the characters from A to Z. The improvement is, I use base 26 for hashing (as there are total 26 characters). In this case, when the hashes match, there is no need to compare the actual string contents (since there are no hash collisions).

Below is my code in Python 2.7, any advice on code efficiency improvement (in terms of algorithm complexity perspective), bugs or code style is appreciated.

Source code in Python 2.7,

def rabin_karp_match(source, pattern):
 base = 26
 p_hash = 0
 s_hash = 0
 for i in pattern:
 p_hash = p_hash * base + ord(i)-ord('A')
 for j,v in enumerate(source):
 if j < len(pattern) - 1:
 s_hash = s_hash * base + ord(v)-ord('A')
 continue
 s_hash = s_hash * base + ord(v)-ord('A')
 if s_hash == p_hash:
 return j - len(pattern)+1
 s_hash = s_hash - (ord(source[j-len(pattern)+1]) - ord('A')) * base ** (len(pattern) - 1)
 return -1
if __name__ == "__main__":
 print rabin_karp_match('GACGCCA','CGC')
 print rabin_karp_match('GATACCCATCGAGTCGGATCGAGT', 'GAG')
 print rabin_karp_match('FOOBARWIDGET', 'WIDGETS')
asked Feb 2, 2017 at 6:18
\$\endgroup\$
1
  • \$\begingroup\$ Not sure if I can use base as 25 to be safe? \$\endgroup\$ Commented Feb 2, 2017 at 6:21

1 Answer 1

4
\$\begingroup\$

The code looks good, this probably improves the runtime in some cases, but I am not sure if hash matches happen often enough for this to be effective on average.

Efficiency/Logic

  • It's probably a good idea to check for the pattern size at the beginning of your method, to avoid overflows. This technically won't happen in python as it will be automatically change the storage to big-int, but the performance of that will be much worse.

  • Minor optimization for runtime is to save the value of base ** (len(pattern) - 1) before your loop. Exponentiation is an expensive function, so it's better to do it only one time.

Readability

  • Try to avoid one letter variables like i and j, except for numeric loop indices. A more expressive variable name would improve readability a lot.
  • Look for repeated parts in your code, and see if it makes sense to extract them into functions. One example in your code is calculating the letter index which is repeated multiple times.
  • Might make more sense to return None instead of -1 in case of no match.

Modified code

def letter_position(letter):
 return ord(letter) - ord('A')
def rabin_karp_match(source, pattern):
 base = 26
 p_hash = 0
 s_hash = 0
 for letter in pattern:
 p_hash = p_hash * base + letter_position(letter)
 first_letter_base = base ** (len(pattern) - 1)
 for i, letter in enumerate(source):
 s_hash = s_hash * base + letter_position(letter)
 i < len(pattern) - 1:
 continue
 
 match_start_index = i - len(pattern) + 1
 if s_hash == p_hash:
 return match_start_indexj
 s_hash = s_hash - letter_position(source[match_start_index]) * first_letter_base
 return None
answered Feb 3, 2017 at 2:07
\$\endgroup\$
6
  • \$\begingroup\$ Thanks Hesham, vote up. I think we can safely use base 25 other than 26, right? \$\endgroup\$ Commented Feb 3, 2017 at 6:15
  • 2
    \$\begingroup\$ You can't use base 25, since you have 26 different options. The possible values for any base range from 0 to (base - 1). So in your case, 'A' will be 0, and 'Z' will be 25. See en.wikipedia.org/wiki/Radix for more info. \$\endgroup\$ Commented Feb 3, 2017 at 7:24
  • \$\begingroup\$ Thanks Hesham, 1. for your comments, could you explain what means overflow by an example? Other language is fine. It's probably a good idea to check for the pattern size at the beginning of your method, to avoid overflows. Not sure if you mean if size of pattern larger than size of source, we just return immediately as invalid input. \$\endgroup\$ Commented Feb 5, 2017 at 5:30
  • \$\begingroup\$ 2. For your comments, 'Minor optimization for runtime is to save the value of base ** (len(pattern) - 1) before your loop', agree and nice point. \$\endgroup\$ Commented Feb 5, 2017 at 5:31
  • 1
    \$\begingroup\$ 3. For your comments, You can't use base 25, since you have 26 different options. , agree, nice catch. It is just like, for numbers from 0 to 9, which has 10 different values, we need to use base 10 other than base 9. Correct? \$\endgroup\$ Commented Feb 5, 2017 at 5:33

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.