3
\$\begingroup\$

I have a cost function Cost_Func(pt,STRING) that calculates value in a specific way based on the given pattern string pt and main string STRING.

A mathematical expression ( y := y-11*int(STRING[i-length]))/6+ int(STRING[i])*pow(11,length), this is the main calculation ) is used,

and the intuition is, it will be faster than KMP string matching (to compare I present here the function KMPSearch(pt,STRING)),

but to my surprise, it is not.

I thought KMP function would have to use a lot of value-comparison, while the expression y runs through the loop once.

In the hope of better performance, I used assignment expressions in list comprehension (Python 3.8), but the function is still slower than the KMP matching function. I need to be faster than KMP function, otherwise the cost function will not be accepted.

txt = "010110101010101010100000000000111111111111111111111111111111111110000"
pat = "01000000000001"
import time
def Cost_Func(pt,STRING):
 x=y=0
 h,length,cnst=[],len(pt),3
 [(x := x+pow(11*int(pt[i]),(i+1)),y := y+pow(11*int(STRING[i]),(i+1))) for i in range(length)]
 h=[((i-length+1)) for i in range(length,len(STRING)) if (y := (y-11*int(STRING[i-length]))/6+ int(STRING[i])*pow(11,length))>x/cnst]
 return h
def KMPSearch(pt, tt):
 k=[]
 M = len(pat)
 N = len(txt)
 lps = [0]*M
 j = 0
 computeLPSArray(pat, M, lps)
 i = 0 
 while i < N:
 if pat[j] == txt[i]:
 i += 1
 j += 1
 if j == M:
 k.append(i-j)
 j = lps[j-1]
 elif i < N and pat[j] != txt[i]:
 if j != 0:
 j = lps[j-1]
 else:
 i += 1
 return k
def computeLPSArray(pat, M, lps):
 len = 0 
 lps[0] 
 i = 1
 while i < M:
 if pat[i]== pat[len]:
 len += 1
 lps[i] = len
 i += 1
 else: 
 if len != 0:
 len = lps[len-1] 
 else:
 lps[i] = 0
 i += 1
k=200
t0 = time.time() 
for i in range(1,k):
 Cost_Func(pat,txt)
 t1 = time.time()
for i in range(1,k): 
 KMPSearch(pat, txt)
 t2 = time.time()
print("Cost_func:",(t1-t0)/k*10000,"KMP:",(t2-t1)/k*10000,k)

Is there a faster way to complete the Cost_Func(pt,STRING)?

For example, would library like NumPy help? Is there any other technique?


The source of the code of KMP function is taken from here.

Reinderien
70.9k5 gold badges76 silver badges256 bronze badges
asked Jun 30, 2022 at 13:17
\$\endgroup\$
5
  • \$\begingroup\$ Is this function supposed to work only for strings made of 0s and 1s? \$\endgroup\$ Commented Jun 30, 2022 at 19:40
  • \$\begingroup\$ @IEatBagels yes, but the coefficient might vary, it is related to a Machine learning component, the idea is, everything is expressed in binary so, if we can calculate for binary string, we are actually doing it for all type of sting. \$\endgroup\$ Commented Jun 30, 2022 at 20:11
  • \$\begingroup\$ But, if I understood correctly, you'll have to convert your string to a binary string when you want to look for the pattern? If so, the time needed to do this would overshadow the performance gain you'd have from your algorithm, that's without considering the encoding issues when converting to binary string, no? \$\endgroup\$ Commented Jun 30, 2022 at 20:14
  • \$\begingroup\$ @Michael First: I implore you to find better sources of learning than Geeks for Geeks and "Md. Tahmid Hossain". They contain advice ranging from good, to bad, to aggressively bad. \$\endgroup\$ Commented Jul 3, 2022 at 12:32
  • 1
    \$\begingroup\$ Second: @200_success made a rollback with description After an answer has been posted, please do not alter the code or add rebuttals in the question. This was appropriate and your rollback of the rollback was not. \$\endgroup\$ Commented Jul 3, 2022 at 12:33

1 Answer 1

6
\$\begingroup\$

Without saying this lightly, this code is a proper nightmare. Here is a list of things that do not make your code faster:

  • Combining variable assignments on one line
  • Writing a list comprehension used for its side-effects and then throwing away the result of the comprehension
  • Removing all whitespace from comprehensions
  • Hyper-abbreviating variable names

So don't do any of those. Add type hints, follow PEP8 for your variable and function names and whitespace, add a __main__ guard, add unit tests, expand your first comprehension to a plain loop, replace pow() with the ** operator, loop more pythonically via zip, make compute_lps_array() return instead of mutate, and you get what I consider to be a bare-minimum first refactor:

from typing import Literal, Sequence
PatternT = Sequence[Literal['0', '1']]
def cost_func(pattern: PatternT, string: PatternT) -> list[int]:
 x = y = 0
 length = len(pattern)
 const = 3
 for i, (p, s) in enumerate(zip(pattern, string), start=1):
 x += (11 * int(p)) ** i
 y += (11 * int(s)) ** i
 h = [
 (i - length + 1)
 for i in range(length, len(string))
 if (
 y := (y - 11 * int(string[i - length])) / 6 + int(string[i]) * 11**length
 ) > x / const
 ]
 return h
def kmp_search(pattern: PatternT, text: PatternT) -> list[int]:
 i = j = 0
 k = []
 M = len(pattern)
 N = len(text)
 lps = compute_lps_array(pattern, M)
 while i < N:
 if pattern[j] == text[i]:
 i += 1
 j += 1
 if j == M:
 k.append(i - j)
 j = lps[j - 1]
 elif i < N and pattern[j] != text[i]:
 if j != 0:
 j = lps[j - 1]
 else:
 i += 1
 return k
def compute_lps_array(pattern: PatternT, M: int) -> list[int]:
 lps = [0] * M
 len = 0
 i = 1
 while i < M:
 if pattern[i] == pattern[len]:
 len += 1
 lps[i] = len
 i += 1
 elif len != 0:
 len = lps[len - 1]
 else:
 lps[i] = 0
 i += 1
 return lps
def test() -> None:
 text = "010110101010101010100000000000111111111111111111111111111111111110000"
 pattern = "01000000000001"
 assert cost_func(pattern, text) == [
 1, 3, 5, 17, 18, 19,
 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
 50, 51
 ]
 assert kmp_search(pattern, text) == [17]
if __name__ == '__main__':
 test()

The KMP search doesn't lend itself well to Numpy vectorisation. If you really need this to be faster, don't use Python. If you want it to be Python-compatible, write it in C with an FFI to Python. It's probable that someone has already done this for you..

On the contrary, cost_func is definitely vectorisable. Recognise that your predicate y := ... > x/const is a recurrence relation, and one that has a pretty easy solution. In my very limited testing, this is equivalent:

def cost_func(pattern: np.ndarray, string: np.ndarray) -> np.ndarray:
 x = (11. ** (np.nonzero(pattern)[0] + 1)).sum()
 y0 = (11. ** (np.nonzero(string[:pattern.size])[0] + 1)).sum()
 addends = (
 string[pattern.size:] * 11**pattern.size -
 11/6 * string[:-pattern.size]
 )
 # Inspired by linear recurrence relation solution at
 # https://stackoverflow.com/a/30069366/313768
 pot = 6.**np.arange(string.size - pattern.size + 1)
 revpot = pot[-2::-1]
 ycomp = np.cumsum(addends/revpot)*revpot + y0/pot[1:]
 const = 3
 indices, = (ycomp > x/const).nonzero()
 return indices + 1
def string_to_array(string: Sequence[Literal['0', '1']]) -> np.ndarray:
 return np.array([
 int(c) for c in string
 ])
def test() -> None:
 pattern = string_to_array("01000000000001")
 text = string_to_array("010110101010101010100000000000111111111111111111111111111111111110000")
 assert np.all(
 cost_func(pattern, text) == [
 1, 3, 5, 17, 18, 19,
 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
 50, 51
 ]
 )
if __name__ == '__main__':
 test()

In the recurrence question I linked, there is another, simpler solution to express the recurrence relation with a single function call, scipy.signal.lfilter. I have not tested its performance, but I somewhat trust its numerical stability to be better and the results are the same:

def cost_func(pattern: np.ndarray, string: np.ndarray) -> np.ndarray:
 x = (11. ** (np.nonzero(pattern)[0] + 1)).sum()
 y0 = (11. ** (np.nonzero(string[:pattern.size])[0] + 1)).sum()
 addends = (
 string[pattern.size:] * 11**pattern.size -
 11/6 * string[:-pattern.size]
 )
 # Inspired by linear recurrence relation solution at
 # https://stackoverflow.com/a/53705423/313768
 ycomp, _ = scipy.signal.lfilter(
 b=[1, 0], a=[1, -1/6], x=addends, zi=[y0/6],
 )
 const = 3
 indices, = (ycomp > x/const).nonzero()
 return indices + 1

Also on numerical stability: your quantities are enormous, even in this small example going past 4e14. Strongly consider operating in log-domain math instead.

answered Jun 30, 2022 at 15:43
\$\endgroup\$
7
  • \$\begingroup\$ If one is getting " 'type' object is not subscriptable ": stackoverflow.com/a/63460173/3989930 \$\endgroup\$ Commented Jun 30, 2022 at 17:09
  • 1
    \$\begingroup\$ Speed isn't the point of the first refactor. It's disentangling the illegible code in the original post. \$\endgroup\$ Commented Jun 30, 2022 at 18:20
  • \$\begingroup\$ I am getting AttributeError: 'str' object has no attribute 'size' (imported numpy as np) so I replaced with sz_pt=len(pattern) and sz_tt=len(string) but I am getting string[sz_pt:] * 11**sz_pt - ....OverflowError: cannot fit 'int' into an index-sized integer, could you suggest how to solve it? \$\endgroup\$ Commented Jun 30, 2022 at 21:16
  • \$\begingroup\$ @Michael My mistake; I forgot to show invocation. Please see the edit. \$\endgroup\$ Commented Jun 30, 2022 at 21:49
  • \$\begingroup\$ Did you mean use of math.log(a,Base) when you said log-domain math? for example ycomp = math.log(ycomp,Base)? If I do that, will I be able to do Equality Comparison (==), for example, like ycomp == x/const ? Also, is there anything else we can do to improve (of course, your first code is faster than mine, I am exploring that, haven't seen 2nd one) besides manipulation related to vectorization? \$\endgroup\$ Commented Jul 2, 2022 at 14:37

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.