Applying a cost function to a string based on a pattern string

Question 1

I have a cost function Cost_Func(pt,STRING) that calculates value in a specific way based on the given pattern string pt and main string STRING.

A mathematical expression ( y := y-11*int(STRING[i-length]))/6+ int(STRING[i])*pow(11,length), this is the main calculation ) is used,

and the intuition is, it will be faster than KMP string matching (to compare I present here the function KMPSearch(pt,STRING)),

but to my surprise, it is not.

I thought KMP function would have to use a lot of value-comparison, while the expression y runs through the loop once.

In the hope of better performance, I used assignment expressions in list comprehension (Python 3.8), but the function is still slower than the KMP matching function. I need to be faster than KMP function, otherwise the cost function will not be accepted.

txt = "010110101010101010100000000000111111111111111111111111111111111110000"
pat = "01000000000001"
import time
def Cost_Func(pt,STRING):
 x=y=0
 h,length,cnst=[],len(pt),3
 [(x := x+pow(11*int(pt[i]),(i+1)),y := y+pow(11*int(STRING[i]),(i+1))) for i in range(length)]
 h=[((i-length+1)) for i in range(length,len(STRING)) if (y := (y-11*int(STRING[i-length]))/6+ int(STRING[i])*pow(11,length))>x/cnst]
 return h
def KMPSearch(pt, tt):
 k=[]
 M = len(pat)
 N = len(txt)
 lps = [0]*M
 j = 0
 computeLPSArray(pat, M, lps)
 i = 0 
 while i < N:
 if pat[j] == txt[i]:
 i += 1
 j += 1
 if j == M:
 k.append(i-j)
 j = lps[j-1]
 elif i < N and pat[j] != txt[i]:
 if j != 0:
 j = lps[j-1]
 else:
 i += 1
 return k
def computeLPSArray(pat, M, lps):
 len = 0 
 lps[0] 
 i = 1
 while i < M:
 if pat[i]== pat[len]:
 len += 1
 lps[i] = len
 i += 1
 else: 
 if len != 0:
 len = lps[len-1] 
 else:
 lps[i] = 0
 i += 1
k=200
t0 = time.time() 
for i in range(1,k):
 Cost_Func(pat,txt)
 t1 = time.time()
for i in range(1,k): 
 KMPSearch(pat, txt)
 t2 = time.time()
print("Cost_func:",(t1-t0)/k*10000,"KMP:",(t2-t1)/k*10000,k)

Is there a faster way to complete the Cost_Func(pt,STRING)?

For example, would library like NumPy help? Is there any other technique?

The source of the code of KMP function is taken from here.

Question 2

Is this function supposed to work only for strings made of 0s and 1s?

Question 3

@IEatBagels yes, but the coefficient might vary, it is related to a Machine learning component, the idea is, everything is expressed in binary so, if we can calculate for binary string, we are actually doing it for all type of sting.

Question 4

But, if I understood correctly, you'll have to convert your string to a binary string when you want to look for the pattern? If so, the time needed to do this would overshadow the performance gain you'd have from your algorithm, that's without considering the encoding issues when converting to binary string, no?

Question 5

@Michael First: I implore you to find better sources of learning than Geeks for Geeks and "Md. Tahmid Hossain". They contain advice ranging from good, to bad, to aggressively bad.

Question 6

Second: @200_success made a rollback with description After an answer has been posted, please do not alter the code or add rebuttals in the question. This was appropriate and your rollback of the rollback was not.

Question 7

Without saying this lightly, this code is a proper nightmare. Here is a list of things that do not make your code faster:

Combining variable assignments on one line
Writing a list comprehension used for its side-effects and then throwing away the result of the comprehension
Removing all whitespace from comprehensions
Hyper-abbreviating variable names

So don't do any of those. Add type hints, follow PEP8 for your variable and function names and whitespace, add a __main__ guard, add unit tests, expand your first comprehension to a plain loop, replace pow() with the ** operator, loop more pythonically via zip, make compute_lps_array() return instead of mutate, and you get what I consider to be a bare-minimum first refactor:

from typing import Literal, Sequence
PatternT = Sequence[Literal['0', '1']]
def cost_func(pattern: PatternT, string: PatternT) -> list[int]:
 x = y = 0
 length = len(pattern)
 const = 3
 for i, (p, s) in enumerate(zip(pattern, string), start=1):
 x += (11 * int(p)) ** i
 y += (11 * int(s)) ** i
 h = [
 (i - length + 1)
 for i in range(length, len(string))
 if (
 y := (y - 11 * int(string[i - length])) / 6 + int(string[i]) * 11**length
 ) > x / const
 ]
 return h
def kmp_search(pattern: PatternT, text: PatternT) -> list[int]:
 i = j = 0
 k = []
 M = len(pattern)
 N = len(text)
 lps = compute_lps_array(pattern, M)
 while i < N:
 if pattern[j] == text[i]:
 i += 1
 j += 1
 if j == M:
 k.append(i - j)
 j = lps[j - 1]
 elif i < N and pattern[j] != text[i]:
 if j != 0:
 j = lps[j - 1]
 else:
 i += 1
 return k
def compute_lps_array(pattern: PatternT, M: int) -> list[int]:
 lps = [0] * M
 len = 0
 i = 1
 while i < M:
 if pattern[i] == pattern[len]:
 len += 1
 lps[i] = len
 i += 1
 elif len != 0:
 len = lps[len - 1]
 else:
 lps[i] = 0
 i += 1
 return lps
def test() -> None:
 text = "010110101010101010100000000000111111111111111111111111111111111110000"
 pattern = "01000000000001"
 assert cost_func(pattern, text) == [
 1, 3, 5, 17, 18, 19,
 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
 50, 51
 ]
 assert kmp_search(pattern, text) == [17]
if __name__ == '__main__':
 test()

The KMP search doesn't lend itself well to Numpy vectorisation. If you really need this to be faster, don't use Python. If you want it to be Python-compatible, write it in C with an FFI to Python. It's probable that someone has already done this for you..

On the contrary, cost_func is definitely vectorisable. Recognise that your predicate y := ... > x/const is a recurrence relation, and one that has a pretty easy solution. In my very limited testing, this is equivalent:

def cost_func(pattern: np.ndarray, string: np.ndarray) -> np.ndarray:
 x = (11. ** (np.nonzero(pattern)[0] + 1)).sum()
 y0 = (11. ** (np.nonzero(string[:pattern.size])[0] + 1)).sum()
 addends = (
 string[pattern.size:] * 11**pattern.size -
 11/6 * string[:-pattern.size]
 )
 # Inspired by linear recurrence relation solution at
 # https://stackoverflow.com/a/30069366/313768
 pot = 6.**np.arange(string.size - pattern.size + 1)
 revpot = pot[-2::-1]
 ycomp = np.cumsum(addends/revpot)*revpot + y0/pot[1:]
 const = 3
 indices, = (ycomp > x/const).nonzero()
 return indices + 1
def string_to_array(string: Sequence[Literal['0', '1']]) -> np.ndarray:
 return np.array([
 int(c) for c in string
 ])
def test() -> None:
 pattern = string_to_array("01000000000001")
 text = string_to_array("010110101010101010100000000000111111111111111111111111111111111110000")
 assert np.all(
 cost_func(pattern, text) == [
 1, 3, 5, 17, 18, 19,
 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
 50, 51
 ]
 )
if __name__ == '__main__':
 test()

In the recurrence question I linked, there is another, simpler solution to express the recurrence relation with a single function call, scipy.signal.lfilter. I have not tested its performance, but I somewhat trust its numerical stability to be better and the results are the same:

def cost_func(pattern: np.ndarray, string: np.ndarray) -> np.ndarray:
 x = (11. ** (np.nonzero(pattern)[0] + 1)).sum()
 y0 = (11. ** (np.nonzero(string[:pattern.size])[0] + 1)).sum()
 addends = (
 string[pattern.size:] * 11**pattern.size -
 11/6 * string[:-pattern.size]
 )
 # Inspired by linear recurrence relation solution at
 # https://stackoverflow.com/a/53705423/313768
 ycomp, _ = scipy.signal.lfilter(
 b=[1, 0], a=[1, -1/6], x=addends, zi=[y0/6],
 )
 const = 3
 indices, = (ycomp > x/const).nonzero()
 return indices + 1

Also on numerical stability: your quantities are enormous, even in this small example going past 4e14. Strongly consider operating in log-domain math instead.

Question 8

If one is getting " 'type' object is not subscriptable ": stackoverflow.com/a/63460173/3989930

Question 9

Speed isn't the point of the first refactor. It's disentangling the illegible code in the original post.

Question 10

I am getting AttributeError: 'str' object has no attribute 'size' (imported numpy as np) so I replaced with sz_pt=len(pattern) and sz_tt=len(string) but I am getting string[sz_pt:] * 11**sz_pt - ....OverflowError: cannot fit 'int' into an index-sized integer, could you suggest how to solve it?

Question 11

@Michael My mistake; I forgot to show invocation. Please see the edit.

Question 12

Did you mean use of math.log(a,Base) when you said log-domain math? for example ycomp = math.log(ycomp,Base)? If I do that, will I be able to do Equality Comparison (==), for example, like ycomp == x/const ? Also, is there anything else we can do to improve (of course, your first code is faster than mine, I am exploring that, haven't seen 2nd one) besides manipulation related to vectorization?

Reinderien Reinderien 70.9k5 gold badges76 silver badges256 bronze badges · Accepted Answer · 2022-06-30 15:43:13Z

Without saying this lightly, this code is a proper nightmare. Here is a list of things that do not make your code faster:

Combining variable assignments on one line
Writing a list comprehension used for its side-effects and then throwing away the result of the comprehension
Removing all whitespace from comprehensions
Hyper-abbreviating variable names

So don't do any of those. Add type hints, follow PEP8 for your variable and function names and whitespace, add a __main__ guard, add unit tests, expand your first comprehension to a plain loop, replace pow() with the ** operator, loop more pythonically via zip, make compute_lps_array() return instead of mutate, and you get what I consider to be a bare-minimum first refactor:

from typing import Literal, Sequence
PatternT = Sequence[Literal['0', '1']]
def cost_func(pattern: PatternT, string: PatternT) -> list[int]:
 x = y = 0
 length = len(pattern)
 const = 3
 for i, (p, s) in enumerate(zip(pattern, string), start=1):
 x += (11 * int(p)) ** i
 y += (11 * int(s)) ** i
 h = [
 (i - length + 1)
 for i in range(length, len(string))
 if (
 y := (y - 11 * int(string[i - length])) / 6 + int(string[i]) * 11**length
 ) > x / const
 ]
 return h
def kmp_search(pattern: PatternT, text: PatternT) -> list[int]:
 i = j = 0
 k = []
 M = len(pattern)
 N = len(text)
 lps = compute_lps_array(pattern, M)
 while i < N:
 if pattern[j] == text[i]:
 i += 1
 j += 1
 if j == M:
 k.append(i - j)
 j = lps[j - 1]
 elif i < N and pattern[j] != text[i]:
 if j != 0:
 j = lps[j - 1]
 else:
 i += 1
 return k
def compute_lps_array(pattern: PatternT, M: int) -> list[int]:
 lps = [0] * M
 len = 0
 i = 1
 while i < M:
 if pattern[i] == pattern[len]:
 len += 1
 lps[i] = len
 i += 1
 elif len != 0:
 len = lps[len - 1]
 else:
 lps[i] = 0
 i += 1
 return lps
def test() -> None:
 text = "010110101010101010100000000000111111111111111111111111111111111110000"
 pattern = "01000000000001"
 assert cost_func(pattern, text) == [
 1, 3, 5, 17, 18, 19,
 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
 50, 51
 ]
 assert kmp_search(pattern, text) == [17]
if __name__ == '__main__':
 test()

The KMP search doesn't lend itself well to Numpy vectorisation. If you really need this to be faster, don't use Python. If you want it to be Python-compatible, write it in C with an FFI to Python. It's probable that someone has already done this for you..

On the contrary, cost_func is definitely vectorisable. Recognise that your predicate y := ... > x/const is a recurrence relation, and one that has a pretty easy solution. In my very limited testing, this is equivalent:

def cost_func(pattern: np.ndarray, string: np.ndarray) -> np.ndarray:
 x = (11. ** (np.nonzero(pattern)[0] + 1)).sum()
 y0 = (11. ** (np.nonzero(string[:pattern.size])[0] + 1)).sum()
 addends = (
 string[pattern.size:] * 11**pattern.size -
 11/6 * string[:-pattern.size]
 )
 # Inspired by linear recurrence relation solution at
 # https://stackoverflow.com/a/30069366/313768
 pot = 6.**np.arange(string.size - pattern.size + 1)
 revpot = pot[-2::-1]
 ycomp = np.cumsum(addends/revpot)*revpot + y0/pot[1:]
 const = 3
 indices, = (ycomp > x/const).nonzero()
 return indices + 1
def string_to_array(string: Sequence[Literal['0', '1']]) -> np.ndarray:
 return np.array([
 int(c) for c in string
 ])
def test() -> None:
 pattern = string_to_array("01000000000001")
 text = string_to_array("010110101010101010100000000000111111111111111111111111111111111110000")
 assert np.all(
 cost_func(pattern, text) == [
 1, 3, 5, 17, 18, 19,
 20, 21, 22, 23, 24, 25, 26, 27, 28, 29,
 30, 31, 32, 33, 34, 35, 36, 37, 38, 39,
 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,
 50, 51
 ]
 )
if __name__ == '__main__':
 test()

In the recurrence question I linked, there is another, simpler solution to express the recurrence relation with a single function call, scipy.signal.lfilter. I have not tested its performance, but I somewhat trust its numerical stability to be better and the results are the same:

def cost_func(pattern: np.ndarray, string: np.ndarray) -> np.ndarray:
 x = (11. ** (np.nonzero(pattern)[0] + 1)).sum()
 y0 = (11. ** (np.nonzero(string[:pattern.size])[0] + 1)).sum()
 addends = (
 string[pattern.size:] * 11**pattern.size -
 11/6 * string[:-pattern.size]
 )
 # Inspired by linear recurrence relation solution at
 # https://stackoverflow.com/a/53705423/313768
 ycomp, _ = scipy.signal.lfilter(
 b=[1, 0], a=[1, -1/6], x=addends, zi=[y0/6],
 )
 const = 3
 indices, = (ycomp > x/const).nonzero()
 return indices + 1

Also on numerical stability: your quantities are enormous, even in this small example going past 4e14. Strongly consider operating in log-domain math instead.

If one is getting " 'type' object is not subscriptable ": stackoverflow.com/a/63460173/3989930
Speed isn't the point of the first refactor. It's disentangling the illegible code in the original post.
I am getting AttributeError: 'str' object has no attribute 'size' (imported numpy as np) so I replaced with sz_pt=len(pattern) and sz_tt=len(string) but I am getting string[sz_pt:] * 11**sz_pt - ....OverflowError: cannot fit 'int' into an index-sized integer, could you suggest how to solve it?
@Michael My mistake; I forgot to show invocation. Please see the edit.
Did you mean use of math.log(a,Base) when you said log-domain math? for example ycomp = math.log(ycomp,Base)? If I do that, will I be able to do Equality Comparison (==), for example, like ycomp == x/const ? Also, is there anything else we can do to improve (of course, your first code is faster than mine, I am exploring that, haven't seen 2nd one) besides manipulation related to vectorization?

Stack Exchange Network

Applying a cost function to a string based on a pattern string

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Applying a cost function to a string based on a pattern string

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions