Python 3: Finding common patterns in pairs of strings

Question 1

I have written a piece of code that finds common patterns in two strings. These patterns have to be in the same order, so for example "I am a person" and "A person I am" would only match "person". The code is crude, all characters, including whitespace and punctuation marks, receive the same treatment. The longest patterns are matched first.

The main function then returns two lists (one per string) of tuples with two elements. The first element is 1 if a substring has been matched, else 0. The second element is the substring.

So the format of returned lists will be like this:

[(1, 'I '), (0, 'am a person\n')]
[(1, 'I '), (0, 'can see\n')]

Now to the question -- what do you think about my code? I somehow feel that it's not quite top-notch, and I'm not an experienced coder. Any suggestions on coding style or the algorithms? I hope the code is reasonably clear.

The find_raw_patterns function returns lists with alternating integers and strings, so the only thing find_common_patterns does in addition to calling find_raw_patterns, is to arrange the lists' elements in two-element-tuples.

The function longest_common_substring is copied directly from Wikibooks, and I'm not very concerned about that function.

Code:

def longest_common_substring(S1, S2):
 M = [[0]*(1+len(S2)) for i in range(1+len(S1))]
 longest, x_longest = 0, 0
 for x in range(1,1+len(S1)):
 for y in range(1,1+len(S2)):
 if S1[x-1] == S2[y-1]:
 M[x][y] = M[x-1][y-1] + 1
 if M[x][y]>longest:
 longest = M[x][y]
 x_longest = x
 else:
 M[x][y] = 0
 return S1[x_longest-longest: x_longest]
def find_common_patterns(s1, s2):
 arranged1 = []
 arranged2 = []
 (ptr1, ptr2) = find_raw_patterns(s1, s2) #ptr - pattern
 for i in range(len(ptr1) - 1):
 if type(ptr1[i]) == int:
 arranged1.append((ptr1[i], ptr1[i+1]))
 for i in range(len(ptr2) - 1):
 if type(ptr2[i]) == int:
 arranged2.append((ptr2[i], ptr2[i+1]))
 return (arranged1, arranged2)
def find_raw_patterns(s1, s2): # used recursively
 one = [] # used to reassemble strings, but with patterns and integer showing whether it's been matched or not
 two = [] # same, but for the second string
 com = longest_common_substring(s1, s2)
 if len(com) < 2:
 return ((0, s1), (0, s2))
 elif len(com) >= 2:
 i1 = s1.index(com)
 i2 = s2.index(com)
 s1_bef = s1[:i1] #part of string before the matched pattern
 s1_aft = s1[i1 + len(com) : ] # -//- after matched pattern
 s2_bef = s2[:i2]
 s2_aft = s2[i2 + len(com) : ]
 if len(s1_bef) > 0 and len(s2_bef) > 0: # find patterns in first parts of strings
 res = find_raw_patterns(s1_bef, s2_bef)
 one.extend(res[0])
 two.extend(res[1])
 one.extend((1, com)) # add current pattern
 two.extend((1, com))
 if len(s1_aft) > 0 and len(s2_aft) > 0: # find patterns from second parts
 res = find_raw_patterns(s1_aft, s2_aft)
 one.extend(res[0])
 two.extend(res[1])
 return (one, two)

Question 2

stdlib solution: if a, b = "same order words", "not same but order words matched" then [a[i:i+n] for i, _, n in difflib.SequenceMatcher(None, a, b).get_matching_blocks() if n] returns ['same', ' order words']. It works with arbitrary hashable sequences, not just strings.

Question 3

You can simplify the two functions quite a bit. For find_raw_patterns you can use partition instead of manually splitting the strings up, and simplify the formation of the list.

def find_raw_patterns(s1, s2): # used recursively
 if s1 == '' or s2 == '':
 return [], []
 com = longest_common_substring(s1, s2)
 if len(com) < 2:
 return ([0, s1], [0, s2])
 s1_bef, _, s1_aft = s1.partition(com)
 s2_bef, _, s2_aft = s2.partition(com)
 before = find_raw_patterns(s1_bef, s2_bef)
 after = find_raw_patterns(s1_aft, s2_aft)
 return (before[0] + [1, com] + after[0],
 before[1] + [1, com] + after[1])

For find_common_patterns, you can use list indices to get the even and odd values rather than checking the type of each value, and use zip to get a list of pair tuples.

def find_common_patterns(s1, s2):
 (ptr1, ptr2) = find_raw_patterns(s1, s2) #ptr - pattern
 return (zip(ptr1[::2], ptr1[1::2]),
 zip(ptr2[::2], ptr2[1::2]))

But I think you can also combine the two functions with a minor adjustment to put the found values into paired tuples.

def find_common_patterns(s1, s2): # used recursively
 if s1 == '' or s2 == '':
 return [], []
 com = longest_common_substring(s1, s2)
 if len(com) < 2:
 return ([(0, s1)], [(0, s2)])
 s1_bef, _, s1_aft = s1.partition(com)
 s2_bef, _, s2_aft = s2.partition(com)
 before = find_common_patterns(s1_bef, s2_bef)
 after = find_common_patterns(s1_aft, s2_aft)
 return (before[0] + [(1, com)] + after[0],
 before[1] + [(1, com)] + after[1])

Question 4

Thank you! That makes the code a lot more compact and probably makes it easier to understand, too. There's a problem with the: if s1 == '' or s2 == '': return [], [] It makes the result break off the other string when one string is "finished". Like this:

>>> find_raw_patterns('aoesunthaoesnutheoau', 'oaeurolqsnjthkoaeun') ([0, 'aoes', 1, 'un'], [0, 'oaeurolqsnjthkoae', 1, 'un'])

I just removed that if clause and it solves the problem, just have to deal with possible empty strings in the list but that may be a fair trade-off.

Question 5

Okay, glad it helped. You may want to try something like if s1 == '' and not s2 == '': return [], [(0, s2)], etc.

Question 6

I'm a bit ashamed I didn't think of that myself, haha. The following worked fine: if s1 == '' or s2 == '': return ([(0, s1)] if s1 != '' else []), ([(0, s2)] if s2 != '' else [])

Stuart Stuart 2,86014 silver badges20 bronze badges · Accepted Answer · 2013-02-10 09:47:12Z

You can simplify the two functions quite a bit. For find_raw_patterns you can use partition instead of manually splitting the strings up, and simplify the formation of the list.

def find_raw_patterns(s1, s2): # used recursively
 if s1 == '' or s2 == '':
 return [], []
 com = longest_common_substring(s1, s2)
 if len(com) < 2:
 return ([0, s1], [0, s2])
 s1_bef, _, s1_aft = s1.partition(com)
 s2_bef, _, s2_aft = s2.partition(com)
 before = find_raw_patterns(s1_bef, s2_bef)
 after = find_raw_patterns(s1_aft, s2_aft)
 return (before[0] + [1, com] + after[0],
 before[1] + [1, com] + after[1])

For find_common_patterns, you can use list indices to get the even and odd values rather than checking the type of each value, and use zip to get a list of pair tuples.

def find_common_patterns(s1, s2):
 (ptr1, ptr2) = find_raw_patterns(s1, s2) #ptr - pattern
 return (zip(ptr1[::2], ptr1[1::2]),
 zip(ptr2[::2], ptr2[1::2]))

But I think you can also combine the two functions with a minor adjustment to put the found values into paired tuples.

def find_common_patterns(s1, s2): # used recursively
 if s1 == '' or s2 == '':
 return [], []
 com = longest_common_substring(s1, s2)
 if len(com) < 2:
 return ([(0, s1)], [(0, s2)])
 s1_bef, _, s1_aft = s1.partition(com)
 s2_bef, _, s2_aft = s2.partition(com)
 before = find_common_patterns(s1_bef, s2_bef)
 after = find_common_patterns(s1_aft, s2_aft)
 return (before[0] + [(1, com)] + after[0],
 before[1] + [(1, com)] + after[1])

Thank you! That makes the code a lot more compact and probably makes it easier to understand, too. There's a problem with the: if s1 == '' or s2 == '': return [], [] It makes the result break off the other string when one string is "finished". Like this: >>> find_raw_patterns('aoesunthaoesnutheoau', 'oaeurolqsnjthkoaeun') ([0, 'aoes', 1, 'un'], [0, 'oaeurolqsnjthkoae', 1, 'un']) I just removed that if clause and it solves the problem, just have to deal with possible empty strings in the list but that may be a fair trade-off.
Okay, glad it helped. You may want to try something like if s1 == '' and not s2 == '': return [], [(0, s2)], etc.
I'm a bit ashamed I didn't think of that myself, haha. The following worked fine: if s1 == '' or s2 == '': return ([(0, s1)] if s1 != '' else []), ([(0, s2)] if s2 != '' else [])

Stack Exchange Network

Python 3: Finding common patterns in pairs of strings

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Python 3: Finding common patterns in pairs of strings

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions