4
\$\begingroup\$

I have written a piece of code that finds common patterns in two strings. These patterns have to be in the same order, so for example "I am a person" and "A person I am" would only match "person". The code is crude, all characters, including whitespace and punctuation marks, receive the same treatment. The longest patterns are matched first.

The main function then returns two lists (one per string) of tuples with two elements. The first element is 1 if a substring has been matched, else 0. The second element is the substring.

So the format of returned lists will be like this:

[(1, 'I '), (0, 'am a person\n')]
[(1, 'I '), (0, 'can see\n')]

Now to the question -- what do you think about my code? I somehow feel that it's not quite top-notch, and I'm not an experienced coder. Any suggestions on coding style or the algorithms? I hope the code is reasonably clear.

The find_raw_patterns function returns lists with alternating integers and strings, so the only thing find_common_patterns does in addition to calling find_raw_patterns, is to arrange the lists' elements in two-element-tuples.

The function longest_common_substring is copied directly from Wikibooks, and I'm not very concerned about that function.

Code:

def longest_common_substring(S1, S2):
 M = [[0]*(1+len(S2)) for i in range(1+len(S1))]
 longest, x_longest = 0, 0
 for x in range(1,1+len(S1)):
 for y in range(1,1+len(S2)):
 if S1[x-1] == S2[y-1]:
 M[x][y] = M[x-1][y-1] + 1
 if M[x][y]>longest:
 longest = M[x][y]
 x_longest = x
 else:
 M[x][y] = 0
 return S1[x_longest-longest: x_longest]
def find_common_patterns(s1, s2):
 arranged1 = []
 arranged2 = []
 (ptr1, ptr2) = find_raw_patterns(s1, s2) #ptr - pattern
 for i in range(len(ptr1) - 1):
 if type(ptr1[i]) == int:
 arranged1.append((ptr1[i], ptr1[i+1]))
 for i in range(len(ptr2) - 1):
 if type(ptr2[i]) == int:
 arranged2.append((ptr2[i], ptr2[i+1]))
 return (arranged1, arranged2)
def find_raw_patterns(s1, s2): # used recursively
 one = [] # used to reassemble strings, but with patterns and integer showing whether it's been matched or not
 two = [] # same, but for the second string
 com = longest_common_substring(s1, s2)
 if len(com) < 2:
 return ((0, s1), (0, s2))
 elif len(com) >= 2:
 i1 = s1.index(com)
 i2 = s2.index(com)
 s1_bef = s1[:i1] #part of string before the matched pattern
 s1_aft = s1[i1 + len(com) : ] # -//- after matched pattern
 s2_bef = s2[:i2]
 s2_aft = s2[i2 + len(com) : ]
 if len(s1_bef) > 0 and len(s2_bef) > 0: # find patterns in first parts of strings
 res = find_raw_patterns(s1_bef, s2_bef)
 one.extend(res[0])
 two.extend(res[1])
 one.extend((1, com)) # add current pattern
 two.extend((1, com))
 if len(s1_aft) > 0 and len(s2_aft) > 0: # find patterns from second parts
 res = find_raw_patterns(s1_aft, s2_aft)
 one.extend(res[0])
 two.extend(res[1])
 return (one, two)
asked Feb 10, 2013 at 6:16
\$\endgroup\$
1
  • 1
    \$\begingroup\$ stdlib solution: if a, b = "same order words", "not same but order words matched" then [a[i:i+n] for i, _, n in difflib.SequenceMatcher(None, a, b).get_matching_blocks() if n] returns ['same', ' order words']. It works with arbitrary hashable sequences, not just strings. \$\endgroup\$ Commented Feb 26, 2014 at 21:33

1 Answer 1

3
\$\begingroup\$

You can simplify the two functions quite a bit. For find_raw_patterns you can use partition instead of manually splitting the strings up, and simplify the formation of the list.

def find_raw_patterns(s1, s2): # used recursively
 if s1 == '' or s2 == '':
 return [], []
 com = longest_common_substring(s1, s2)
 if len(com) < 2:
 return ([0, s1], [0, s2])
 s1_bef, _, s1_aft = s1.partition(com)
 s2_bef, _, s2_aft = s2.partition(com)
 before = find_raw_patterns(s1_bef, s2_bef)
 after = find_raw_patterns(s1_aft, s2_aft)
 return (before[0] + [1, com] + after[0],
 before[1] + [1, com] + after[1])

For find_common_patterns, you can use list indices to get the even and odd values rather than checking the type of each value, and use zip to get a list of pair tuples.

def find_common_patterns(s1, s2):
 (ptr1, ptr2) = find_raw_patterns(s1, s2) #ptr - pattern
 return (zip(ptr1[::2], ptr1[1::2]),
 zip(ptr2[::2], ptr2[1::2]))

But I think you can also combine the two functions with a minor adjustment to put the found values into paired tuples.

def find_common_patterns(s1, s2): # used recursively
 if s1 == '' or s2 == '':
 return [], []
 com = longest_common_substring(s1, s2)
 if len(com) < 2:
 return ([(0, s1)], [(0, s2)])
 s1_bef, _, s1_aft = s1.partition(com)
 s2_bef, _, s2_aft = s2.partition(com)
 before = find_common_patterns(s1_bef, s2_bef)
 after = find_common_patterns(s1_aft, s2_aft)
 return (before[0] + [(1, com)] + after[0],
 before[1] + [(1, com)] + after[1])
answered Feb 10, 2013 at 9:47
\$\endgroup\$
3
  • \$\begingroup\$ Thank you! That makes the code a lot more compact and probably makes it easier to understand, too. There's a problem with the: if s1 == '' or s2 == '': return [], [] It makes the result break off the other string when one string is "finished". Like this: >>> find_raw_patterns('aoesunthaoesnutheoau', 'oaeurolqsnjthkoaeun') ([0, 'aoes', 1, 'un'], [0, 'oaeurolqsnjthkoae', 1, 'un']) I just removed that if clause and it solves the problem, just have to deal with possible empty strings in the list but that may be a fair trade-off. \$\endgroup\$ Commented Feb 10, 2013 at 10:46
  • \$\begingroup\$ Okay, glad it helped. You may want to try something like if s1 == '' and not s2 == '': return [], [(0, s2)], etc. \$\endgroup\$ Commented Feb 10, 2013 at 11:14
  • \$\begingroup\$ I'm a bit ashamed I didn't think of that myself, haha. The following worked fine: if s1 == '' or s2 == '': return ([(0, s1)] if s1 != '' else []), ([(0, s2)] if s2 != '' else []) \$\endgroup\$ Commented Feb 10, 2013 at 11:52

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.