I have written a piece of code that finds common patterns in two strings. These patterns have to be in the same order, so for example "I am a person" and "A person I am" would only match "person". The code is crude, all characters, including whitespace and punctuation marks, receive the same treatment. The longest patterns are matched first.
The main function then returns two lists (one per string) of tuples with two elements. The first element is 1 if a substring has been matched, else 0. The second element is the substring.
So the format of returned lists will be like this:
[(1, 'I '), (0, 'am a person\n')]
[(1, 'I '), (0, 'can see\n')]
Now to the question -- what do you think about my code? I somehow feel that it's not quite top-notch, and I'm not an experienced coder. Any suggestions on coding style or the algorithms? I hope the code is reasonably clear.
The find_raw_patterns
function returns lists with alternating integers and strings, so the only thing find_common_patterns
does in addition to calling find_raw_patterns
, is to arrange the lists' elements in two-element-tuples.
The function longest_common_substring
is copied directly from Wikibooks, and I'm not very concerned about that function.
Code:
def longest_common_substring(S1, S2):
M = [[0]*(1+len(S2)) for i in range(1+len(S1))]
longest, x_longest = 0, 0
for x in range(1,1+len(S1)):
for y in range(1,1+len(S2)):
if S1[x-1] == S2[y-1]:
M[x][y] = M[x-1][y-1] + 1
if M[x][y]>longest:
longest = M[x][y]
x_longest = x
else:
M[x][y] = 0
return S1[x_longest-longest: x_longest]
def find_common_patterns(s1, s2):
arranged1 = []
arranged2 = []
(ptr1, ptr2) = find_raw_patterns(s1, s2) #ptr - pattern
for i in range(len(ptr1) - 1):
if type(ptr1[i]) == int:
arranged1.append((ptr1[i], ptr1[i+1]))
for i in range(len(ptr2) - 1):
if type(ptr2[i]) == int:
arranged2.append((ptr2[i], ptr2[i+1]))
return (arranged1, arranged2)
def find_raw_patterns(s1, s2): # used recursively
one = [] # used to reassemble strings, but with patterns and integer showing whether it's been matched or not
two = [] # same, but for the second string
com = longest_common_substring(s1, s2)
if len(com) < 2:
return ((0, s1), (0, s2))
elif len(com) >= 2:
i1 = s1.index(com)
i2 = s2.index(com)
s1_bef = s1[:i1] #part of string before the matched pattern
s1_aft = s1[i1 + len(com) : ] # -//- after matched pattern
s2_bef = s2[:i2]
s2_aft = s2[i2 + len(com) : ]
if len(s1_bef) > 0 and len(s2_bef) > 0: # find patterns in first parts of strings
res = find_raw_patterns(s1_bef, s2_bef)
one.extend(res[0])
two.extend(res[1])
one.extend((1, com)) # add current pattern
two.extend((1, com))
if len(s1_aft) > 0 and len(s2_aft) > 0: # find patterns from second parts
res = find_raw_patterns(s1_aft, s2_aft)
one.extend(res[0])
two.extend(res[1])
return (one, two)
1 Answer 1
You can simplify the two functions quite a bit. For find_raw_patterns
you can use partition
instead of manually splitting the strings up, and simplify the formation of the list.
def find_raw_patterns(s1, s2): # used recursively
if s1 == '' or s2 == '':
return [], []
com = longest_common_substring(s1, s2)
if len(com) < 2:
return ([0, s1], [0, s2])
s1_bef, _, s1_aft = s1.partition(com)
s2_bef, _, s2_aft = s2.partition(com)
before = find_raw_patterns(s1_bef, s2_bef)
after = find_raw_patterns(s1_aft, s2_aft)
return (before[0] + [1, com] + after[0],
before[1] + [1, com] + after[1])
For find_common_patterns
, you can use list indices to get the even and odd values rather than checking the type of each value, and use zip
to get a list of pair tuples.
def find_common_patterns(s1, s2):
(ptr1, ptr2) = find_raw_patterns(s1, s2) #ptr - pattern
return (zip(ptr1[::2], ptr1[1::2]),
zip(ptr2[::2], ptr2[1::2]))
But I think you can also combine the two functions with a minor adjustment to put the found values into paired tuples.
def find_common_patterns(s1, s2): # used recursively
if s1 == '' or s2 == '':
return [], []
com = longest_common_substring(s1, s2)
if len(com) < 2:
return ([(0, s1)], [(0, s2)])
s1_bef, _, s1_aft = s1.partition(com)
s2_bef, _, s2_aft = s2.partition(com)
before = find_common_patterns(s1_bef, s2_bef)
after = find_common_patterns(s1_aft, s2_aft)
return (before[0] + [(1, com)] + after[0],
before[1] + [(1, com)] + after[1])
-
\$\begingroup\$ Thank you! That makes the code a lot more compact and probably makes it easier to understand, too. There's a problem with the:
if s1 == '' or s2 == '': return [], []
It makes the result break off the other string when one string is "finished". Like this:>>> find_raw_patterns('aoesunthaoesnutheoau', 'oaeurolqsnjthkoaeun') ([0, 'aoes', 1, 'un'], [0, 'oaeurolqsnjthkoae', 1, 'un'])
I just removed that if clause and it solves the problem, just have to deal with possible empty strings in the list but that may be a fair trade-off. \$\endgroup\$5th– 5th2013年02月10日 10:46:27 +00:00Commented Feb 10, 2013 at 10:46 -
\$\begingroup\$ Okay, glad it helped. You may want to try something like
if s1 == '' and not s2 == '': return [], [(0, s2)]
, etc. \$\endgroup\$Stuart– Stuart2013年02月10日 11:14:20 +00:00Commented Feb 10, 2013 at 11:14 -
\$\begingroup\$ I'm a bit ashamed I didn't think of that myself, haha. The following worked fine:
if s1 == '' or s2 == '': return ([(0, s1)] if s1 != '' else []), ([(0, s2)] if s2 != '' else [])
\$\endgroup\$5th– 5th2013年02月10日 11:52:29 +00:00Commented Feb 10, 2013 at 11:52
a, b = "same order words", "not same but order words matched"
then[a[i:i+n] for i, _, n in difflib.SequenceMatcher(None, a, b).get_matching_blocks() if n]
returns['same', ' order words']
. It works with arbitrary hashable sequences, not just strings. \$\endgroup\$