I am trying to combine two programs that analyze strings into one program. The result should look at a parent string and determine if a sub-string has an exact match or a "close" match within the parent string; if it does it returns the location of this match
i.e. parent: abcdefefefef sub cde is exact and returns 2 sub efe is exact and returns 4, 6, 8 sub deg is close and returns 3 sub z has no match and returns None
The first program I wrote locates exact matches of a sub-string within a parent string:
import string
def subStringMatchExact():
print "This program will index the locations a given sequence"
print "occurs within a larger sequence"
seq = raw_input("Please input a sequence to search within: ")
sub = raw_input("Please input a sequence to search for: ")
n = 0
for i in range(0,len(seq)):
x = string.find(seq, sub, n, len(seq))
if x == -1:
break
print x
n = x + 1
The second analyzes whether or not there is an close match within a parent string and returns True if there is, and false if there isn't. Again a close match is an exact match or one that differs by only one character.
import string
def similarstrings():
print "This program will determine whether two strings differ"
print "by more than one character. It will return True when they"
print "are the same or differ by one character; otherwise it will"
print "return False"
str1 = raw_input("Enter first string:")
str2 = raw_input("Enter second string:")
x = 0
for i in range(len(str1)):
if str1[i] == str2[i]:
x = x + 1
else:
x = x
if x >= len(str1) - 1:
print True
else:
print False
Any suggestions on how to combine the two would be much appreciated.
1 Answer 1
Firstly, a quick review of what you've done:
import string
is unnecessary really. You utilize string.find
, but you can simply call the find method on your first argument, that is, seq.find(sub, n, len(seq))
.
In your second method, you have:
if str1[i] == str2[i]:
x = x + 1
else:
x = x
The else
statement is completely redundant here, you don't need it.
I think we can simplify the logic quite a bit here. I'm going to ignore raw_input
and take the strings as arguments instead, but it doesn't change the logic (and is arguably better style anyway). Note that this is not fantastically optimized, it can definitely be made faster:
def similar_strings(s1, s2):
'''Fuzzy matches s1 with s2, where fuzzy match is defined as
either all of s2 being contained somewhere within s1, or
a difference of only one character. Note if len(s2) == 1, then
the character represented by s2 must be found within s1 for a match
to occur.
Returns a list with the starting indices of all matches.
'''
block_size = len(s2)
match = []
for j in range(0, len(s1) - block_size):
diff = [ord(i) - ord(j) for (i, j) in zip(s1[j:+block_size],s2)]
if diff.count(0) and diff.count(0) >= (block_size - 1):
match.append(j)
if not match:
return None
return match
How does this work? Well, it basically creates a "window" of size len(s2)
which shifts along the original string (s1
). Within this window, we take a slice of our original string (s1[j:j+block_size]
) which is of length len(s2)
. zip
creates a list of tuples, so, for example, zip(['a','b','c'], ['x','y','z'])
will give us back [(a, x), (b, y), (c, z)
].
Then, we make a difference of all the characters from our s2
sized block of s1
and our actual s2
- ord
gives us the (integer) value of a character. So the rather tricky line diff = [ord(i) - ord(j) for (i, j) in zip(s1[j:+block_size],s2)]
will give us a list of size len(s2)
which has all the differences between characters. As an example, with s1
being abcdefefefef
and s2
being cde
, our first run through will give us the slice abc
to compare with cde
- which gives us [-2, -2, -2]
since all characters are a distance of 2 apart.
So, given diff
, we want to see how many 0's it has, as two characters will be the same when their ord
difference is 0. However, we need to make sure there is at least 1 character the same in our block (if you simply pass in z
, or any string length 1, diff will always have length 1, and block_size - 1
will be 0. Hence diff.count(0) >= 0
will always be True
, which is not what we want). Thus diff.count(0)
will only be true if it is greater than 0, which protects us from this. When that's true, look at the block size, make sure it is at least len(s2) - 1
, and if so, add that index to match
.
Hopefully this gives you an idea of how you'd do this, and I hope the explanation is detailed enough to make sense of some python constructs you may not have seen before.