Python: Combing two programs that analyze strings

Question 1

I am trying to combine two programs that analyze strings into one program. The result should look at a parent string and determine if a sub-string has an exact match or a "close" match within the parent string; if it does it returns the location of this match

i.e. parent: abcdefefefef sub cde is exact and returns 2 sub efe is exact and returns 4, 6, 8 sub deg is close and returns 3 sub z has no match and returns None

The first program I wrote locates exact matches of a sub-string within a parent string:

import string
def subStringMatchExact():
 print "This program will index the locations a given sequence"
 print "occurs within a larger sequence"
 seq = raw_input("Please input a sequence to search within: ")
 sub = raw_input("Please input a sequence to search for: ")
 n = 0
 for i in range(0,len(seq)):
 x = string.find(seq, sub, n, len(seq))
 if x == -1:
 break
 print x
 n = x + 1

The second analyzes whether or not there is an close match within a parent string and returns True if there is, and false if there isn't. Again a close match is an exact match or one that differs by only one character.

import string
def similarstrings():
 print "This program will determine whether two strings differ"
 print "by more than one character. It will return True when they"
 print "are the same or differ by one character; otherwise it will"
 print "return False"
 str1 = raw_input("Enter first string:")
 str2 = raw_input("Enter second string:")
 x = 0
 for i in range(len(str1)):
 if str1[i] == str2[i]:
 x = x + 1
 else:
 x = x
 if x >= len(str1) - 1:
 print True
 else:
 print False

Any suggestions on how to combine the two would be much appreciated.

Question 2

Firstly, a quick review of what you've done:

import string is unnecessary really. You utilize string.find, but you can simply call the find method on your first argument, that is, seq.find(sub, n, len(seq)).

In your second method, you have:

if str1[i] == str2[i]:
 x = x + 1
else:
 x = x

The else statement is completely redundant here, you don't need it.

I think we can simplify the logic quite a bit here. I'm going to ignore raw_input and take the strings as arguments instead, but it doesn't change the logic (and is arguably better style anyway). Note that this is not fantastically optimized, it can definitely be made faster:

def similar_strings(s1, s2):
 '''Fuzzy matches s1 with s2, where fuzzy match is defined as
 either all of s2 being contained somewhere within s1, or
 a difference of only one character. Note if len(s2) == 1, then
 the character represented by s2 must be found within s1 for a match
 to occur.
 Returns a list with the starting indices of all matches.
 '''
 block_size = len(s2)
 match = []
 for j in range(0, len(s1) - block_size):
 diff = [ord(i) - ord(j) for (i, j) in zip(s1[j:+block_size],s2)]
 if diff.count(0) and diff.count(0) >= (block_size - 1):
 match.append(j)
 if not match:
 return None
 return match

How does this work? Well, it basically creates a "window" of size len(s2) which shifts along the original string (s1). Within this window, we take a slice of our original string (s1[j:j+block_size]) which is of length len(s2). zip creates a list of tuples, so, for example, zip(['a','b','c'], ['x','y','z']) will give us back [(a, x), (b, y), (c, z)].

Then, we make a difference of all the characters from our s2 sized block of s1 and our actual s2 - ord gives us the (integer) value of a character. So the rather tricky line diff = [ord(i) - ord(j) for (i, j) in zip(s1[j:+block_size],s2)] will give us a list of size len(s2) which has all the differences between characters. As an example, with s1 being abcdefefefef and s2 being cde, our first run through will give us the slice abc to compare with cde - which gives us [-2, -2, -2] since all characters are a distance of 2 apart.

So, given diff, we want to see how many 0's it has, as two characters will be the same when their ord difference is 0. However, we need to make sure there is at least 1 character the same in our block (if you simply pass in z, or any string length 1, diff will always have length 1, and block_size - 1 will be 0. Hence diff.count(0) >= 0 will always be True, which is not what we want). Thus diff.count(0) will only be true if it is greater than 0, which protects us from this. When that's true, look at the block size, make sure it is at least len(s2) - 1, and if so, add that index to match.

Hopefully this gives you an idea of how you'd do this, and I hope the explanation is detailed enough to make sense of some python constructs you may not have seen before.

Yuushi Yuushi 11.1k2 gold badges31 silver badges66 bronze badges · Answer 1 · 2013-02-26 02:41:03Z

Firstly, a quick review of what you've done:

import string is unnecessary really. You utilize string.find, but you can simply call the find method on your first argument, that is, seq.find(sub, n, len(seq)).

In your second method, you have:

if str1[i] == str2[i]:
 x = x + 1
else:
 x = x

The else statement is completely redundant here, you don't need it.

I think we can simplify the logic quite a bit here. I'm going to ignore raw_input and take the strings as arguments instead, but it doesn't change the logic (and is arguably better style anyway). Note that this is not fantastically optimized, it can definitely be made faster:

def similar_strings(s1, s2):
 '''Fuzzy matches s1 with s2, where fuzzy match is defined as
 either all of s2 being contained somewhere within s1, or
 a difference of only one character. Note if len(s2) == 1, then
 the character represented by s2 must be found within s1 for a match
 to occur.
 Returns a list with the starting indices of all matches.
 '''
 block_size = len(s2)
 match = []
 for j in range(0, len(s1) - block_size):
 diff = [ord(i) - ord(j) for (i, j) in zip(s1[j:+block_size],s2)]
 if diff.count(0) and diff.count(0) >= (block_size - 1):
 match.append(j)
 if not match:
 return None
 return match

How does this work? Well, it basically creates a "window" of size len(s2) which shifts along the original string (s1). Within this window, we take a slice of our original string (s1[j:j+block_size]) which is of length len(s2). zip creates a list of tuples, so, for example, zip(['a','b','c'], ['x','y','z']) will give us back [(a, x), (b, y), (c, z)].

Then, we make a difference of all the characters from our s2 sized block of s1 and our actual s2 - ord gives us the (integer) value of a character. So the rather tricky line diff = [ord(i) - ord(j) for (i, j) in zip(s1[j:+block_size],s2)] will give us a list of size len(s2) which has all the differences between characters. As an example, with s1 being abcdefefefef and s2 being cde, our first run through will give us the slice abc to compare with cde - which gives us [-2, -2, -2] since all characters are a distance of 2 apart.

So, given diff, we want to see how many 0's it has, as two characters will be the same when their ord difference is 0. However, we need to make sure there is at least 1 character the same in our block (if you simply pass in z, or any string length 1, diff will always have length 1, and block_size - 1 will be 0. Hence diff.count(0) >= 0 will always be True, which is not what we want). Thus diff.count(0) will only be true if it is greater than 0, which protects us from this. When that's true, look at the block size, make sure it is at least len(s2) - 1, and if so, add that index to match.

Hopefully this gives you an idea of how you'd do this, and I hope the explanation is detailed enough to make sense of some python constructs you may not have seen before.

Stack Exchange Network

Python: Combing two programs that analyze strings

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Python: Combing two programs that analyze strings

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions