Python timing optimisation

Question 1

I'm new in this community and with Python 3!

I'm trying to optimize this exercise:

We have n strings stored in a txt file. Inside the strings a secret word is hidden as a substring of consecutive characters.

We know with certainty that the word repeats itself exactly once in each string but we don't know where.

Of the word we know the length M and we know that there are no other substrings of length M which are repeated only once in all strings.

We want to know for each string the position where the first character appears of the hidden word.

For example for the 3 strings with hidden word of length 3:
dogrtre
sdfddoge
bcb98xdoge42
the hidden word is 'dog' and the positions are in the order: [0, 4, 6]

The information in the text file is organized as follows:

the first line contains the length of the hidden word (an integer).

then follow the character strings, each string occupies one or more consecutive lines of the file and is separated from the following string by an empty line.

Each line of the file ends with a newline.

This is my code now, it works, but I need to speed up the process of searching the hidden word...

The file.txt is:

3
dogrtre
sdfdd
oge
bc
b98xdo
ge42

As you can see the string in the file can stay on different lines. Can you help me?

#n=the lenght of the hidden word
#F=TextFile
#lista1=list of the possible hidden word of lenght n, the substring of the 
#first string
#k=number of times that the word is in the file 
#pos=list of position
def es1(ftesto):
 def conta(F): #this function count the strings in the file
 F.seek(0)
 a=1
 l=F.readline()
 while l!='':
 if l=='\n':
 a+=1
 l=F.readline()
 return a 
 def crea_chiave(F): #this function read and split the first of the file 
 s='' #in substring of lenght n
 l=F.readline() 
 while l!='\n':
 s=s+l[:len(l)-1]
 l=F.readline() 
 for i in range(0,len(s)-(n-1)):
 if s[i:n+i] not in lista1: #it also control if there is 
 lista1.append(s[i:n+i]) #already the same substring in the
 else: #string
 lista1.remove(s[i:n+i]) #and in this case the function remove 
 #it from the list
 def crea_lista(F,a,lista): #this function create the list 
 s=''
 l=F.readline()
 while l!='':
 s=spezza(l,s,F)
 if s not in lista: #if the string is already in the list,it goes
 lista.append(s) #to the next one
 else:
 break
 l=F.readline()
 s=''
 def spezza(l,s,F): #this function build the string from the file 
 while l!='\n' and l!='':
 s=s+l[:len(l)-1]
 l=F.readline()
 return s 
 def controlla_parole(i,lista,k):#this function count how many times
 for y in lista: #times the word repeat itself in each strings 
 if y.count(i)!=1: #if the word repeats itself 2 or 0 times 
 break #in the string it isn't the hidden word 
 else:
 k+=1 
 return k 
 def trova_chiave(F,lista1,n): #this function find the hidden word
 a=conta(open(ftesto,encoding='utf8'))
 lista=[]
 crea_lista(F,a,lista)
 k=0
 for i in lista1: 
 k=controlla_parole(i,lista,k)
 if a==k: # if the number of hidden word repeats
 break #itself in the file is the same of 
 k=0 #the number of strings in a file it is 
 return i #our hidden word
 def controllo(F,l,chiave):# finally this func ,once we have the hidden 
 pos=[] #word,return the position of hiddenword in the 
 s='' #strings of file
 while l!='':
 while l!='\n' and l!='':
 s=s+l[:len(l)-1]
 l=F.readline() 
 pos.append(s.find(chiave))
 s=''
 l=F.readline() 
 return pos
 F=open(ftesto,encoding='utf8')
 n=int(F.readline())
 F.seek(len(str(n))+1)
 lista1=[]
 crea_chiave(F)
 F.seek(len(str(n))+1)
 chiave=trova_chiave(F,lista1,n) 
 F.seek(len(str(n))+1) 
 return controllo(F,F.readline(),chiave)

With the given file the function does returns [0, 4, 6].

Can you help me to optimize and speed up this code? Thanks!!

Question 2

Will different words always be separated by a newline in the text file?

Question 3

What if each word would have multiple common substrings of same length? Let's say, besides of dog - each string contains cbd substring. How it should handle multiple common substrings with same length per each string?

Question 4

For the first question yes!

Question 5

For the second we know that there is only one word with this features

Question 6

I think crea_chiave() may not work correctly if a s[i:n+i] is in the text more than 2 times (1st time it is added to lista1. 2nd time it is removed. 3rd time it is added again.).

Question 7

Type annotations and more descriptive variable names (from my perspective, preferably in English, heh) would make it easier for others to navigate the code and make specific suggestions, but here's how I'd suggest splitting the task up:

Have a function that will read the file and return a List[str] of all the strings (i.e. split the file on blank lines and remove all the newlines). If memory ends up being a limiting factor, you'd want to implement this as a generator instead (and you'll end up potentially doing two passes through the file), but if you can hold the whole file in memory then it's easier to just read it once and hold the whole thing as a list.
Have a function that finds all the words of length M in a particular string that repeat exactly once (i.e. it returns a set of words). You can maybe do this via some tricky regexing, or just brute-force trying every M-slice of the string and looking for repeat occurrences in the remainder of the string. This will be relatively slow (the brute force method will be roughly O(n^2), a regex might be faster), but you won't have to call it very many times, hopefully.
Go through the list and start using that repeating-subset-finding function. Each time you get the set of words from a substring, intersect it with your set. Once you have one (and this might happen on the first string you check), that's your secret word! You finished the hard part.
Now that you know the secret word, the rest is easy. Go through the list and use the builtin find function to find the indices you're after.
If you want to optimize further (I ended up implementing this below), do two types of search: first get a list of candidates from the shortest input string, and then go through all the other inputs and verify that each of those occurs once, eliminating bad candidates as you go. The number of find() operations you need to do will shrink as you eliminate candidates, and once you're down to one you're finished.

from typing import List, Set, Tuple
def get_secret_word(input_file: str) -> Tuple[str, List[int]]:
 """Returns the secret word and a list of its indices."""
 def load_file(input_file: str) -> Tuple[int, List[str]]:
 """Returns length of the needle and all the haystacks."""
 with open(input_file) as file:
 length = int(file.readline())
 strings = [""]
 line = file.readline()
 while line:
 line = line.strip()
 if line:
 strings[-1] += line
 else:
 strings.append("")
 line = file.readline()
 return length, strings
 def generate_words(length: int, haystack: str) -> Set[str]:
 """Returns words of this length that occur in this haystack."""
 return {haystack[i:i+length] for i in range(len(haystack)-length+1)}
 def find_known_words(needles: Set[str], haystack: str) -> Set[str]:
 """Search the haystack for this set of needles,
 and return just the needles that were found once."""
 found = set()
 for needle in needles:
 i = haystack.find(needle)
 if i == -1: # zero occurrences
 continue
 i = haystack.find(needle, i + len(needle))
 if i == -1: # exactly one occurrence
 found.add(needle) 
 return found
 # Load up the file.
 length, haystacks = load_file(input_file)
 # Build an initial set of needles from the smallest haystack.
 needles = generate_words(length, min(haystacks, key=len))
 for haystack in haystacks:
 # Our secret word is the one that is unique
 # in ALL haystacks. Process of elimination...
 if len(needles) == 1:
 break
 needles &= find_known_words(needles, haystack)
 assert len(needles) == 1, "No secret word found!"
 secret = needles.pop()
 # Return the secret word and a list of its indices in the input strings.
 return secret, [haystack.find(secret) for haystack in haystacks]

Question 8

Thank you very much for the answer, yes certainly I can try in this way. Would I still maintain greater efficiency knowing that in the file I can have equal strings? Because in my code I foresee this possibility.

Question 9

It depends on how malicious the input file is in trying to make the secret word hard to find -- the worst case scenario would be one where every string has the same two repeating substrings (so that either one of them could be the secret word) until you get to the very end. :) If you wanted to optimize that case, your second function could accept the set of candidates as an argument, and use that to speed up its search a bit (i.e. it doesn't need to waste time testing any substrings that aren't already candidates).

Question 10

Ah, re-reading the problem I realize that when you say "repeats once" you mean "occurs once" (as opposed to occurs twice, i.e. repeated). Yes, you'll definitely need to keep a set and pare it down!

Question 11

I went ahead and wrote up an implementation to make sure my strategy made sense, adding it to my answer. :)

Question 12

I think the way to implement the optimization I mentioned would be to write another function that searches the haystack for a set of needles (returning those that were found), and then use that for every haystack after the first. Another optimization after that would be to start with the smallest haystack (rather than the one that's first in the file), since that will start you with the smallest initial set of needles to search for.

Samwise Samwise 4,0107 silver badges14 bronze badges · Accepted Answer · 2019-11-16 22:09:07Z

Type annotations and more descriptive variable names (from my perspective, preferably in English, heh) would make it easier for others to navigate the code and make specific suggestions, but here's how I'd suggest splitting the task up:

Have a function that will read the file and return a List[str] of all the strings (i.e. split the file on blank lines and remove all the newlines). If memory ends up being a limiting factor, you'd want to implement this as a generator instead (and you'll end up potentially doing two passes through the file), but if you can hold the whole file in memory then it's easier to just read it once and hold the whole thing as a list.
Have a function that finds all the words of length M in a particular string that repeat exactly once (i.e. it returns a set of words). You can maybe do this via some tricky regexing, or just brute-force trying every M-slice of the string and looking for repeat occurrences in the remainder of the string. This will be relatively slow (the brute force method will be roughly O(n^2), a regex might be faster), but you won't have to call it very many times, hopefully.
Go through the list and start using that repeating-subset-finding function. Each time you get the set of words from a substring, intersect it with your set. Once you have one (and this might happen on the first string you check), that's your secret word! You finished the hard part.
Now that you know the secret word, the rest is easy. Go through the list and use the builtin find function to find the indices you're after.
If you want to optimize further (I ended up implementing this below), do two types of search: first get a list of candidates from the shortest input string, and then go through all the other inputs and verify that each of those occurs once, eliminating bad candidates as you go. The number of find() operations you need to do will shrink as you eliminate candidates, and once you're down to one you're finished.

from typing import List, Set, Tuple
def get_secret_word(input_file: str) -> Tuple[str, List[int]]:
 """Returns the secret word and a list of its indices."""
 def load_file(input_file: str) -> Tuple[int, List[str]]:
 """Returns length of the needle and all the haystacks."""
 with open(input_file) as file:
 length = int(file.readline())
 strings = [""]
 line = file.readline()
 while line:
 line = line.strip()
 if line:
 strings[-1] += line
 else:
 strings.append("")
 line = file.readline()
 return length, strings
 def generate_words(length: int, haystack: str) -> Set[str]:
 """Returns words of this length that occur in this haystack."""
 return {haystack[i:i+length] for i in range(len(haystack)-length+1)}
 def find_known_words(needles: Set[str], haystack: str) -> Set[str]:
 """Search the haystack for this set of needles,
 and return just the needles that were found once."""
 found = set()
 for needle in needles:
 i = haystack.find(needle)
 if i == -1: # zero occurrences
 continue
 i = haystack.find(needle, i + len(needle))
 if i == -1: # exactly one occurrence
 found.add(needle) 
 return found
 # Load up the file.
 length, haystacks = load_file(input_file)
 # Build an initial set of needles from the smallest haystack.
 needles = generate_words(length, min(haystacks, key=len))
 for haystack in haystacks:
 # Our secret word is the one that is unique
 # in ALL haystacks. Process of elimination...
 if len(needles) == 1:
 break
 needles &= find_known_words(needles, haystack)
 assert len(needles) == 1, "No secret word found!"
 secret = needles.pop()
 # Return the secret word and a list of its indices in the input strings.
 return secret, [haystack.find(secret) for haystack in haystacks]

Thank you very much for the answer, yes certainly I can try in this way. Would I still maintain greater efficiency knowing that in the file I can have equal strings? Because in my code I foresee this possibility.
It depends on how malicious the input file is in trying to make the secret word hard to find -- the worst case scenario would be one where every string has the same two repeating substrings (so that either one of them could be the secret word) until you get to the very end. :) If you wanted to optimize that case, your second function could accept the set of candidates as an argument, and use that to speed up its search a bit (i.e. it doesn't need to waste time testing any substrings that aren't already candidates).
Ah, re-reading the problem I realize that when you say "repeats once" you mean "occurs once" (as opposed to occurs twice, i.e. repeated). Yes, you'll definitely need to keep a set and pare it down!
I went ahead and wrote up an implementation to make sure my strategy made sense, adding it to my answer. :)
I think the way to implement the optimization I mentioned would be to write another function that searches the haystack for a set of needles (returning those that were found), and then use that for every haystack after the first. Another optimization after that would be to start with the smallest haystack (rather than the one that's first in the file), since that will start you with the smallest initial set of needles to search for.

Stack Exchange Network

Python timing optimisation

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Python timing optimisation

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions