I'm new in this community and with Python 3!
I'm trying to optimize this exercise:
We have
n
strings stored in a txt file. Inside the strings a secret word is hidden as a substring of consecutive characters.We know with certainty that the word repeats itself exactly once in each string but we don't know where.
Of the word we know the length
M
and we know that there are no other substrings of lengthM
which are repeated only once in all strings.We want to know for each string the position where the first character appears of the hidden word.
For example for the 3 strings with hidden word of length 3:
dogrtre sdfddoge bcb98xdoge42
the hidden word is 'dog' and the positions are in the order:
[0, 4, 6]
The information in the text file is organized as follows:
the first line contains the length of the hidden word (an integer).
then follow the character strings, each string occupies one or more consecutive lines of the file and is separated from the following string by an empty line.
Each line of the file ends with a newline.
This is my code now, it works, but I need to speed up the process of searching the hidden word...
The file.txt is:
3
dogrtre
sdfdd
oge
bc
b98xdo
ge42
As you can see the string in the file can stay on different lines. Can you help me?
#n=the lenght of the hidden word
#F=TextFile
#lista1=list of the possible hidden word of lenght n, the substring of the
#first string
#k=number of times that the word is in the file
#pos=list of position
def es1(ftesto):
def conta(F): #this function count the strings in the file
F.seek(0)
a=1
l=F.readline()
while l!='':
if l=='\n':
a+=1
l=F.readline()
return a
def crea_chiave(F): #this function read and split the first of the file
s='' #in substring of lenght n
l=F.readline()
while l!='\n':
s=s+l[:len(l)-1]
l=F.readline()
for i in range(0,len(s)-(n-1)):
if s[i:n+i] not in lista1: #it also control if there is
lista1.append(s[i:n+i]) #already the same substring in the
else: #string
lista1.remove(s[i:n+i]) #and in this case the function remove
#it from the list
def crea_lista(F,a,lista): #this function create the list
s=''
l=F.readline()
while l!='':
s=spezza(l,s,F)
if s not in lista: #if the string is already in the list,it goes
lista.append(s) #to the next one
else:
break
l=F.readline()
s=''
def spezza(l,s,F): #this function build the string from the file
while l!='\n' and l!='':
s=s+l[:len(l)-1]
l=F.readline()
return s
def controlla_parole(i,lista,k):#this function count how many times
for y in lista: #times the word repeat itself in each strings
if y.count(i)!=1: #if the word repeats itself 2 or 0 times
break #in the string it isn't the hidden word
else:
k+=1
return k
def trova_chiave(F,lista1,n): #this function find the hidden word
a=conta(open(ftesto,encoding='utf8'))
lista=[]
crea_lista(F,a,lista)
k=0
for i in lista1:
k=controlla_parole(i,lista,k)
if a==k: # if the number of hidden word repeats
break #itself in the file is the same of
k=0 #the number of strings in a file it is
return i #our hidden word
def controllo(F,l,chiave):# finally this func ,once we have the hidden
pos=[] #word,return the position of hiddenword in the
s='' #strings of file
while l!='':
while l!='\n' and l!='':
s=s+l[:len(l)-1]
l=F.readline()
pos.append(s.find(chiave))
s=''
l=F.readline()
return pos
F=open(ftesto,encoding='utf8')
n=int(F.readline())
F.seek(len(str(n))+1)
lista1=[]
crea_chiave(F)
F.seek(len(str(n))+1)
chiave=trova_chiave(F,lista1,n)
F.seek(len(str(n))+1)
return controllo(F,F.readline(),chiave)
With the given file the function does returns [0, 4, 6]
.
Can you help me to optimize and speed up this code? Thanks!!
1 Answer 1
Type annotations and more descriptive variable names (from my perspective, preferably in English, heh) would make it easier for others to navigate the code and make specific suggestions, but here's how I'd suggest splitting the task up:
Have a function that will read the file and return a
List[str]
of all the strings (i.e. split the file on blank lines and remove all the newlines). If memory ends up being a limiting factor, you'd want to implement this as a generator instead (and you'll end up potentially doing two passes through the file), but if you can hold the whole file in memory then it's easier to just read it once and hold the whole thing as a list.Have a function that finds all the words of length M in a particular string that repeat exactly once (i.e. it returns a set of words). You can maybe do this via some tricky regexing, or just brute-force trying every M-slice of the string and looking for repeat occurrences in the remainder of the string. This will be relatively slow (the brute force method will be roughly O(n^2), a regex might be faster), but you won't have to call it very many times, hopefully.
Go through the list and start using that repeating-subset-finding function. Each time you get the set of words from a substring, intersect it with your set. Once you have one (and this might happen on the first string you check), that's your secret word! You finished the hard part.
Now that you know the secret word, the rest is easy. Go through the list and use the builtin
find
function to find the indices you're after.If you want to optimize further (I ended up implementing this below), do two types of search: first get a list of candidates from the shortest input string, and then go through all the other inputs and verify that each of those occurs once, eliminating bad candidates as you go. The number of
find()
operations you need to do will shrink as you eliminate candidates, and once you're down to one you're finished.
from typing import List, Set, Tuple
def get_secret_word(input_file: str) -> Tuple[str, List[int]]:
"""Returns the secret word and a list of its indices."""
def load_file(input_file: str) -> Tuple[int, List[str]]:
"""Returns length of the needle and all the haystacks."""
with open(input_file) as file:
length = int(file.readline())
strings = [""]
line = file.readline()
while line:
line = line.strip()
if line:
strings[-1] += line
else:
strings.append("")
line = file.readline()
return length, strings
def generate_words(length: int, haystack: str) -> Set[str]:
"""Returns words of this length that occur in this haystack."""
return {haystack[i:i+length] for i in range(len(haystack)-length+1)}
def find_known_words(needles: Set[str], haystack: str) -> Set[str]:
"""Search the haystack for this set of needles,
and return just the needles that were found once."""
found = set()
for needle in needles:
i = haystack.find(needle)
if i == -1: # zero occurrences
continue
i = haystack.find(needle, i + len(needle))
if i == -1: # exactly one occurrence
found.add(needle)
return found
# Load up the file.
length, haystacks = load_file(input_file)
# Build an initial set of needles from the smallest haystack.
needles = generate_words(length, min(haystacks, key=len))
for haystack in haystacks:
# Our secret word is the one that is unique
# in ALL haystacks. Process of elimination...
if len(needles) == 1:
break
needles &= find_known_words(needles, haystack)
assert len(needles) == 1, "No secret word found!"
secret = needles.pop()
# Return the secret word and a list of its indices in the input strings.
return secret, [haystack.find(secret) for haystack in haystacks]
-
\$\begingroup\$ Thank you very much for the answer, yes certainly I can try in this way. Would I still maintain greater efficiency knowing that in the file I can have equal strings? Because in my code I foresee this possibility. \$\endgroup\$Lorenzo Arcioni– Lorenzo Arcioni2019年11月16日 22:22:24 +00:00Commented Nov 16, 2019 at 22:22
-
\$\begingroup\$ It depends on how malicious the input file is in trying to make the secret word hard to find -- the worst case scenario would be one where every string has the same two repeating substrings (so that either one of them could be the secret word) until you get to the very end. :) If you wanted to optimize that case, your second function could accept the set of candidates as an argument, and use that to speed up its search a bit (i.e. it doesn't need to waste time testing any substrings that aren't already candidates). \$\endgroup\$Samwise– Samwise2019年11月16日 23:04:15 +00:00Commented Nov 16, 2019 at 23:04
-
\$\begingroup\$ Ah, re-reading the problem I realize that when you say "repeats once" you mean "occurs once" (as opposed to occurs twice, i.e. repeated). Yes, you'll definitely need to keep a set and pare it down! \$\endgroup\$Samwise– Samwise2019年11月17日 01:26:23 +00:00Commented Nov 17, 2019 at 1:26
-
\$\begingroup\$ I went ahead and wrote up an implementation to make sure my strategy made sense, adding it to my answer. :) \$\endgroup\$Samwise– Samwise2019年11月17日 01:33:42 +00:00Commented Nov 17, 2019 at 1:33
-
\$\begingroup\$ I think the way to implement the optimization I mentioned would be to write another function that searches the haystack for a set of needles (returning those that were found), and then use that for every haystack after the first. Another optimization after that would be to start with the smallest haystack (rather than the one that's first in the file), since that will start you with the smallest initial set of needles to search for. \$\endgroup\$Samwise– Samwise2019年11月17日 01:42:15 +00:00Commented Nov 17, 2019 at 1:42
dog
- each string containscbd
substring. How it should handle multiple common substrings with same length per each string? \$\endgroup\$crea_chiave()
may not work correctly if as[i:n+i]
is in the text more than 2 times (1st time it is added to lista1. 2nd time it is removed. 3rd time it is added again.). \$\endgroup\$