My problem starts with a list A (length around n = 100) of big strings (around 10000 characters each). I also have another q = 10000 strings of length 100. I want to check if each string is a substring of any element of the list A.
I've tried to do this using inor any but it is still taking too much time as there are 10000 iterations and in each iteration I'm checking if s of length 100 is in str of length 10000.
n,q=[int(item) for item in input().split()]
desc=[]
for i in range(n):
desc.append(input())
desc="\t".join(desc)
for j in range(q):
quest=input().strip()
if quest in desc:
print("It's in !")
else:
print("It's not in ..")
Is there any better way to do this much faster?
Note: The numbers I'm expliciting are upper bounds not exact values of the lengths.
1 Answer 1
The problem of finding matches for multiple fixed search strings in a corpus is solved by the Aho–Corasick algorithm in time proportional to the length of the corpus plus the number of matches.
Python doesn't come with an implementation of the Aho–Corasick algorithm (as far as I know), but the Python Package Index has the pyahocorasick package. Or you could write your own.
Alternatively, if you are on a Unix system, you can use the -F
(fixed strings) option to grep
and avoid Python altogether.
Explore related questions
See similar questions with these tags.
quest
always going to be exactly 100 characters? \$\endgroup\$