I have many substrings(2-5 words each) which I would like to search in some text of about 40-50 words length. What is the most efficient way to flag matching substrings.
Currently I am simply using:
for substring in substrings:
if substring in fullText:
return True
substrings - the list of strings to be searched
fullText - text to be searched on.
Worst case for this solution is all substrings are searched on fullText repeatedly.
1 Answer 1
Create a regular expression from your list (something like "word1|word2|word3") and use the regular expression functions available for your language. It will hopefully create a data structure optimized for matching, maybe a finite state machine or something equivalent.
-
1There's no "hopefully". Regular expressions are invariably converted to finite state machines.kevin cline– kevin cline04/15/2018 08:46:37Commented Apr 15, 2018 at 8:46
-
I've seen incredibly naive and stupid solutions to problems where known good algorithms exist, so I'm extra careful with words, but of course you are correct.Hans-Martin Mosner– Hans-Martin Mosner04/15/2018 09:38:36Commented Apr 15, 2018 at 9:38
-
For python on the test cases I ran, regular expression match worked slower compared to simple string search. Also wanted to mention, I am trying to search sentences within a sentence.skadoosh– skadoosh04/17/2018 11:50:25Commented Apr 17, 2018 at 11:50
-
1Compiling the regular expression consumes some time, so your results will depend on the number of patterns and the number of files. A simple string search might be faster for a single file, especially when you have an early match. To really evaluate the relative performance, measure realistic test cases. If your initial nested loop is fastest, you obviously don't need to optimize.Hans-Martin Mosner– Hans-Martin Mosner04/17/2018 12:29:57Commented Apr 17, 2018 at 12:29