I have many substrings(2-5 words each) which I would like to search in some text of about 40-50 words length. What is the most efficient way to flag matching substrings.
Currently I am simply using:
for substring in substrings:
if substring in fullText:
return True
substrings - the list of strings to be searched
fullText - text to be searched on.
Worst case for this solution is all substrings are searched on fullText repeatedly.
1 Answer 1
Create a regular expression from your list (something like "word1|word2|word3") and use the regular expression functions available for your language. It will hopefully create a data structure optimized for matching, maybe a finite state machine or something equivalent.
-
1There's no "hopefully". Regular expressions are invariably converted to finite state machines.kevin cline– kevin cline2018年04月15日 08:46:37 +00:00Commented Apr 15, 2018 at 8:46
-
I've seen incredibly naive and stupid solutions to problems where known good algorithms exist, so I'm extra careful with words, but of course you are correct.Hans-Martin Mosner– Hans-Martin Mosner2018年04月15日 09:38:36 +00:00Commented Apr 15, 2018 at 9:38
-
For python on the test cases I ran, regular expression match worked slower compared to simple string search. Also wanted to mention, I am trying to search sentences within a sentence.skadoosh– skadoosh2018年04月17日 11:50:25 +00:00Commented Apr 17, 2018 at 11:50
-
1Compiling the regular expression consumes some time, so your results will depend on the number of patterns and the number of files. A simple string search might be faster for a single file, especially when you have an early match. To really evaluate the relative performance, measure realistic test cases. If your initial nested loop is fastest, you obviously don't need to optimize.Hans-Martin Mosner– Hans-Martin Mosner2018年04月17日 12:29:57 +00:00Commented Apr 17, 2018 at 12:29