Efficient multiple substrings search

Question 1

I have many substrings (2-5 words each) that I would like to search for in a text of about 40-50 words. What is the most efficient way to flag matching substrings?

Currently I am simply using:

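    def contains_any(substrings, fullText):
        # Return True as soon as any substring occurs in fullText.
        for substring in substrings:
            if substring in fullText:
                return True
        return False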

substrings - the list of strings to search for

fullText - the text to search in

In the worst case (no substring matches), this solution scans fullText once for every substring.

Answer 1 (Hans-Martin Mosner)

Create a regular expression from your list (something like "word1|word2|word3") and use the regular expression functions available for your language. It will hopefully create a data structure optimized for matching, maybe a finite state machine or something equivalent.
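A minimal Python sketch of this suggestion, assuming the phrases are plain text rather than patterns and that the list is non-empty. The helper names `build_matcher` and `find_matches` and the sample data are illustrative only and do not come from the thread. Each phrase is passed through `re.escape` so that punctuation is matched literally, and the pattern is compiled once so it can be reused across texts.

```python
import re

def build_matcher(substrings):
    # Escape each phrase so characters such as "." or "?" match literally.
    # Longer phrases go first so the alternation prefers them when two
    # phrases start at the same position.
    ordered = sorted(set(substrings), key=len, reverse=True)
    return re.compile("|".join(re.escape(s) for s in ordered))

def find_matches(pattern, full_text):
    # Collect the distinct phrases found in a single left-to-right scan.
    # finditer reports non-overlapping matches only, so a phrase that
    # overlaps an earlier match can be missed.
    return sorted({m.group(0) for m in pattern.finditer(full_text)})

# Hypothetical sample data.
substrings = ["quick brown fox", "lazy dog", "purple cow"]
full_text = "the quick brown fox jumps over the lazy dog"

pattern = build_matcher(substrings)  # compile once, reuse for every text
print(pattern.search(full_text) is not None)  # yes/no flag
print(find_matches(pattern, full_text))       # which phrases matched
```

If only a yes/no flag is needed, `pattern.search` is enough and stops at the first hit. How the compiled pattern is represented internally depends on the regex engine, so the speed-up should be measured rather than assumed.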

Comment 1 (kevin cline)

There's no "hopefully". Regular expressions are invariably converted to finite state machines.

Comment 2 (Hans-Martin Mosner)

I've seen incredibly naive and stupid solutions to problems where known good algorithms exist, so I'm extra careful with words, but of course you are correct.

Comment 3 (skadoosh)

In Python, on the test cases I ran, regular expression matching was slower than the simple string search. I should also mention that I am trying to find sentences within a sentence.

Comment 4 (Hans-Martin Mosner)

Compiling the regular expression consumes some time, so your results will depend on the number of patterns and the number of files. A simple string search might be faster for a single file, especially when you have an early match. To really evaluate the relative performance, measure realistic test cases. If your initial nested loop is fastest, you obviously don't need to optimize.
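One way to run such a measurement in Python is with the standard `timeit` module. The sketch below is only a starting point: the phrase list, the text, and the repetition counts are made-up stand-ins (here a worst case with no match), and the function names are illustrative. Real inputs should replace them before drawing conclusions.

```python
import re
import timeit

def loop_search(substrings, full_text):
    # Plain substring test, one phrase at a time; stops at the first hit.
    return any(s in full_text for s in substrings)

def regex_search(pattern, full_text):
    # Single scan with a precompiled alternation.
    return pattern.search(full_text) is not None

# Hypothetical test data: 200 three-word phrases and a 45-word text
# that contains none of them (the worst case for the loop).
substrings = [f"phrase number {i}" for i in range(200)]
full_text = " ".join(f"word{i}" for i in range(45))

source = "|".join(re.escape(s) for s in substrings)
pattern = re.compile(source)

def compile_fresh():
    # re caches compiled patterns, so clear the cache to time a real compile.
    re.purge()
    return re.compile(source)

loop_time = timeit.timeit(lambda: loop_search(substrings, full_text), number=1000)
regex_time = timeit.timeit(lambda: regex_search(pattern, full_text), number=1000)
compile_time = timeit.timeit(compile_fresh, number=20) / 20

print(f"loop:    {loop_time:.4f} s per 1000 searches")
print(f"regex:   {regex_time:.4f} s per 1000 searches")
print(f"compile: {compile_time:.6f} s per compilation")
```

The compile time is reported separately because it is paid once per phrase list, while the search time is paid once per text. Which approach wins depends on how many texts share the same list.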

score 5 · Answer 1 · 2018-04-15 08:40:23Z

Answer 1: Hans-Martin Mosner, answered Apr 15, 2018 at 8:40
Comment 1: kevin cline, Apr 15, 2018 at 8:46
Comment 2: Hans-Martin Mosner, Apr 15, 2018 at 9:38
Comment 3: skadoosh, Apr 17, 2018 at 11:50
Comment 4: Hans-Martin Mosner, Apr 17, 2018 at 12:29
