I have a big collection of strings a[1], ..., a[N] where N is about several millions. I am provided a string m and I need to iterate over all the strings a[i] that contain m. In other words, I need to find all the strings a[i] having m as substring.
What would be the most efficient data structure for that problem? I need to be very careful with memory as I am working with a big number of strings.
-
Is this related to specific programming language or is it acceptable to use databases?Marek Sebera– Marek Sebera2013年11月30日 23:01:11 +00:00Commented Nov 30, 2013 at 23:01
-
you want a string searching algo? check boyer-moore and variantsratchet freak– ratchet freak2013年11月30日 23:11:37 +00:00Commented Nov 30, 2013 at 23:11
-
@MarekSebera database is acceptableIlya Gazman– Ilya Gazman2013年12月01日 07:45:53 +00:00Commented Dec 1, 2013 at 7:45
1 Answer 1
if you skip n
letters then you just need to check if the letter is in the first n
of the input string and then check back again to see if it matches
this means creating a data structure (just a 256 long array with information about where the string might start) for the input string but allows the other collection of strings to remain unordered
also check out this blog post, and the boyer moore algorithm