I am trying to use the suffix array, and the LCP array to count all distinct substrings of a specified length.
I started with the algorithm for counting ALL distinct substrings. I solved it after this explanation: https://www.geeksforgeeks.org/count-distinct-substrings-string-using-suffix-array/ . The problem right now is that I can't figure out how to only count the substrings of my given length, and not all possibilities.
I also computed the array which tells me how many suffixes are lexicographically smaller than suffix_array[i].
Any ideas what I could use to do it?
1 Answer 1
The geeksforgeeks solution gives an efficient algorithm that accomplishes the following:
- Sorts the suffixes of the input string in lexicographic order.
- For every two adjacent suffixes in this order, finds the longest common prefix.
If you are after the number of distinct substrings of length $\ell$, you should proceed as follows:
- If the first suffix has length at least $\ell$, initialize your counter with 1, otherwise with 0.
- Now go over all the suffixes in lexicographic order. If the new suffix has length at least $\ell$ and the lcp with the preceding suffix has length smaller than $\ell$, then increment the counter.
For example, suppose that the string is ababa
, and you're after substrings of length $\ell = 3$. The ordered suffixes are a
, aba
, ababa
, ba
, baba
. The algorithm proceeds as follows:
- The first suffix
a
is shorter than $\ell$, so the counter starts at 0. - The second suffix
aba
has length at least $\ell$, and the lcp with the preceding suffix has length less than $\ell$, so increment the counter. (We have discoveredaba
.) - The third suffix
ababa
has length at least $\ell$, but the lcp with the preceding suffix also has length at least $\ell$. - The fourth suffix
ba
has length less than $\ell$. - The fifth suffix
baba
has length at least $\ell$, and the lcp with the preceding suffix has length less than $\ell$, so increment the counter. (We have discoveredbab
.)