Fast search for the longest repeating substrings in large text (Rev.2)

Question 1

This is the second iteration of the Fast search for the longest repeating substrings in large text code review. Special thanks goes to G. Sliepen who conducted the first review.

Functional specification

Having a large text and already built suffix array find the longest substrings (limited by the number of repetitions) getting benefit from already existing prefix array, if reasonable. Provide the version which would benefit from multithreading keeping the code as simple as possible.

(*One thing I am still in two minds about here in the spec. is should the function return the longest substring which repeats the exact number of times or "this number and more". The problem with the more precise former is that the line which repeated N times may be shorter than line repeated N+1 and this could be not what the user want to see. Again provide options? It will overcomplicate the solution.)

Implementation

The fully functional demo.

Reservation: current implementation utilizes execution policies, so using ranges/projections/views under the question; would work if multi-threading requirement from above met.

The further details make sense only as addition to aforementioned first Code Review with answer to it.

List of major changes

The interface was redesigned according to the proposal and a little bit even more. Since function could fine many different substrings which will repeat the specified number of times, it returns the vector of string_views with all entries in the text so that many resulting substrings with their positions in text could be returned.
Prefix array reworked to suffix array which allowed to use algorithms in forward direction.
Proposed version of atomic_fetch_max is used instead of update_max.
Defect related to returning duplicates, described in first review fixed. To make sure that function finds substring with exact number of repetitions border checks have been added in the calculateAdjacents functions.
Two versions of calculateAdjacents function provided just for code review to make sure I got correctly the way to implement with std::transform_reduce. They can be switched by macro at the beginning of the code. Please don’t try to remove duplication code because caused by this, only one function will survive. I will conduct performance testing on real data shortly.
Function findMatchLengthWithOffsetInSuffixArray extracted to avoid complex code in calculateAdjacents functions.

My concerns on the current code

To fix the defect when function finds substrings which are repeated not exactly requested number of times, but even more frequently I had to add border checks to ensure the number of repeats (see 2 extra calls to findMatchLengthWithOffsetInSuffixArray in calculateAdjacents functions and I hate this code most because of "almost duplication" of this code. Still looking for a way to collapse it in something shorter.

// Returns found sequence(s) length
size_t calculateAdjacentsByForEach(const std::string& text, 
 const std::vector<index_t>& index, 
 std::vector<index_t>& adjacent_matches, 
 size_t repetitions)
{
 std::fill(std::execution::par_unseq, adjacent_matches.begin(), adjacent_matches.end(), 0);
 std::atomic<size_t> max_match_length = 0;
 std::for_each(std::execution::par_unseq, index.begin(), index.end() - (repetitions - 1), [&](const index_t& item) {
 size_t idx = &item - std::data(index);
 // Meets exactly min_repetitions times
 size_t match_length = findMatchLengthWithOffsetInSuffixArray(text, index, idx, (repetitions - 1));
 // Not more than min_repetitions times before item
 if (&item > std::data(index)) {
 size_t prev_match_length = findMatchLengthWithOffsetInSuffixArray(text, index, idx, -1);
 if (prev_match_length == match_length)
 return;
 }
 // Not more than min_repetitions times after item
 if ((size_t)((&item + repetitions) - std::data(index)) < index.size()) {
 size_t next_match_length = findMatchLengthWithOffsetInSuffixArray(text, index, idx, repetitions);
 if (next_match_length == match_length)
 return;
 }
 adjacent_matches[idx] = (index_t)match_length;
 atomic_fetch_max(&max_match_length, match_length); 
 });
 return max_match_length;
}

I still dislike the max_repetitions threshold. It seems that it is better to limit by maximum and minimum length of found substrings, although this is not so straightforward, since by a new criterion line repeated (exactly) 3 times could be longer than line repeated 2 times, so it is hard to develop stop-criteria this way.

The inner loop in function collectRepeatingSubstrings is still hard to read.

struct SubstringOccurences {
 std::string_view str;
 std::vector<index_t> pos;
};
std::vector<SubstringOccurences>
collectRepeatingSubstrings(const std::string& text,
 const std::vector<index_t>& index,
 std::vector<index_t> adjacent_matches,
 size_t repetitions,
 size_t max_match_length)
{
 std::vector<SubstringOccurences> result;
 for (size_t i = 0; i < adjacent_matches.size() - 1; ++i) {
 if (adjacent_matches[i] == max_match_length) {
 SubstringOccurences occur;
 occur.str = std::string_view(text.data() + index[i], max_match_length);
 for (size_t match = i; match <= std::min(i + (repetitions - 1), adjacent_matches.size() - 1); ++match) {
 occur.pos.emplace_back(index[match]);
 }
 result.emplace_back(occur);
 }
 }
 return result;
}

I really hate the idea to pass 5 parameters to collectRepeatingSubstrings function despite its benefits. Too many and too way easy to mix parameters, especially when comes it last size_t variables.Well, text, index and adjacent_matches could be wrapped in one structure, but the naming and composing question is still open: The best place to store index to data
Four parameters passed to calculateAdjacents functions is not good, as well.
Having both findMatchLengthWithOffsetInSuffixArray and findMatchLength now seems to be redundant, but I don’t see a good way to get rid of one of them.

I would be happy to avoid this std::fill at the beginning of calculateAdjacents function, but I am not sure if this is possible.

size_t findMatchLength(std::string_view str1, std::string_view str2) {
 return std::ranges::mismatch(str1, str2).in1 - str1.begin();
}
size_t findMatchLengthWithOffsetInSuffixArray(const std::string& text,
 const std::vector<index_t>& index, 
 const size_t idx,
 ptrdiff_t offset)
{
 return findMatchLength(std::string_view(text.begin() + index[idx], text.end()),
 std::string_view(text.begin() + index[idx + offset], text.end()));
}

To put in a nutshell, despite the obvious fact that the code now has fixed defect and better decomposed, I still don’t like it because of the following reasons:

Too many helper functions in the scope of high-level functions. This is in my way as a client of this module (namespace) to understand what is the interface and what is the implementation (even when divided to h and cpp files).
Some code is still hard to read.
Functions have too many arguments which make it hard to call them. Seems like I am "cutting artificially something alive in a wrong place", leaving too many connections.
Utility functions are still not "generic" for future reuse. Out of context of this task it is hard to explain and document what for and how (with which preconditions, postconditions, etc.) they could be used.

Is there any way to improve this code? Any mistakes or issues? Could you please help conducting further steps of code review?

Question 2

It's always nice to see people improve their code after a review!

About your concerns

One thing I am still in two minds about here in the spec. is should the function return the longest substring which repeats the exact number of times or "this number and more".

What behavior does the user need? If the choice doesn't matter for some reason, then just pick one and go with it.

To fix the defect when function finds substrings which are repeated not exactly requested number of times, but even more frequently I had to add border checks to ensure the number of repeats (see 2 extra calls to findMatchLengthWithOffsetInSuffixArray in calculateAdjacents functions and I hate this code most because of "almost duplication" of this code. Still looking for a way to collapse it in something shorter.

If you go for the "this number or more" version, then you can remove one of the calls to findMatchLengthWithOffsetInSuffixArray(). But then there is another subtle defect: when it returns the longest equal sequence repeated at least N times, it will only collect exactly N positions of the repeated sequences, even if there are more.

I still dislike the max_repetitions threshold. It seems that it is better to limit by maximum and minimum length of found substrings, although this is not so straightforward, since by a new criterion line repeated (exactly) 3 times could be longer than line repeated 2 times, so it is hard to develop stop-criteria this way.

If you set max_repetitions to text.size() you will have a valid stop-criterion. This will cause a lot of work to be done for long inputs though.

A better criterion can be found by analyzing the suffix array, and finding the longest consecutive sequence of entries that have the same first character, and use the length of that sequence as max_repetitions.

The inner loop in function collectRepeatingSubstrings is still hard to read.

Let's look at that inner loop:

for (size_t match = i; match <= std::min(i + (repetitions - 1), adjacent_matches.size() - 1); ++match) {
 occur.pos.emplace_back(index[match]);
}

Why is the call to std::min() there? We already know we have at least repetitions entries starting at i. So we can remove that. Then we can let the loop go from 0 to repetitions:

for (size_t match = 0; match < repetitions; ++match) {
 occur.pos.emplace_back(index[i + match]);
}

Maybe now it's easier to see that we are just copying a slice from index into occur.pos. But your spider senses should be tingling at this point: surely the standard library has some way to copy a subrange from a vector into another vector? There are indeed various ways to do this. One of them is by using the overload of std::vector's constructor that takes two iterators. Now you don't need the inner loop at all anymore, and you can just write:

for (size_t i = 0; i < adjacent_matches.size() - 1; ++i) {
 if (adjacent_matches[i] == max_match_length) {
 result.emplace_back(
 std::string_view(text.data() + index[i], max_match_length),
 std::vector(&index[i], &index[i + repetitions])
 );
 }
}

I really hate the idea to pass 5 parameters to collectRepeatingSubstrings function despite its benefits. Too many and too way easy to mix parameters, especially when comes it last size_t variables. Well, text, index and adjacent_matches could be wrapped in one structure, but the naming and composing question is still open: [The best place to store index to data][3]

I think you've already answered it mostly yourself. Now you just need to come up with a good name for that structure. I would try this yourself first, as this is a good exercise.

Four parameters passed to calculateAdjacents functions is not good, as well.

That is solved in exactly the same way.

Having both findMatchLengthWithOffsetInSuffixArray and findMatchLength now seems to be redundant, but I don’t see a good way to get rid of one of them.

Since findMatchLength() is only ever called from findMatchLengthWithOffsetInSuffixArray(), you could inline the former into the latter. But why not just keep them both?

I would be happy to avoid this std::fill' at the beginning of calculateAdjacents` function, but I am not sure if this is possible.

There are several ways to do this. First, you can just use std::vector::assign():

adjacent_matches.assign(index.size(), 0);

But instead of passing it as an out-parameter, just make it part of the return value. Then you can just initialize a fresh new vector:

struct AdjacentMatches {
 std::size_t max_match_length;
 std::vector<std::size_t> adjacent_matches;
};
AdjacentMatches calculateAdjacentsByForEach(const std::string& text, 
 const std::vector<index_t>& index, 
 size_t repetitions)
{
 AdjacentMatches result{0, std::vector<std::size_t>(index.size(), 0)};
 ...
 result.adjacent_matches[idx] = match_length;
 ...
 result.max_match_length = max_match_length;
}

And then you have a nice struct to pass to collectRepeatingSubstrings(). The only possible drawback is that these methods don't allow for parallelism. But unless the input is huge, I don't think it matters, and even if it does, consider that zeroing a vector would probably be memory bandwidth bound instead of CPU bound.

Use `std::string_view` from the very start

If findRepeatedSubsequences() doesn't need to modify the passed in string, make it take it as a std::string_view instead of a const std::string&. The caller doesn't have to change anything. But we can avoid the ugly conversion of a substring to std::string_view, and instead write:

occur.str = text.substr(index[i], max_match_length);

And:

size_t findMatchLengthWithOffsetInSuffixArray(std::string_vioew text,
 const std::vector<index_t>& index, 
 const size_t idx,
 ptrdiff_t offset)
{
 return findMatchLength(text.substr(index[idx]),
 text.substr(index[idx + offset]));
}

Question 3

G. Sliepen, thank you again for your time and help here. (Improving code is the only way to make sure that I understood your review and the only way to get next corrections, so for sure my goal is to get to the best code we could get here.). I reworked the code according to your recommendations in posted in Fast search for the longest repeating substrings in large text (Rev.3). Could you please take a look and continue the code review for it?

G. Sliepen G. Sliepen 68.7k3 gold badges74 silver badges179 bronze badges · Accepted Answer · 2024-01-22 21:55:13Z

It's always nice to see people improve their code after a review!

About your concerns

One thing I am still in two minds about here in the spec. is should the function return the longest substring which repeats the exact number of times or "this number and more".

What behavior does the user need? If the choice doesn't matter for some reason, then just pick one and go with it.

To fix the defect when function finds substrings which are repeated not exactly requested number of times, but even more frequently I had to add border checks to ensure the number of repeats (see 2 extra calls to findMatchLengthWithOffsetInSuffixArray in calculateAdjacents functions and I hate this code most because of "almost duplication" of this code. Still looking for a way to collapse it in something shorter.

If you go for the "this number or more" version, then you can remove one of the calls to findMatchLengthWithOffsetInSuffixArray(). But then there is another subtle defect: when it returns the longest equal sequence repeated at least N times, it will only collect exactly N positions of the repeated sequences, even if there are more.

I still dislike the max_repetitions threshold. It seems that it is better to limit by maximum and minimum length of found substrings, although this is not so straightforward, since by a new criterion line repeated (exactly) 3 times could be longer than line repeated 2 times, so it is hard to develop stop-criteria this way.

If you set max_repetitions to text.size() you will have a valid stop-criterion. This will cause a lot of work to be done for long inputs though.

A better criterion can be found by analyzing the suffix array, and finding the longest consecutive sequence of entries that have the same first character, and use the length of that sequence as max_repetitions.

The inner loop in function collectRepeatingSubstrings is still hard to read.

Let's look at that inner loop:

for (size_t match = i; match <= std::min(i + (repetitions - 1), adjacent_matches.size() - 1); ++match) {
 occur.pos.emplace_back(index[match]);
}

Why is the call to std::min() there? We already know we have at least repetitions entries starting at i. So we can remove that. Then we can let the loop go from 0 to repetitions:

for (size_t match = 0; match < repetitions; ++match) {
 occur.pos.emplace_back(index[i + match]);
}

Maybe now it's easier to see that we are just copying a slice from index into occur.pos. But your spider senses should be tingling at this point: surely the standard library has some way to copy a subrange from a vector into another vector? There are indeed various ways to do this. One of them is by using the overload of std::vector's constructor that takes two iterators. Now you don't need the inner loop at all anymore, and you can just write:

for (size_t i = 0; i < adjacent_matches.size() - 1; ++i) {
 if (adjacent_matches[i] == max_match_length) {
 result.emplace_back(
 std::string_view(text.data() + index[i], max_match_length),
 std::vector(&index[i], &index[i + repetitions])
 );
 }
}

I really hate the idea to pass 5 parameters to collectRepeatingSubstrings function despite its benefits. Too many and too way easy to mix parameters, especially when comes it last size_t variables. Well, text, index and adjacent_matches could be wrapped in one structure, but the naming and composing question is still open: [The best place to store index to data][3]

I think you've already answered it mostly yourself. Now you just need to come up with a good name for that structure. I would try this yourself first, as this is a good exercise.

Four parameters passed to calculateAdjacents functions is not good, as well.

That is solved in exactly the same way.

Having both findMatchLengthWithOffsetInSuffixArray and findMatchLength now seems to be redundant, but I don’t see a good way to get rid of one of them.

Since findMatchLength() is only ever called from findMatchLengthWithOffsetInSuffixArray(), you could inline the former into the latter. But why not just keep them both?

I would be happy to avoid this std::fill' at the beginning of calculateAdjacents` function, but I am not sure if this is possible.

There are several ways to do this. First, you can just use std::vector::assign():

adjacent_matches.assign(index.size(), 0);

But instead of passing it as an out-parameter, just make it part of the return value. Then you can just initialize a fresh new vector:

struct AdjacentMatches {
 std::size_t max_match_length;
 std::vector<std::size_t> adjacent_matches;
};
AdjacentMatches calculateAdjacentsByForEach(const std::string& text, 
 const std::vector<index_t>& index, 
 size_t repetitions)
{
 AdjacentMatches result{0, std::vector<std::size_t>(index.size(), 0)};
 ...
 result.adjacent_matches[idx] = match_length;
 ...
 result.max_match_length = max_match_length;
}

And then you have a nice struct to pass to collectRepeatingSubstrings(). The only possible drawback is that these methods don't allow for parallelism. But unless the input is huge, I don't think it matters, and even if it does, consider that zeroing a vector would probably be memory bandwidth bound instead of CPU bound.

Use `std::string_view` from the very start

If findRepeatedSubsequences() doesn't need to modify the passed in string, make it take it as a std::string_view instead of a const std::string&. The caller doesn't have to change anything. But we can avoid the ugly conversion of a substring to std::string_view, and instead write:

occur.str = text.substr(index[i], max_match_length);

And:

size_t findMatchLengthWithOffsetInSuffixArray(std::string_vioew text,
 const std::vector<index_t>& index, 
 const size_t idx,
 ptrdiff_t offset)
{
 return findMatchLength(text.substr(index[idx]),
 text.substr(index[idx + offset]));
}

G. Sliepen, thank you again for your time and help here. (Improving code is the only way to make sure that I understood your review and the only way to get next corrections, so for sure my goal is to get to the best code we could get here.). I reworked the code according to your recommendations in posted in Fast search for the longest repeating substrings in large text (Rev.3). Could you please take a look and continue the code review for it?

Stack Exchange Network

Fast search for the longest repeating substrings in large text (Rev.2)

1 Answer 1

About your concerns

Use `std::string_view` from the very start

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Fast search for the longest repeating substrings in large text (Rev.2)

1 Answer 1

About your concerns

Use std::string_view from the very start

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions

Use `std::string_view` from the very start