Fast search for the longest repeating substrings in large text (Rev.4)

Question 1

This is the forth iteration of the Fast search for the longest repeating substrings in large text (Rev.3) code review. Special thanks goes to G. Sliepen who conducted all the reviews.

Functional specification

Having a large text and already built suffix array find the longest substrings (limited by the number of repetitions) getting benefit from already existing prefix array, if reasonable. Provide the version which would benefit from multithreading keeping the code as simple as possible.

Implementation

The fully functional demo (https://godbolt.org/z/5q4G4Er1x).

Reservation: current implementation utilizes execution policies, so using ranges/projections/views under the question; would work if multi-threading requirement from above met. The further details make sense only as addition to aforementioned third Code Review with answer to it.

Answers and comments to the questions on previous revision code review

Using C++ from early 1990th I still can’t get fully accustomed to auto keyword. Thank you for catching this, return types in Index class have mostly been reworked. One concern is replacing size_t with auto for size(). Won’t size_t more readable for the client when they see documentation on source of the header in case the implementation is separated and they see only auto size() instead of size_t size()? Just still in two minds, since Standard C++ Library mostly uses size_t, not auto for return values.
On having Index as a class template and other functions not. My idea behind this is that Index (and SubstringOccurences) as well as atomic_fetch_max are good candidates for types shared across all the project, since they are quite common and can be used everywhere with different types. So, I am fine to broke the insulation (physical encapsulation) and give up to some coupling and compilation time increase. At the same time, I don’t expect other functions on my project to work with other types, so I keep them functions to isolate in C++ file. See more in my explanations Rev.3/Generalization. In case I see a second usage with another type, I will rework the to a template. Is it reasonable or there is a good reason to make them generic in advance? I just don’t want to have most of my project in header files. As a second thought, if I separate functionality across many header files and include only when necessary, the difference shouldn’t be huge.
Having a second look at the usage of array adjacent_matches in both calculateAdjacents and collectRepeatingSubstrings I understood that I don’t need to clear anything at all, since the only (repetitions-1) elements I don’t use in calculateAdjacents don’t need in the collectRepeatingSubstrings. So, I removed the clearing from the calculateAdjacents and fixed the loop in the collectRepeatingSubstrings, now it runs up to adjacent_matches.size() - (repetitions - 1) element:

std::vector<SubstringOccurences<>>
collectRepeatingSubstrings(const Index<>& index,
 std::vector<index_t> adjacent_matches,
 size_t repetitions,
 size_t max_match_length)
{
 std::vector<SubstringOccurences<>> result;
 size_t j = 1;
 for (size_t i = 0; i < adjacent_matches.size() - (repetitions - 1); i += j) {
 if (adjacent_matches[i] == max_match_length) {
 result.emplace_back(
 std::span(index.get_text().begin() + index[i], max_match_length),
 std::vector(&index[i], &index[i + repetitions])
 );
 }
 for (j = 1; i+j < adjacent_matches.size() 
 && max_match_length == findMatchLengthWithOffsetInSuffixArray(index, i, j); ++j)
 ;
 }
 return result;
}

Thank you for catching this, your pinpoint on adjacent_matches usage helped a lot here not only in terms of this code, but in my view to such write-only operations.

I totally agree on all your comments on the Remove responsibilities from findRepeatedSubsequences section, so for the moment I decided to stay with existing code, avoiding unnecessary overcomplication.
Despite getting another function, I like your idea to extract isStrongRepetition from calculateAdjacents, it really helps to improve the code structure.

Now it looks like this:

bool isStrongRepetition(size_t match_length, const Index<>& index, size_t idx, size_t repetitions) {
 // Not more than min_repetitions times before item
 if (idx > 0) {
 size_t prev_match_length = findMatchLengthWithOffsetInSuffixArray(index, idx, -1);
 if (prev_match_length == match_length)
 return false;
 }
 // Not more than min_repetitions times after (idx+repetitions-1)
 if (idx + repetitions < index.size()) {
 size_t next_match_length = findMatchLengthWithOffsetInSuffixArray(index, idx, repetitions);
 if (next_match_length == match_length)
 return false;
 }
 return true;
}
// Returns found sequence(s) length
size_t calculateAdjacents(const Index<>& index,
 std::vector<index_t>& adjacent_matches,
 size_t repetitions,
 bool strong_repetition_count)
{
 size_t max_match_length = std::transform_reduce(std::execution::par_unseq,
 index.begin(), index.end() - (repetitions - 1), size_t(0), 
 [&](size_t max_match_length, size_t match_length) -> size_t {
 return std::max(max_match_length, match_length);
 }, 
 [&](const auto& item) -> size_t {
 size_t idx = &item - index.data();
 
 size_t match_length = findMatchLengthWithOffsetInSuffixArray(index, idx, (repetitions - 1));
 if (strong_repetition_count && !isStrongRepetition(match_length, index, idx, repetitions)) {
 match_length = 0;
 }
 adjacent_matches[idx] = (index_t)match_length;
 return match_length;
 });
 return max_match_length;
}

Totally agree on your comments on code size. One comment here, in my initial code (before I posted it here for review) the findMatchLength was inlined to findRepeatedSubsequences (see here, but they suggested to extract this function and journey has began). So, the very first version of the code was even shorter. Anyway, I totally agree with your summary here.

Other major changes

Here is the reworked version of the Index class template:

template <typename LexemType=lexem_t, typename IndexType = index_t>
class Index {
private:
 const std::vector<LexemType>& text;
 const std::vector<IndexType>& index;
public:
 Index(const auto& text, const auto& index) : text(text), index(index) {}
 auto begin() const { return index.begin(); }
 auto end() const { return index.end(); }
 size_t size() const { return index.size(); }
 auto data() const { return index.data(); }
 const auto& operator[] (size_t idx) const { return index[idx]; }
 const auto& get_text() const { return text; }
};

With the same question on the table, as I had in the previous code review round; it was my homework you gave me on the second round of code review to develop the name for it (and the approach itself). Are you agree with the solution for 'Index` or you had different or better ways to do this? See more reasoning on this class in the Rev.3 / Index class.

Question 2

Answers to your questions

One concern is replacing size_t with auto for size(). Won’t size_t more readable for the client when they see documentation on source of the header in case the implementation is separated and they see only auto size() instead of size_t size()? Just still in two minds, since Standard C++ Library mostly uses size_t, not auto for return values.

You have a point, but using an explicit return type for size() but not for any of the other member functions is inconsistent. And, what if you change the type of container for index, and index.size() no longer returns a std::size_t? I see two options:

Just use auto. Maybe some IDEs can even deduce the actual type and show it. Also, do you really care about the exact type? The caller can also just use auto to store the result, it doesn't need to know the exact type either.

Create type aliases derived from the container type. For example:

template <typename LexemType=lexem_t, typename IndexType = index_t>
class Index {
 ...
 const std::vector<IndexType>& index;
public:
 using size_type = decltype(index)::size_type;
 ...
 size_type size() const { return index.size(); }
};

This is what the standard library does for all its containers, container adapters and many other classes.

On having Index as a class template and other functions not. [...] Is it reasonable or there is a good reason to make them generic in advance?

I think you can apply the YAGNI principle here: unless you are really going to need it, don't make Index a class template. The same goes for SubstringOccurences.

Are you agree with the solution for Index or you had different or better ways to do this?

For the problem of finding repeated subsequences, I think Index is fine. However, the name is very generic of course. So in a larger project, it should either be made more distinct, like SuffixArrayIndex, or ensure it is not visible to other parts of the code, for example by ensuring it is only defined in a .cpp file, or by putting it in a namespace.

Other ideas

I don't see much that can be improved (but maybe other reviewers will). Some ideas you might want to look into though:

C++20 and later introduce ranges and views, and this allows for a more functional approach to writing algorithms. You could give that a go.
You could return the result of std::transform_reduce() directly without storing it in a variable first. On the other hand, this way you are giving a name to the value, which helps document the code.

Question 3

Thank you so much for all your help here to get this much better than is was initially. Two questions, still: (1) I guess SuffixArrayIndex won't be a good name because Index represents any kind of index for underlying text, it is not tied to any particular indexing strategy (could be added as traits, I guess, btw!) or implementation; of course it will be packed to the respective namespace; (2) On ranges, doesn't my "Reservation: current implementation utilizes execution policies..." cover the reasons or can we reach the same without execution policies?

G. Sliepen G. Sliepen 68.9k3 gold badges74 silver badges180 bronze badges · Accepted Answer · 2024-01-31 22:47:17Z

Answers to your questions

One concern is replacing size_t with auto for size(). Won’t size_t more readable for the client when they see documentation on source of the header in case the implementation is separated and they see only auto size() instead of size_t size()? Just still in two minds, since Standard C++ Library mostly uses size_t, not auto for return values.

You have a point, but using an explicit return type for size() but not for any of the other member functions is inconsistent. And, what if you change the type of container for index, and index.size() no longer returns a std::size_t? I see two options:

Just use auto. Maybe some IDEs can even deduce the actual type and show it. Also, do you really care about the exact type? The caller can also just use auto to store the result, it doesn't need to know the exact type either.

Create type aliases derived from the container type. For example:

template <typename LexemType=lexem_t, typename IndexType = index_t>
class Index {
 ...
 const std::vector<IndexType>& index;
public:
 using size_type = decltype(index)::size_type;
 ...
 size_type size() const { return index.size(); }
};

This is what the standard library does for all its containers, container adapters and many other classes.

On having Index as a class template and other functions not. [...] Is it reasonable or there is a good reason to make them generic in advance?

I think you can apply the YAGNI principle here: unless you are really going to need it, don't make Index a class template. The same goes for SubstringOccurences.

Are you agree with the solution for Index or you had different or better ways to do this?

For the problem of finding repeated subsequences, I think Index is fine. However, the name is very generic of course. So in a larger project, it should either be made more distinct, like SuffixArrayIndex, or ensure it is not visible to other parts of the code, for example by ensuring it is only defined in a .cpp file, or by putting it in a namespace.

Other ideas

I don't see much that can be improved (but maybe other reviewers will). Some ideas you might want to look into though:

C++20 and later introduce ranges and views, and this allows for a more functional approach to writing algorithms. You could give that a go.
You could return the result of std::transform_reduce() directly without storing it in a variable first. On the other hand, this way you are giving a name to the value, which helps document the code.

Thank you so much for all your help here to get this much better than is was initially. Two questions, still: (1) I guess SuffixArrayIndex won't be a good name because Index represents any kind of index for underlying text, it is not tied to any particular indexing strategy (could be added as traits, I guess, btw!) or implementation; of course it will be packed to the respective namespace; (2) On ranges, doesn't my "Reservation: current implementation utilizes execution policies..." cover the reasons or can we reach the same without execution policies?

Stack Exchange Network

Fast search for the longest repeating substrings in large text (Rev.4)

1 Answer 1

Answers to your questions

Other ideas

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Fast search for the longest repeating substrings in large text (Rev.4)

1 Answer 1

Answers to your questions

Other ideas

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions