This is the forth iteration of the Fast search for the longest repeating substrings in large text (Rev.3) code review. Special thanks goes to G. Sliepen who conducted all the reviews.
Functional specification
Having a large text and already built suffix array find the longest substrings (limited by the number of repetitions) getting benefit from already existing prefix array, if reasonable. Provide the version which would benefit from multithreading keeping the code as simple as possible.
Implementation
The fully functional demo (https://godbolt.org/z/5q4G4Er1x).
Reservation: current implementation utilizes execution policies, so using ranges/projections/views under the question; would work if multi-threading requirement from above met. The further details make sense only as addition to aforementioned third Code Review with answer to it.
Answers and comments to the questions on previous revision code review
- Using C++ from early 1990th I still can’t get fully accustomed to
auto
keyword. Thank you for catching this, return types in Index class have mostly been reworked. One concern is replacingsize_t
with auto forsize()
. Won’tsize_t
more readable for the client when they see documentation on source of the header in case the implementation is separated and they see onlyauto size()
instead ofsize_t size()
? Just still in two minds, since Standard C++ Library mostly usessize_t
, notauto
for return values. - On having
Index
as a class template and other functions not. My idea behind this is thatIndex
(andSubstringOccurences
) as well asatomic_fetch_max
are good candidates for types shared across all the project, since they are quite common and can be used everywhere with different types. So, I am fine to broke the insulation (physical encapsulation) and give up to some coupling and compilation time increase. At the same time, I don’t expect other functions on my project to work with other types, so I keep them functions to isolate in C++ file. See more in my explanations Rev.3/Generalization. In case I see a second usage with another type, I will rework the to a template. Is it reasonable or there is a good reason to make them generic in advance? I just don’t want to have most of my project in header files. As a second thought, if I separate functionality across many header files and include only when necessary, the difference shouldn’t be huge. - Having a second look at the usage of array
adjacent_matches
in bothcalculateAdjacents
andcollectRepeatingSubstrings
I understood that I don’t need to clear anything at all, since the only (repetitions-1) elements I don’t use incalculateAdjacents
don’t need in thecollectRepeatingSubstrings
. So, I removed the clearing from thecalculateAdjacents
and fixed the loop in thecollectRepeatingSubstrings
, now it runs up toadjacent_matches.size() - (repetitions - 1)
element:
std::vector<SubstringOccurences<>>
collectRepeatingSubstrings(const Index<>& index,
std::vector<index_t> adjacent_matches,
size_t repetitions,
size_t max_match_length)
{
std::vector<SubstringOccurences<>> result;
size_t j = 1;
for (size_t i = 0; i < adjacent_matches.size() - (repetitions - 1); i += j) {
if (adjacent_matches[i] == max_match_length) {
result.emplace_back(
std::span(index.get_text().begin() + index[i], max_match_length),
std::vector(&index[i], &index[i + repetitions])
);
}
for (j = 1; i+j < adjacent_matches.size()
&& max_match_length == findMatchLengthWithOffsetInSuffixArray(index, i, j); ++j)
;
}
return result;
}
Thank you for catching this, your pinpoint on adjacent_matches
usage helped a lot here not only in terms of this code, but in my view to such write-only operations.
- I totally agree on all your comments on the Remove responsibilities from findRepeatedSubsequences section, so for the moment I decided to stay with existing code, avoiding unnecessary overcomplication.
- Despite getting another function, I like your idea to extract
isStrongRepetition
fromcalculateAdjacents
, it really helps to improve the code structure.
Now it looks like this:
bool isStrongRepetition(size_t match_length, const Index<>& index, size_t idx, size_t repetitions) {
// Not more than min_repetitions times before item
if (idx > 0) {
size_t prev_match_length = findMatchLengthWithOffsetInSuffixArray(index, idx, -1);
if (prev_match_length == match_length)
return false;
}
// Not more than min_repetitions times after (idx+repetitions-1)
if (idx + repetitions < index.size()) {
size_t next_match_length = findMatchLengthWithOffsetInSuffixArray(index, idx, repetitions);
if (next_match_length == match_length)
return false;
}
return true;
}
// Returns found sequence(s) length
size_t calculateAdjacents(const Index<>& index,
std::vector<index_t>& adjacent_matches,
size_t repetitions,
bool strong_repetition_count)
{
size_t max_match_length = std::transform_reduce(std::execution::par_unseq,
index.begin(), index.end() - (repetitions - 1), size_t(0),
[&](size_t max_match_length, size_t match_length) -> size_t {
return std::max(max_match_length, match_length);
},
[&](const auto& item) -> size_t {
size_t idx = &item - index.data();
size_t match_length = findMatchLengthWithOffsetInSuffixArray(index, idx, (repetitions - 1));
if (strong_repetition_count && !isStrongRepetition(match_length, index, idx, repetitions)) {
match_length = 0;
}
adjacent_matches[idx] = (index_t)match_length;
return match_length;
});
return max_match_length;
}
- Totally agree on your comments on code size. One comment here, in my initial code (before I posted it here for review) the
findMatchLength
was inlined tofindRepeatedSubsequences
(see here, but they suggested to extract this function and journey has began). So, the very first version of the code was even shorter. Anyway, I totally agree with your summary here.
Other major changes
- Here is the reworked version of the
Index
class template:
template <typename LexemType=lexem_t, typename IndexType = index_t>
class Index {
private:
const std::vector<LexemType>& text;
const std::vector<IndexType>& index;
public:
Index(const auto& text, const auto& index) : text(text), index(index) {}
auto begin() const { return index.begin(); }
auto end() const { return index.end(); }
size_t size() const { return index.size(); }
auto data() const { return index.data(); }
const auto& operator[] (size_t idx) const { return index[idx]; }
const auto& get_text() const { return text; }
};
With the same question on the table, as I had in the previous code review round; it was my homework you gave me on the second round of code review to develop the name for it (and the approach itself). Are you agree with the solution for 'Index` or you had different or better ways to do this? See more reasoning on this class in the Rev.3 / Index class.
1 Answer 1
Answers to your questions
One concern is replacing size_t with
auto
forsize()
. Won’tsize_t
more readable for the client when they see documentation on source of the header in case the implementation is separated and they see onlyauto size()
instead ofsize_t size()
? Just still in two minds, since Standard C++ Library mostly usessize_t
, not auto for return values.
You have a point, but using an explicit return type for size()
but not for any of the other member functions is inconsistent. And, what if you change the type of container for index
, and index.size()
no longer returns a std::size_t
? I see two options:
- Just use
auto
. Maybe some IDEs can even deduce the actual type and show it. Also, do you really care about the exact type? The caller can also just useauto
to store the result, it doesn't need to know the exact type either. - Create type aliases derived from the container type. For example:
This is what the standard library does for all its containers, container adapters and many other classes.template <typename LexemType=lexem_t, typename IndexType = index_t> class Index { ... const std::vector<IndexType>& index; public: using size_type = decltype(index)::size_type; ... size_type size() const { return index.size(); } };
On having
Index
as a class template and other functions not. [...] Is it reasonable or there is a good reason to make them generic in advance?
I think you can apply the YAGNI principle here: unless you are really going to need it, don't make Index
a class template. The same goes for SubstringOccurences
.
Are you agree with the solution for
Index
or you had different or better ways to do this?
For the problem of finding repeated subsequences, I think Index
is fine. However, the name is very generic of course. So in a larger project, it should either be made more distinct, like SuffixArrayIndex
, or ensure it is not visible to other parts of the code, for example by ensuring it is only defined in a .cpp file, or by putting it in a namespace
.
Other ideas
I don't see much that can be improved (but maybe other reviewers will). Some ideas you might want to look into though:
- C++20 and later introduce ranges and views, and this allows for a more functional approach to writing algorithms. You could give that a go.
- You could
return
the result ofstd::transform_reduce()
directly without storing it in a variable first. On the other hand, this way you are giving a name to the value, which helps document the code.
-
\$\begingroup\$ Thank you so much for all your help here to get this much better than is was initially. Two questions, still: (1) I guess SuffixArrayIndex won't be a good name because
Index
represents any kind of index for underlying text, it is not tied to any particular indexing strategy (could be added as traits, I guess, btw!) or implementation; of course it will be packed to the respective namespace; (2) On ranges, doesn't my "Reservation: current implementation utilizes execution policies..." cover the reasons or can we reach the same without execution policies? \$\endgroup\$Damir Tenishev– Damir Tenishev2024年01月31日 23:59:35 +00:00Commented Jan 31, 2024 at 23:59