Code Review

Before these code reviews I raised the question "The best place to store index to data", where I didn’t get better proposals than my approach of wrapping the references to the index and the indexed text into an Index structure:

  1. On the dilemma between "this number or more" and "exactly this number": it is really funny, but I tested both on real texts in a real environment, and they give much the same results; the difference is subtle. The main goal was to clean up the text base from duplicates (like the same story appearing in different collections of works, etc.). The same goes for the duplication defect from the first version. By this I mean that all defects, of course, must be fixed, and an option for the different search modes should be provided, but in a real environment even a "slightly erroneous version" is "good enough". Just a funny side note.
  2. I tested both versions of calculateAdjacents, with std::for_each and with std::transform_reduce, on real data. You are correct, there is no significant difference in performance, which is highly reasonable, but since I am aiming to port this to a GPU with many cores later, I decided to stay with std::transform_reduce, which should give some speed-up on a GPU, I believe.
  3. On keeping findMatchLength: this is a kind of adherence to Occam's razor, avoiding unnecessary entities, and a rule of thumb like "If you can’t explain in simple words (preferably by the function name) what it does generically, out of the context of the current task, avoid creating it, or at least avoid making it visible to the client." For this particular function, I can’t explain in simple words what it does without this task's context.
  4. On "I would be happy to avoid this std::fill at the beginning of the calculateAdjacents function, but I am not sure if this is possible": I most likely worded my intention poorly. I want to avoid clearing the vector one way or another. I am not concerned with how it is done; I am concerned with the algorithm that requires it, and I am looking for a way to avoid such an expensive, memory-bound operation. I will do my best to change the algorithm to avoid this in subsequent versions. At the same time, I don’t want to move the vector itself into calculateAdjacents, because that would allocate and release this array on each function call. Taking into account that its size equals the text size, on large text bases this could be quite expensive; this is the reason I create the array at a higher level.
  5. In real code I wouldn't use std::string; I will use a std::vector of lexem identifiers (unsigned int) (actually, even a bit deeper – identifiers of categories or clusters of lexems, but this doesn’t matter at the moment). That’s why in the next revision I will move from std::string to std::vector, just to better understand how to manipulate sequences in the absence of the std::string and std::string_view functions.

Here is the current implementation of calculateAdjacents:

// Returns found sequence(s) length
size_t calculateAdjacents(const Index<>& index,
                          std::vector<index_t>& adjacent_matches,
                          size_t repetitions,
                          bool strong_repetition_count)
{
    std::fill(std::execution::par_unseq, adjacent_matches.begin(), adjacent_matches.end(), 0);
    size_t max_match_length = std::transform_reduce(/*std::execution::par_unseq,*/
        index.begin(), index.end() - (repetitions - 1), size_t(0),
        [&](size_t max_match_length, size_t match_length) -> size_t {
            return std::max(max_match_length, match_length);
        },
        [&](const auto& item) -> size_t {
            size_t idx = &item - std::data(index.index);

            size_t match_length = findMatchLengthWithOffsetInSuffixArray(index, idx, repetitions - 1);
            if (strong_repetition_count) {
                // No match of the same length immediately before item
                if (&item > std::data(index.index)) {
                    size_t prev_match_length = findMatchLengthWithOffsetInSuffixArray(index, idx, -1);
                    if (prev_match_length == match_length)
                        return 0;
                }
                // No match of the same length immediately after (item + repetitions - 1)
                if ((size_t)((&item + repetitions) - std::data(index.index)) < index.size()) {
                    size_t next_match_length = findMatchLengthWithOffsetInSuffixArray(index, idx, repetitions);
                    if (next_match_length == match_length)
                        return 0;
                }
            }
            adjacent_matches[idx] = (index_t)match_length;
            return match_length;
        });
    return max_match_length;
}
  1. Since in the real project I will use lexem identifiers instead of plain text, I did some "emulation" of this in code, making a pseudo tokenizer which, for the sake of simplicity, just converts character codes to ints. This is just a trick to get to the real types I use and to move away from std::string and std::string_view. Another step toward making the algorithm generic, meanwhile.
  2. The function parameters text and index were replaced with the Index template, which will be covered below.
  3. The "subtle" defect in the "this number or more" version ("weak count") mentioned by G. Sliepen has been fixed in the collectRepeatingSubstrings function.
  4. The max_repetitions threshold was removed; it is not needed even on large data sets, with no significant performance impact or memory footprint on real data.
  5. The collectRepeatingSubstrings function was reworked according to the recommendations. The new version is much more readable.
  6. The findMatchLength function was inlined into findMatchLengthWithOffsetInSuffixArray to avoid swamping the namespace.
  7. Introduced the option strong_repetition_count to control the way the algorithm handles the repetitions parameter: "find substrings with exactly this number of repetitions" = "strong count", and "find substrings with this number of repetitions or more" = "weak count".

So, I introduced the following type aliases:

using lexem_t = unsigned int;
using index_t = unsigned int;

Here it is:

struct SubstringOccurences {
    std::span<const lexem_t> sequence;
    std::vector<index_t> pos;
};

std::vector<SubstringOccurences>
collectRepeatingSubstrings(const Index<>& index,
                           std::vector<index_t> adjacent_matches,
                           size_t repetitions,
                           size_t max_match_length)
{
    std::vector<SubstringOccurences> result;
    size_t j = 1;
    for (size_t i = 0; i < adjacent_matches.size() - 1; i += j) {
        if (adjacent_matches[i] == max_match_length) {
            result.emplace_back(
                std::span(index.text.begin() + index[i], max_match_length),
                std::vector(&index[i], &index[i + repetitions])
            );
        }
        // Skip the run of suffix-array entries sharing the same match length
        for (j = 1; i + j < adjacent_matches.size()
             && max_match_length == findMatchLengthWithOffsetInSuffixArray(index, i, j); ++j)
            ;
    }
    return result;
}

Here is how it looks now:

size_t findMatchLengthWithOffsetInSuffixArray(const Index<>& index,
                                              const size_t idx,
                                              ptrdiff_t offset)
{
    const auto& text = index.text;
    return std::mismatch(text.begin() + index[idx], text.end(),
                         text.begin() + index[idx + offset], text.end()).first
           - (text.begin() + index[idx]);
}
  1. I started with a dozen lines of code, and now it has grown to 150+ lines, most of which are still not reusable. I had one function; now I have 2 new types and 4 functions. Well, some defects were fixed, which complicated the code a bit, but mostly I paid for functional decomposition. I am still in two minds about whether the price is reasonable. Anyway, this is still not a final version, so let’s compare the final result.
  2. While the function findRepeatedSubsequences is a kind of client function, I don’t like that std::cout (or any other stream) is built in here. Of course, I could return a new RepeatedSubsequences object from this function, which would collect all the results, and overload operator << for this type in order to be able to output it to any stream in the usual manner, but this would additionally complicate the code and would have a performance and memory cost for storing these results.

Here is how the function looks now:

void findRepeatedSubsequences(const Index<>& index, size_t max_repetitions, bool strong_repetition_count = true)
{
    std::vector<index_t> adjacent_matches(index.size());
    for (size_t repetitions = 2; repetitions <= max_repetitions; ++repetitions) {
        size_t sequence_length = calculateAdjacents(index, adjacent_matches, repetitions, strong_repetition_count);
        if (sequence_length < 1)
            continue;
        std::cout << "Longest equal sequence(s) repeated " << repetitions << " times is " << sequence_length << " chars.\n";
        for (auto sequence : collectRepeatingSubstrings(index, adjacent_matches, repetitions, sequence_length)) {
            std::cout << "The sequence at position(s): (";
            const char* delimiter = "";
            for (auto pos : sequence.pos) {
                std::cout << delimiter << pos;
                delimiter = ",";
            }
            std::cout << "): ";
            std::copy(sequence.sequence.begin(), sequence.sequence.end(), std::ostream_iterator<char>(std::cout));
            std::cout << "\n";
        }
        std::cout << "\n";
    }
}

If there is a way to make these functions generic without exposing them in header files, it would be nice to know. If not, and they go into header files, there is another question. findMatchLengthWithOffsetInSuffixArray is not generic: I can’t explain to a user what it does outside the context of this task, and the same goes for calculateAdjacents and collectRepeatingSubstrings. What is the best way to hide them from the client? Create a nested utility namespace? Wrap everything in a class that holds the results, supports operator <<, and provides encapsulation? Other methods?
