Return to Revisions

1 of 6

asked Jan 25, 2024 at 21:56

Fast search for the longest repeating substrings in large text (Rev.3)

This is the third iteration of the Fast search for the longest repeating substrings in large text (Rev.2) code review. Special thanks goes to G. Sliepen who conducted the first two reviews.

Functional specification

Having a large text and already built suffix array find the longest substrings (limited by the number of repetitions) getting benefit from already existing prefix array, if reasonable. Provide the version which would benefit from multithreading keeping the code as simple as possible.

Implementation

The fully functional demo.

Reservation: current implementation utilizes execution policies, so using ranges/projections/views under the question; would work if multi-threading requirement from above met. The further details make sense only as addition to aforementioned second Code Review with answer to it.

Major changes

Since in real project I use lexem identifiers instead of plain text, I did some "emulation" in code for this, making pseudo tokenizer which for the sake of simplicity just converts characters’ codes to ints. This is just a trick to go to real types I use and move from std::string and std::string_views. Another step to make algorithm generic, meanwhile.
Function parameters text and index replaced with Index template which will be covered below.
"Subtle" defect for the "this number or more" version ("weak count") mentioned by G. Sliepen has been fixed in the collectRepeatingSubstrings function.
The max_repetitions threshold removed; it is not needed even on large data sets, no significant performance impact or memory footprint on real data.
The collectRepeatingSubstrings function reworked according recommendations. New version is much more readable.
The findMatchLength function inlined to findMatchLengthWithOffsetInSuffixArray to avoid swamping the namespace.
Introduced option strong_repetition_count to control the way the algorithm handles repetitions parameter: "find substring with exact number of repetitions" = "strong count" and "find substring with this number of repetitions or more" = "weak count".

My concerns on this code

I started with a dozen lines of code and now it grew to 175 lines of code most of which is still not reusable. I had one function, now I have 2 new types and 4 functions. Well, some defects were fixed which complicated the code a bit, but mostly I paid for functional decomposition. I am still in two minds that the price is reasonable. Anyway, this is still not a final version, so, let’s compare the final result.
While the function findRepeatedSubsequences is a kind of client function, I don’t like that std::cout (or any other stream) is build-in here. Of course, I can return new object RepeatedSubsequences from this function which will collect all the results and overload the operator << for this type in order to be able to output it in any stream in usual manner, but this will additionally complicate the code and will have performance and memory footprint on storing these results.

Index class

I had a special homework from G. Sliepen to figure out the name for the structure to wrap up references to text and index.

My way of thinking is following:

I really need only the index and it would be nice if reference to text is stored somehow in it.
It would be nice to have this index type transparent for the client code in order to avoid client code like index.index as much as possible. (I still have it in std::data(index.index), which could easily be replaced with index.begin() in this context, but I eager to learn how to handle this case, as well.

Before these code reviews I raised the question [The best place to store index to data]3 where I didn’t get better proposals than my way to wrap up the references to index and indexed text into Index structure:

template <typename LexemType=lexem_t, typename IndexType = index_t>
struct Index {
 const std::vector<LexemType>& text;
 const std::vector<IndexType>& index;
public:
 decltype(index.begin()) begin() const { return index.begin(); }
 decltype(index.end()) end() const { return index.end(); }
 size_t size() const { return index.size(); }
 const IndexType& operator[] (size_t idx) const { return index[idx]; }
};

Some reservations: I made it a structure to avoid excessive accessors. I don’t think they would help much here. Of course, having text and index public makes redundant exposing begin(), end(), size() and operator[]; I made it to simplify client code.

Other code snippets

Since I moved from std::string to std::vector to keep the sequences, I reworked SubstringOccurences to:

struct SubstringOccurences {
 std::span<const lexem_t> sequence;
 std::vector<index_t> pos;
};

I hope, this is the correct way and place to use std::span.

Generalization

One of the questions from G. Sliepen was "Why limit it to strings at all? It seems to me that you could make this work for any kind of range.".

The only reason I am keeping this not generic is that I want to keep implementation in source files, not in headers in order to maximize insulation (physical encapsulation), reduce compile time and keep visible interfaces free of utility functions.

If there is a way to make them generic without exposing them to header files, it would be nice to know. If not and it goes to header files, there is another question. The findMatchLengthWithOffsetInSuffixArray is not generic, I can’t explain to user what it does without the context of the task, as well as calculateAdjacents and * collectRepeatingSubstrings*. What is the best way to hide them from client? Create nested utility namespace? Wrap up everything in class which will be able to hold the results and support operator << and will provide encapsulation? Other methods?

Is there any way to improve this code? Any mistakes or issues? Could you please help conducting further steps of code review?

asked Jan 25, 2024 at 21:56

Damir Tenishev

lang-cpp