Code Review

On regex performance

Natural language text fast tokenizer

Could you please review the code below and suggest some improvements?

Functional specification

Implement a function for fast tokenization of text in a char[] buffer, handling the natural-language specifics below:

  1. Treat ' ' (space) as a delimiter, keeping a way to extend the list of delimiters later.
  2. Extract stable collocations like "i.e.", "etc.", and "..." as a single lexeme.
  3. If a word contains characters like '-' and ''' (examples: semi-colon, cat's), return the whole construct as a single lexeme.
  4. Return sequences of digits (unsigned integers) as a single lexeme.

Performance is critical, since the amount of data is huge. The function must be thread-safe.

Design

Since performance is critical, the function works with raw pointers to const char. It takes a reference to the pointer marking where parsing should start, and it updates this pointer, moving it to the position just past the lexeme it reads.

Since the initial characters could be delimiters, the function returns the real starting position of the lexeme found.

Concerns on the current implementation

Most likely the regex library should be used for tasks like this, but I am not sure this "write-only" language (for me, at least: I write it once, then cannot read or maintain it, and rewrite it from scratch every time) will be extendable when new requirements come. If I am wrong, I would be thankful for a maintainable regex version.

Another concern with regex is performance. The code is expected to work with locales, and this could be quite slow if the underlying regex implementation relies on isalpha, etc.

I feel that this could be implemented more simply with std::ranges, so if you have any suggestions, please share.

I feel that the main loop could be simplified and the nested loop at the end removed, but I can't find a better solution for now.

The code

The fully functional demo

#include <algorithm>
#include <cctype>
#include <cstring>
#include <iostream>
#include <string>
#include <vector>

// Returns the lexeme start pointer, or nullptr if no lexeme was found.
// Moves the passed pointer to the position past the lexeme.
inline const char* get_lexem(const char*& p)
{
    static const std::vector delimiters = { ' ' }; // Could be extended to many different delimiters
    static const std::vector<const char*> stable_lexems = { "i.e.", "etc.", "..." }; // Planned to be externally configurable
    static const std::vector<char> inword_lexems = { '-', '\'' }; // Not sure how to process this better

    const char* start = p;
    while (*p && p == start) {
        while (delimiters.end() != std::find(delimiters.begin(), delimiters.end(), *p)) {
            ++p;
            if (!*p)
                return nullptr;
        }
        auto it = std::find_if(stable_lexems.begin(), stable_lexems.end(), [&](const char* lexem) {
            size_t length = std::strlen(lexem);
            return !std::strncmp(p, lexem, length);
        });
        start = p;
        if (it != stable_lexems.end()) {
            p += std::strlen(*it);
            return start;
        }
        while (*p && (delimiters.end() == std::find(delimiters.begin(), delimiters.end(), *p))) {
            const bool is_inword_char = inword_lexems.end() != std::find(inword_lexems.begin(), inword_lexems.end(), *p);
            // Cast to unsigned char: passing a negative char to isalpha/isdigit is UB.
            if (is_inword_char && p != start && std::isalpha(static_cast<unsigned char>(*(p - 1)))) {
                ++p;
                continue;
            }
            if (!std::isalpha(static_cast<unsigned char>(*p)) && !std::isdigit(static_cast<unsigned char>(*p))) {
                if (p == start) {
                    ++p;
                }
                break;
            }
            ++p;
        }
    }
    return start;
}

int main()
{
    std::string sample = "Let's consider this semi-simple sample, i.e. test data with ints: 100, etc. For ... some testing...";
    const char* lexem = nullptr;
    const char* lexem_end = sample.c_str();
    while (true) {
        lexem = get_lexem(lexem_end);
        if (!(lexem && lexem != lexem_end))
            break;
        std::string token(lexem, lexem_end - lexem);
        std::cout << token << "\n";
    }
}
