Could you please conduct a code review of the code below and suggest some improvements?
Functional specification
Implement a function for fast tokenization of text in a char[] buffer, handling the natural-language specifics below:
- Treat ' ' (space) as a delimiter, keeping a way to extend the list of delimiters later.
- Extract stable collocations like "i.e.", "etc.", "..." as a single lexeme.
- If a word contains characters like '-' and '\'' (examples: semi-colon, cat's), return the whole construct as a single lexeme.
- Return sequences of digits (integers without signs) as a single lexeme.
Performance is critical, since the amount of data is huge. The function must be thread-safe.
Design
Since performance is critical, the function works with raw pointers to const char. It takes a reference to a pointer to the position where parsing should start, and it updates that pointer, moving it past the lexeme it read.
Since the initial characters could be delimiters, the function returns the real starting position of the lexeme found.
Concerns on the current implementation
Most likely a regex library should be used for such tasks, but I am not sure this "write-only" language will stay extendable when new requirements come (for me at least: I write a regex once, then can't read or maintain it, and end up rewriting it from scratch every time). If I am wrong, I would be thankful for a maintainable regex version.
Another concern about regex is performance. The code is expected to work with locales, and this could be quite slow if the underlying regex implementation somehow uses isalpha, etc.
I feel that this could be implemented more simply with std::ranges, so if you have any suggestions, please share.
I also feel that the main loop could be simplified and the nested loop at the end removed, but I can't find a better solution for now.
With all these static vectors and the intention to make this configurable and extendable in the future, I am in two minds whether to turn it into a class or a namespace, so that delimiters, stable lexemes, in-word lexemes, etc. can be configured. Would that be extensibility or over-engineering?
The code
The fully functional demo
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
#include <string.h>
// Returns lexeme start point or nullptr if no lexeme found.
// Moves the passed pointer to the position past the lexeme.
inline const char* get_lexem(const char*& p)
{
 const static std::vector delimiters = { ' ' }; // Could be extended to many different delimiters
 const static std::vector<const char*> stable_lexems = { "i.e.", "etc.", "..." }; // Planned to be externally configurable
 const static std::vector<char> inword_lexems = { '-', '\'' }; // Not sure how to process this better
 const char* start = p;
 while (*p && p == start) {
 while (delimiters.end() != std::find(delimiters.begin(), delimiters.end(), *p)) {
 ++p;
 if (!*p)
 return nullptr;
 }
 auto it = std::find_if(stable_lexems.begin(), stable_lexems.end(), [&](const char* lexem) {
 size_t length = strlen(lexem);
 return !strncmp(p, lexem, length);
 });
 start = p;
 if (it != stable_lexems.end()) {
 p += strlen(*it);
 return start;
 }
 while (*p && (delimiters.end() == std::find(delimiters.begin(), delimiters.end(), *p))) {
 const bool is_inword_char = inword_lexems.end() != std::find(inword_lexems.begin(), inword_lexems.end(), *p);
 if (is_inword_char && p != start && isalpha(*(p - 1))) {
 ++p;
 continue;
 }
 if (!isalpha(*p) && !isdigit(*p)) {
 if (p == start) {
 ++p;
 }
 break;
 }
 ++p;
 }
 }
 return start;
}
int main()
{
 const char sample[] = "Let's consider this semi-simple sample, i.e. test data with ints: 100, etc. For ... some testing...";
 const char* lexem = nullptr;
 const char* lexem_end = sample;
 while (true) {
 lexem = get_lexem(lexem_end);
 if (!(lexem && lexem != lexem_end))
 break;
 std::string token(lexem, lexem_end - lexem);
 std::cout << token << "\n";
 }
}
2 Answers
Avoid using char*
C++ provides much nicer and safer ways to deal with strings and string slices than C's char*. Before C++17, I recommend you just use std::string where possible, although this might cause some unnecessary copies to be made. Luckily, since C++17 we have std::string_view; this is just a view of another string, it doesn't hold a copy itself.
Make it work like a range
Wouldn't it be nice if the code in main() could be written like this?
int main() {
 std::string sample = "Let's consider...";
 for (auto token: tokenize(sample)) {
 std::cout << token << '\n';
 }
}
You can do this by making a function tokenize() that returns a class with two functions, begin() and end(), that both return a token iterator. Here is a possible way to do this:
class TokenRange {
 std::string_view data;
public:
 class Iterator {
 std::string_view data;
 public:
 Iterator(std::string_view data = {}): data(data) {}
 Iterator& operator++();
 std::string_view operator*() const;
 friend bool operator==(const Iterator&, const Iterator&) = default;
 };
 TokenRange(std::string_view data): data(data) {}
 Iterator begin() {
 return Iterator(data);
 }
 Iterator end() {
 return {};
 }
};
So now if you write tokenize(sample).begin(), you get a TokenRange::Iterator whose member data is a view of the whole string sample. Now if you dereference that iterator using *, you expect it to return the value of the first item in the range, or in this case, the first token. That's most of what you do in get_lexem(). So get_lexem() can be turned into TokenRange::Iterator::operator*().
Of course, operator++() will probably be called soon afterwards, so you want to make sure you don't duplicate most of operator*() just to know how much you have to skip over. Find some way to only have to scan for a token once.
While std::string_view is more than just a pointer, and you might even need more member variables in Iterator, any decent compiler will inline all these things and optimize them away.
Making a standards-compliant iterator is a little bit more work. Note that std::iterator is deprecated since C++17, so rather than inheriting from it, define the member type aliases (iterator_category, value_type, difference_type, and so on) directly. Also see this tutorial.
Consider making use of more recent C++ features
Instead of std::find(delimiters.begin(), delimiters.end(), *p), you could write std::ranges::find(delimiters, *p) (C++20). Or even better, with C++23's std::ranges::contains:
while (std::ranges::contains(delimiters, *p)) {
 ++p;
}
if (!*p) {
 ...
}
Use std::string_view's compare() member function instead of strncmp(), and of course size() instead of strlen(). Or if you just want to check whether a string begins with a given text, use starts_with().
Going full std::ranges
It sounds like this should be a problem that could be solved with std::ranges. Something like:
auto tokens = data
 | std::views::lazy_split(delimiters)
 | ...;
However, your definition of a token is complex enough that writing it out using just views would either result in some horrible-looking code, or it might be very inefficient. If I were to go this route, I would start with a function that maps each character to an enum describing which class it belongs to: delimiter, in-word lexeme, alphanumeric, or other. Then you can use other views to split on delimiters, and something like std::views::adjacent_transform() to eliminate in-word lexemes at the start of a token. Still, checking for stable lexemes seems to be hard.
I would probably keep the structure of get_lexem(). However, just like you can make a tokenize() function that returns a range, you could make your own tokenize_view() so you can write:
auto tokens = data | tokenize_view();
The main advantage of that would be that you could then use that inside a larger pipeline of views.
- Thank you so much for your help here. I was really surprised how fast you've done this code review; it was like you had this implemented before :). I have updated the code according to your recommendations and posted Rev 2. Could you please check if my understanding is correct? And about char* vs std::string_view, shame on me, I stepped on this for the second time already; you caught this on previous reviews too. My concern was performance, but now I see (in Rev 2) that this is not the case. — Damir Tenishev, Feb 11, 2024 at 16:21
- To keep you posted, I've published Rev 3. Could you please take a look? — Damir Tenishev, Feb 13, 2024 at 13:02
Prefer <cstring> to <string.h>
The C compatibility headers should only be used for extern "C" interfaces. For C++ programs, use the C++ headers, which definitely declare identifiers in the std namespace (e.g. std::size_t, std::strncmp, std::strlen).
Similarly, we should include <cctype> to declare std::isalpha and std::isdigit. That said, for a parser like this we probably want to use the <locale> versions of these functions, passing std::locale::classic() to avoid using the unknown global locale. That would neatly avoid the bug we have here, where we pass these functions a plain char rather than the appropriate positive int.
- Thank you for the input. On your points: (1) C++ headers (including <cctype>) are indeed better and I will follow, but for the moment, at least with MSVS 2022, it is enough to include <iostream> here, and with the long mess in these headers you get all of them anyway; so this only makes sense with lightweight headers. In my case <string.h> is a legacy from the times when only the CRT existed; I will remove it, of course. (2) Taking into account the way both size_t and std::size_t are defined, I would stay with the shorter version for now. — Damir Tenishev, Feb 11, 2024 at 12:54
- (3) Not sure what you mean by passing the locale (std::locale::classic()); do you mean passing it as a second argument to the calls? (4) Where do you see a bug with passing char instead of int? Don't implicit conversions exist? — Damir Tenishev, Feb 11, 2024 at 12:55
- The <cctype> functions take a positive int, even on platforms where char is signed. It's a bug to pass a negative char to these functions, as it will be widened to a negative int. Instead we need to convert to unsigned char so that widening to int produces a positive value. And yes, the <locale> functions such as std::isalpha() take the locale as a second argument (and you can pass plain char as the first argument, unlike the corresponding <cctype> functions). — Toby Speight, Feb 11, 2024 at 13:25
- Thank you for the explanations on <cctype>; I will be on guard here. The problem with the second argument for the locale I covered here (mentioned in the question, and you participated in the discussion); it is quite slow, so I will use fast_tolower as a wrapper, following your advice in both topics. I was in two minds whether I needed to overcomplicate the code here with this. — Damir Tenishev, Feb 11, 2024 at 13:37
- Thank you so much for your help here. I've updated the code taking into account your recommendations and posted Rev 2. Could you please revisit it? — Damir Tenishev, Feb 11, 2024 at 16:23
- std::string and char*: your sample is declared as std::string; why not keep it aligned with const char*?