Could you please conduct a code review of the code below and suggest some improvements?
Functional specification
Implement a function for fast tokenization of text in a char[] buffer, handling the natural-language specifics below:
- Treat ' ' (space) as a delimiter, keeping a way to extend the list of delimiters later.
- Extract stable collocations like "i.e.", "etc.", "..." as a single lexeme.
- If a word contains characters like '-' and '\'' (examples: semi-colon, cat's), return the whole construct as a single lexeme.
- Return sequences of digits (integers without signs) as a single lexeme.
Performance is critical, since the amount of data is huge. The function must be thread-safe.
Design
Since performance is critical, the function works with raw pointers to const char. It takes a reference to a pointer to the position where parsing should start, and it updates that pointer, moving it past the lexeme it read.
Since the initial characters could be delimiters, the function returns the real starting position of the lexeme found.
Concerns on the current implementation
Most likely a regex library should be used for such tasks, but I am not sure this "write-only" language will stay extendable when new requirements come (for me at least: I write a regex once, then can't read or maintain it, and end up rewriting it from scratch every time). If I am wrong, I would be thankful for a maintainable regex version.
Another concern about regex is performance. The code is expected to work with locales, and this could be quite slow if the underlying regex implementation somehow uses isalpha, etc.
I feel that this could be implemented more simply with std::ranges, so if you have any suggestions, please share.
I also feel that the main loop could be simplified and the nested loop at the end removed, but I can't find a better solution for now.
With all these static vectors and the intention to make this configurable and extendable in the future, I am in two minds whether to turn it into a class or a namespace, so that delimiters, stable lexemes, in-word lexemes, etc. can be configured. Would that be extensibility or over-engineering?
The code
The fully functional demo
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>
#include <string.h>
// Returns lexeme start point or nullptr if no lexeme found.
// Moves the passed pointer to the position past the lexeme.
inline const char* get_lexem(const char*& p)
{
 const static std::vector delimiters = { ' ' }; // Could be extended to many different delimiters
 const static std::vector<const char*> stable_lexems = { "i.e.", "etc.", "..." }; // Planned to be externally configurable
 const static std::vector<char> inword_lexems = { '-', '\'' }; // Not sure how to process this better
 const char* start = p;
 while (*p && p == start) {
 while (delimiters.end() != std::find(delimiters.begin(), delimiters.end(), *p)) {
 ++p;
 if (!*p)
 return nullptr;
 }
 auto it = std::find_if(stable_lexems.begin(), stable_lexems.end(), [&](const char* lexem) {
 size_t length = strlen(lexem);
 return !strncmp(p, lexem, length);
 });
 start = p;
 if (it != stable_lexems.end()) {
 p += strlen(*it);
 return start;
 }
 while (*p && (delimiters.end() == std::find(delimiters.begin(), delimiters.end(), *p))) {
 const bool is_inword_char = inword_lexems.end() != std::find(inword_lexems.begin(), inword_lexems.end(), *p);
 if (is_inword_char && p != start && isalpha(*(p - 1))) {
 ++p;
 continue;
 }
 if (!isalpha(*p) && !isdigit(*p)) {
 if (p == start) {
 ++p;
 }
 break;
 }
 ++p;
 }
 }
 return start;
}
int main()
{
 const char sample[] = "Let's consider this semi-simple sample, i.e. test data with ints: 100, etc. For ... some testing...";
 const char* lexem = nullptr;
 const char* lexem_end = sample;
 while (true) {
 lexem = get_lexem(lexem_end);
 if (!(lexem && lexem != lexem_end))
 break;
 std::string token(lexem, lexem_end - lexem);
 std::cout << token << "\n";
 }
}
2 Answers
Avoid using char*
C++ provides much nicer and safer ways to deal with strings and string slices than C's char*. Before C++17, I recommend you just use std::string where possible, although this might cause some unnecessary copies to be made. Luckily, since C++17 we have std::string_view; this is just a view of another string, it doesn't hold a copy itself.
Make it work like a range
Wouldn't it be nice if the code in main() could be written like this?
int main() {
 std::string sample = "Let's consider...";
 for (auto token: tokenize(sample)) {
 std::cout << token << '\n';
 }
}
You can do this by making a function tokenize() that returns a class with two functions, begin() and end(), that both return a token iterator. Here is a possible way to do this:
class TokenRange {
 std::string_view data;
public:
 class Iterator {
 std::string_view data;
 public:
 Iterator(std::string_view data = {}): data(data) {}
 Iterator& operator++();
 std::string_view operator*() const;
 friend bool operator==(const Iterator&, const Iterator&) = default;
 };
 TokenRange(std::string_view data): data(data) {}
 Iterator begin() {
 return Iterator(data);
 }
 Iterator end() {
 return {};
 }
};
So now if you write tokenize(sample).begin(), you get a TokenRange::Iterator whose member data is a view of the whole string sample. Now if you dereference that iterator using *, you expect it to return the value of the first item in the range, or in this case, the first token. That's most of what you do in get_lexem(). So get_lexem() can be turned into TokenRange::Iterator::operator*().
Of course, operator++() will probably be called soon afterwards, so you want to make sure you don't duplicate most of operator*() just to know how much you have to skip over. Find some way to only have to scan for a token once.
While std::string_view is more than just a pointer, and you might even need more member variables in Iterator, any decent compiler will inline all these things and optimize them away.
Making a standards-compliant iterator is a little bit more work. Note that std::iterator is deprecated since C++17, so rather than inheriting from it, define the member type aliases (iterator_category, value_type, difference_type, and so on) directly. Also see this tutorial.
Consider making use of more recent C++ features
Instead of std::find(delimiters.begin(), delimiters.end(), *p), you could write std::ranges::find(delimiters, *p) (C++20). Or even better, with C++23's std::ranges::contains:
while (std::ranges::contains(delimiters, *p)) {
 ++p;
}
if (!*p) {
 ...
}
Use std::string_view's compare() member function instead of strncmp(), and of course size() instead of strlen(). Or if you just want to check whether a string begins with a given text, use starts_with().
Going full std::ranges
It sounds like this should be a problem that could be solved with std::ranges. Something like:
auto tokens = data
 | std::views::lazy_split(delimiters)
 | ...;
However, your definition of a token is complex enough that writing it out using just views would either result in some horrible-looking code, or it might be very inefficient. If I were to go this route, I would start with a function that maps each character to an enum describing which class it belongs to: delimiter, in-word lexeme, alphanumeric, or other. Then you can use other views to split on delimiters, and something like std::views::adjacent_transform() to eliminate in-word lexemes at the start of a token. Still, checking for stable lexemes seems to be hard.
I would probably keep the structure of get_lexem(). However, just like you can make a tokenize() function that returns a range, you could make your own tokenize_view() so you can write:
auto tokens = data | tokenize_view();
The main advantage of that would be that you could then use that inside a larger pipeline of views.
- Thank you so much for your help here. I was really surprised how fast you've done this code review; it was like you had this implemented before :). I have updated the code according to your recommendations and posted Rev 2. Could you please check if my understanding is correct? And about char* vs std::string_view, shame on me, I stepped on this for the second time already; you caught this on previous reviews too. My concern was performance, but now I see (in Rev 2) that this is not the case. — Damir Tenishev, Feb 11, 2024 at 16:21
- To keep you posted, I've published Rev 3. Could you please take a look? — Damir Tenishev, Feb 13, 2024 at 13:02
Prefer <cstring> to <string.h>
The C compatibility headers should only be used for extern "C" interfaces. For C++ programs, use the C++ headers, which definitely declare identifiers in the std namespace (e.g. std::size_t, std::strncmp, std::strlen).
Similarly, we should include <cctype> to declare std::isalpha and std::isdigit. That said, for a parser like this we probably want to use the <locale> versions of these functions, passing std::locale::classic() to avoid using the unknown global locale. That would neatly avoid the bug we have here, where we pass these functions a plain char rather than the appropriate positive int.
- Thank you for the input. On your points: (1) C++ headers (including <cctype>) are indeed better and I will follow, but for the moment, at least with MSVS 2022, it is enough to include <iostream> here, and with the long mess in these headers you get all of them anyway; so this only makes sense with lightweight headers. In my case <string.h> is a legacy from the times when only the CRT existed; I will remove it, of course. (2) Taking into account the way both size_t and std::size_t are defined, I would stay with the shorter version for now. — Damir Tenishev, Feb 11, 2024 at 12:54
- (3) Not sure what you mean by passing the locale (std::locale::classic()); do you mean passing it as a second argument to the calls? (4) Where do you see a bug with passing char instead of int? Don't implicit conversions exist? — Damir Tenishev, Feb 11, 2024 at 12:55
- The <cctype> functions take a positive int, even on platforms where char is signed. It's a bug to pass a negative char to these functions, as it will be widened to a negative int. Instead we need to convert to unsigned char so that widening to int produces a positive value. And yes, the <locale> functions such as std::isalpha() take the locale as a second argument (and you can pass plain char as the first argument, unlike the corresponding <cctype> functions). — Toby Speight, Feb 11, 2024 at 13:25
- Thank you for the explanations on <cctype>; I will be on guard here. The problem with the second argument for the locale I covered here (mentioned in the question, and you participated in the discussion); it is quite slow, so I will use fast_tolower as a wrapper, following your advice in both topics. I was in two minds whether I needed to overcomplicate the code here with this. — Damir Tenishev, Feb 11, 2024 at 13:37
- Thank you so much for your help here. I've updated the code taking into account your recommendations and posted Rev 2. Could you please revisit it? — Damir Tenishev, Feb 11, 2024 at 16:23
- std::string and char*: your sample is declared as std::string; why not keep it aligned with const char*?