Below is a function template that tokenizes a given std::basic_string_view using a given set of delimiter characters and assigns the tokens to a buffer (via a std::span).
Its main advantage is that it doesn't construct and return a vector<string_view>. Instead, it assigns each found token to an element of the span parameter. The caller has to tell the function how many tokens they expect by passing a span of exactly that size (say I expect 5 tokens to be found, then I need to pass a span of size 5). This also ensures that the function does not overrun the underlying buffer by writing excess tokens past its end (which would be UB).
This obviously has some niche use cases.
Here is the code:
#include <string_view>
#include <span>
#include <array>
#include <limits>
#include <cstddef>
#include <fmt/core.h>

#define METHOD 1

using std::size_t;
template < typename CharT,
           class Traits = std::char_traits<CharT> >
[[ nodiscard ]] size_t constexpr
tokenize( const std::basic_string_view<CharT, Traits> str,
          const std::basic_string_view<CharT, Traits> delimiter,
          const std::span< std::basic_string_view<CharT, Traits> > found_tokens_OUT ) noexcept
{
    auto found_tokens_count { 0uz };
    size_t start { str.find_first_not_of( delimiter ) };
    size_t end { };

#if METHOD == 1
    for ( auto idx { 0uz };
          start != std::string_view::npos && idx < std::size( found_tokens_OUT ); ++idx )
    {
        end = str.find_first_of( delimiter, start );
        found_tokens_OUT[ idx ] = str.substr( start, end - start );
        ++found_tokens_count;
        start = str.find_first_not_of( delimiter, end );
    }
#else
    for ( auto&& token : found_tokens_OUT )
    {
        if ( start == std::basic_string_view<CharT, Traits>::npos ) break;

        end = str.find_first_of( delimiter, start );
        token = str.substr( start, end - start );
        ++found_tokens_count;
        start = str.find_first_not_of( delimiter, end );
    }
#endif

    if ( start == std::basic_string_view<CharT, Traits>::npos )
        return found_tokens_count;
    else
        return found_tokens_count = std::numeric_limits<size_t>::max( );
}
int main( )
{
    using std::string_view_literals::operator""sv;

    const auto str { "1 % "sv };
    constexpr auto delimiter { " \t"sv };
    std::array<std::string_view, 5> tokens;

    const auto token_count { tokenize( str, delimiter, { std::begin( tokens ), 2 } ) };

    if ( token_count != std::numeric_limits<size_t>::max( ) )
    {
        for ( auto idx { 0uz }; idx < token_count; ++idx )
        {
            fmt::print( "{} ", tokens[ idx ] );
        }
    }

    fmt::print( "\nCount: {}\n", token_count );
}
Questions about API design:
- Is this function easy to use and hard to misuse? For example, the user has to construct and pass the 3rd argument (the span) with care about its size, because the function will expect to find span.size() tokens.
- If the function finds at least one more token than the user expected (i.e. more than the size of the span), it stops the tokenization process and returns a sentinel value (the maximum value of size_t).
Questions about implementation:
- Inside the body, I have written two different loops that do exactly the same thing. Which one is better? I find the second one (i.e. METHOD 2) more readable.
- Can either of the loops be simplified further?
And finally, what else could be improved?
1 Answer
The span for output is somewhat inflexible:
- it requires the caller to anticipate the number of results, and
- it requires contiguous storage, restricting the implementation.
It's probably better to make the function return an Input Range, enabling more flexible calling (e.g. I can then create a linked list of the results without overhead by using std::ranges::copy(tokenize(...), std::back_inserter(my_list)), or I could use a View to filter the results).
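As a rough illustration of that idea (a minimal sketch of my own, not the exact interface the answer has in mind), such a tokenizer can be composed from the standard range adaptors. Note that std::views::split treats its pattern as a whole subsequence, so this sketch splits on a single delimiter character rather than on a set of delimiter characters like the original; it also relies on the C++23 split semantics that yield contiguous subranges.

    #include <algorithm>
    #include <iterator>
    #include <list>
    #include <ranges>
    #include <string_view>

    // Sketch only: yields each token lazily as a std::string_view; no container is built.
    [[nodiscard]] auto tokenize_view( std::string_view str, char delimiter )
    {
        return str
             | std::views::split( delimiter )
             | std::views::filter( []( auto&& tok ) { return !std::ranges::empty( tok ); } )
             | std::views::transform( []( auto&& tok )
                 { return std::string_view( tok.begin( ), tok.end( ) ); } );
    }

    int main( )
    {
        // The caller chooses the storage, e.g. a linked list, with no intermediate vector.
        std::list<std::string_view> my_list;
        std::ranges::copy( tokenize_view( "1  %  ", ' ' ), std::back_inserter( my_list ) );  // "1", "%"
    }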
An alternative that's more like the current approach would be to accept an Output Range - possibly an infinite range such as a stream output range or a collection inserter.
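For instance (again just a sketch under names of my own choosing, not code from the answer), writing through a caller-supplied output iterator already removes the fixed-size restriction:

    #include <cstddef>
    #include <iterator>
    #include <string_view>

    // Sketch only: shown for the plain char case, templated on the output iterator.
    // The caller supplies std::back_inserter, an inserter into a list, an ostream
    // iterator, etc., so no token count has to be guessed up front.
    template < std::output_iterator<std::string_view> Out >
    constexpr std::size_t tokenize_to( std::string_view str, std::string_view delimiter, Out out )
    {
        std::size_t count { };
        auto start = str.find_first_not_of( delimiter );
        while ( start != std::string_view::npos )
        {
            const auto end = str.find_first_of( delimiter, start );
            *out++ = str.substr( start, end - start );
            ++count;
            start = str.find_first_not_of( delimiter, end );
        }
        return count;
    }

    // Usage:
    //   std::vector<std::string_view> tokens;
    //   tokenize_to( "1 % ", " \t", std::back_inserter( tokens ) );  // tokens: "1", "%"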
Without that change, I think it would be more useful to operate like std::snprintf() by returning the number of tokens that would have been produced if the output range were large enough. That allows callers to call again with a resized buffer if they need to.
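A minimal sketch of that contract (my own reading of the suggestion, shown for the plain char case only):

    #include <cstddef>
    #include <span>
    #include <string_view>

    // Sketch only: the loop keeps counting after the span is full, so the return value
    // may exceed found_tokens_OUT.size() -- the caller can detect truncation and retry
    // with a larger buffer, just as with std::snprintf.
    [[nodiscard]] constexpr std::size_t
    tokenize( std::string_view str, std::string_view delimiter,
              std::span<std::string_view> found_tokens_OUT ) noexcept
    {
        std::size_t found_tokens_count { };
        auto start = str.find_first_not_of( delimiter );
        while ( start != std::string_view::npos )
        {
            const auto end = str.find_first_of( delimiter, start );
            if ( found_tokens_count < found_tokens_OUT.size( ) )
                found_tokens_OUT[ found_tokens_count ] = str.substr( start, end - start );
            ++found_tokens_count;   // counted even when it no longer fits in the span
            start = str.find_first_not_of( delimiter, end );
        }
        return found_tokens_count;
    }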
I prefer the version with the range-based for, but that's probably irrelevant given that changes to the interface could well make that look completely different.
A minor observation: this assignment is pointless, since found_tokens_count is going out of scope anyway:

    return found_tokens_count = std::numeric_limits<size_t>::max( );

That can be replaced with simply:

    return std::numeric_limits<size_t>::max();
Comments:
- That's surely a lot of code, isn't it? Do you know where I should look if I want to go down that road? – digito_evo, May 24, 2023
- It's more code than you have, sure. I'd guess that the extra code is probably worth its weight in extra usability, but that's obviously a subjective matter. – Toby Speight, May 24, 2023
- It's interesting enough that I'm writing a version based on my suggestions. If I get time to complete it, I'll post it for review. – Toby Speight, May 25, 2023
- Tokeniser which yields a Range of string views. – Toby Speight, May 25, 2023
- […] a string_view on each iteration. Let the user figure out how to store it, or just perform an action on each element.