Below is a function template that tokenizes a given std::basic_string_view using a given set of delimiter characters and assigns the tokens to a buffer (via a std::span).
Its main advantage is that it doesn't construct and return a vector<string_view>. Instead, it assigns each found token to an element of the span parameter. The caller has to tell the function how many tokens they expect by passing a span of exactly that size (say I expect 5 tokens to be found, then I need to pass a span of size 5). This also ensures that the function does not overrun the underlying buffer by writing excess tokens past its end (which would be UB).
This obviously has some niche use cases.
Here is the code:
#include <string_view>
#include <span>
#include <array>
#include <limits>
#include <cstddef>
#include <fmt/core.h>

#define METHOD 1

using std::size_t;
template < typename CharT,
           class Traits = std::char_traits<CharT> >
[[ nodiscard ]] size_t constexpr
tokenize( const std::basic_string_view<CharT, Traits> str,
          const std::basic_string_view<CharT, Traits> delimiter,
          const std::span< std::basic_string_view<CharT, Traits> > found_tokens_OUT ) noexcept
{
    auto found_tokens_count { 0uz };
    size_t start { str.find_first_not_of( delimiter ) };
    size_t end { };

#if METHOD == 1
    for ( auto idx { 0uz };
          start != std::string_view::npos && idx < std::size( found_tokens_OUT ); ++idx )
    {
        end = str.find_first_of( delimiter, start );
        found_tokens_OUT[ idx ] = str.substr( start, end - start );
        ++found_tokens_count;
        start = str.find_first_not_of( delimiter, end );
    }
#else
    for ( auto&& token : found_tokens_OUT )
    {
        if ( start == std::basic_string_view<CharT, Traits>::npos ) break;

        end = str.find_first_of( delimiter, start );
        token = str.substr( start, end - start );
        ++found_tokens_count;
        start = str.find_first_not_of( delimiter, end );
    }
#endif

    if ( start == std::basic_string_view<CharT, Traits>::npos )
        return found_tokens_count;
    else
        return found_tokens_count = std::numeric_limits<size_t>::max( );
}
int main( )
{
    using std::string_view_literals::operator""sv;

    const auto str { "1 % "sv };
    constexpr auto delimiter { " \t"sv };
    std::array<std::string_view, 5> tokens;

    const auto token_count { tokenize( str, delimiter, { std::begin( tokens ), 2 } ) };

    if ( token_count != std::numeric_limits<size_t>::max( ) )
    {
        for ( auto idx { 0uz }; idx < token_count; ++idx )
        {
            fmt::print( "{} ", tokens[ idx ] );
        }
    }

    fmt::print( "\nCount: {}\n", token_count );
}
Questions about API design:
- Is this function easy to use and hard to misuse? For example, the user has to construct and pass the 3rd argument (the span) with care about its size, because the function will expect to find span.size() tokens.
- If the function finds at least one more token than the user expected (i.e. more than the size of the span), it stops the tokenization process and returns a sentinel value (the maximum value of size_t).
Questions about implementation:
- Inside the body, I have written two different loops that do exactly the same thing. Which one is better? I find the second one (i.e. METHOD 2) more readable.
- Can either of the loops be simplified further?
And finally, what else could be improved?
1 Answer
The span for output is somewhat inflexible:
- it requires the caller to anticipate the number of results, and
- it requires contiguous storage, restricting the implementation.
It's probably better to make the function return an Input Range, enabling more flexible calling (e.g. I can then create a linked list of the results without overhead by using std::ranges::copy(tokenize(...), std::back_inserter(my_list)), or I could use a View to filter the results).
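As a rough illustration of that idea (a minimal sketch of my own, not the exact interface the answer has in mind), such a tokenizer can be composed from the standard range adaptors. Note that std::views::split treats its pattern as a whole subsequence, so this sketch splits on a single delimiter character rather than on a set of delimiter characters like the original; it also relies on the C++23 split semantics that yield contiguous subranges.

    #include <algorithm>
    #include <iterator>
    #include <list>
    #include <ranges>
    #include <string_view>

    // Sketch only: yields each token lazily as a std::string_view; no container is built.
    [[nodiscard]] auto tokenize_view( std::string_view str, char delimiter )
    {
        return str
             | std::views::split( delimiter )
             | std::views::filter( []( auto&& tok ) { return !std::ranges::empty( tok ); } )
             | std::views::transform( []( auto&& tok )
                 { return std::string_view( tok.begin( ), tok.end( ) ); } );
    }

    int main( )
    {
        // The caller chooses the storage, e.g. a linked list, with no intermediate vector.
        std::list<std::string_view> my_list;
        std::ranges::copy( tokenize_view( "1  %  ", ' ' ), std::back_inserter( my_list ) );  // "1", "%"
    }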
An alternative that's more like the current approach would be to accept an Output Range - possibly an infinite range such as a stream output range or a collection inserter.
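For instance (again just a sketch under names of my own choosing, not code from the answer), writing through a caller-supplied output iterator already removes the fixed-size restriction:

    #include <cstddef>
    #include <iterator>
    #include <string_view>

    // Sketch only: shown for the plain char case, templated on the output iterator.
    // The caller supplies std::back_inserter, an inserter into a list, an ostream
    // iterator, etc., so no token count has to be guessed up front.
    template < std::output_iterator<std::string_view> Out >
    constexpr std::size_t tokenize_to( std::string_view str, std::string_view delimiter, Out out )
    {
        std::size_t count { };
        auto start = str.find_first_not_of( delimiter );
        while ( start != std::string_view::npos )
        {
            const auto end = str.find_first_of( delimiter, start );
            *out++ = str.substr( start, end - start );
            ++count;
            start = str.find_first_not_of( delimiter, end );
        }
        return count;
    }

    // Usage:
    //   std::vector<std::string_view> tokens;
    //   tokenize_to( "1 % ", " \t", std::back_inserter( tokens ) );  // tokens: "1", "%"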
Without that change, I think it would be more useful to operate like std::snprintf() by returning the number of tokens that would have been produced if the output range were large enough. That allows callers to call again with a resized buffer if they need to.
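A minimal sketch of that contract (my own reading of the suggestion, shown for the plain char case only):

    #include <cstddef>
    #include <span>
    #include <string_view>

    // Sketch only: the loop keeps counting after the span is full, so the return value
    // may exceed found_tokens_OUT.size() -- the caller can detect truncation and retry
    // with a larger buffer, just as with std::snprintf.
    [[nodiscard]] constexpr std::size_t
    tokenize( std::string_view str, std::string_view delimiter,
              std::span<std::string_view> found_tokens_OUT ) noexcept
    {
        std::size_t found_tokens_count { };
        auto start = str.find_first_not_of( delimiter );
        while ( start != std::string_view::npos )
        {
            const auto end = str.find_first_of( delimiter, start );
            if ( found_tokens_count < found_tokens_OUT.size( ) )
                found_tokens_OUT[ found_tokens_count ] = str.substr( start, end - start );
            ++found_tokens_count;   // counted even when it no longer fits in the span
            start = str.find_first_not_of( delimiter, end );
        }
        return found_tokens_count;
    }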
I prefer the version with the range-based for, but that's probably irrelevant given that changes to the interface could well make that look completely different.
A minor observation: this assignment is pointless, since found_tokens_count is going out of scope anyway:

    return found_tokens_count = std::numeric_limits<size_t>::max( );

That can be replaced with simply:

    return std::numeric_limits<size_t>::max();
Comments:
- That's surely a lot of code, isn't it? Do you know where I should look if I want to go down that road? – digito_evo, May 24, 2023
- It's more code than you have, sure. I'd guess that the extra code is probably worth its weight in extra usability, but that's obviously a subjective matter. – Toby Speight, May 24, 2023
- It's interesting enough that I'm writing a version based on my suggestions. If I get time to complete it, I'll post it for review. – Toby Speight, May 25, 2023
- Tokeniser which yields a Range of string views. – Toby Speight, May 25, 2023
- […] a string_view on each iteration. Let the user figure out how to store it, or just perform an action on each element.