A codepoint can be represented in UTF-8 by one to four bytes. Given a codepoint, I need to determine how many bytes its UTF-8 encoding would take. For this, I've written the following function:
char utf8_char_size(uint32_t ch) {
    if (ch < 0x80) return 1;
    if (ch < 0x800) return 2;
    if (ch < 0x10000) return 3;
    return 4;
}
It works, and it's pretty simple to understand. However, I'm afraid that branching three times per codepoint could lead to a large number of branch mispredictions, which would slow my code down tremendously. Is this code fine?
2 Answers
Since C++ types are declared in the std namespace, we should be using std::uint32_t here - or better still char32_t - rather than non-portably relying on the implementation also declaring uint32_t in the global namespace.
Returning a char seems suspect, as we'll be returning a small integer value rather than a meaningful character code.
Other than that, the implementation is clear enough. We might rephrase as
return 1
     + (ch >= 0x80)
     + (ch >= 0x800)
     + (ch >= 0x10000);
in the hope of reducing branches, but I'm not sure a decent optimising compiler would produce different code. You should inspect and profile the generated object code before choosing to make the code less clear.
An alternative on platforms with a quick std::countl_zero() would be to use that value in a switch or to index an array of resultant lengths. Again, a good optimiser should do this for you with the clear, readable source of the question.
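A rough sketch of the array-indexing idea, assuming C++20's <bit> header and codepoints no larger than 0x10FFFF (the helper name and table here are purely illustrative):

#include <bit>      // std::countl_zero (C++20)
#include <cstdint>

// Map the position of the highest set bit to a UTF-8 length.
constexpr std::uint8_t utf8_size_from_clz(char32_t ch) noexcept
{
    // Highest set bit 0-6 -> 1 byte, 7-10 -> 2, 11-15 -> 3, 16 and up -> 4.
    constexpr std::uint8_t lengths[32] = {
        1, 1, 1, 1, 1, 1, 1,                 // bits 0-6   (< 0x80)
        2, 2, 2, 2,                          // bits 7-10  (< 0x800)
        3, 3, 3, 3, 3,                       // bits 11-15 (< 0x10000)
        4, 4, 4, 4, 4,                       // bits 16-20 (<= 0x10FFFF)
        4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4      // out-of-range input
    };
    const int msb = 31 - std::countl_zero(static_cast<std::uint32_t>(ch) | 1u);
    return lengths[msb];
}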
-
\$\begingroup\$ Unlike std::uint32_t, char32_t is a built-in type, not a typedef defined in the std namespace. \$\endgroup\$ – MatG, Nov 30, 2023 at 7:39
Use Appropriate Types
The type of a UCS-4 codepoint should be char32_t. For a small unsigned value, I would use uint8_t, especially if you’re also going to be working with a UTF-8 string of 8-bit characters.
In practice, this doesn’t matter much, because the types implicitly convert to each other when you use this function in an expression.
Use constexpr and noexcept
Modern compilers seem to be able to optimize a function this simple without being told, but this will protect maintainers from making some changes that will slow things down.
Avoid if
Toby Speight already gave you a good branchless implementation. The approach I more often use for code that I want to auto-vectorize, because it’s more general, is this:
constexpr uint8_t utf8_unit_size(const char32_t c) noexcept
{
    return (c < 0x80)    ? 1 :
           (c < 0x800)   ? 2 :
           (c < 0x10000) ? 3 :
                           4;
}
In a case like this, it will produce the same code as yours, but the great advantage of it is that the compiler will only let you use constexpr expressions of the same type in each branch. And modern compilers are very good at optimizing code that follows those rules.
In practice, Toby Speight’s version does optimize better on modern optimizing compilers. Using the flags -std=c++20 -O3 -march=x86-64-v4, Clang 16 compiles the version using addition to
mov edx, dword ptr [rax]
xor esi, esi
cmp edx, 127
seta sil
cmp edx, 2048
sbb rbx, -1
add rbx, rsi
cmp edx, 65536
sbb rbx, -1
inc rbx
The optimizer is turning the additions into subtract-with-borrow instructions (sbb rbx, -1), which add the carry flag to the second operand. When the cmp instruction sets the carry, it therefore subtracts (-1 + 1), the same as adding 0. When the cmp instruction clears the carry, it subtracts (-1 + 0), the same as adding 1.
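To see why that is equivalent to the addition in the source, here is a small illustrative check (a sketch of the arithmetic, not compiler output; the helper name and sample value are made up): cmp leaves the carry flag set exactly when the codepoint is below the threshold, and sbb acc, -1 computes acc - (-1) - CF, so the net effect is acc += (c >= threshold).

// Illustrative only: what one cmp/sbb pair contributes to the running length.
constexpr unsigned add_if_at_least(unsigned acc, char32_t c, char32_t threshold) noexcept
{
    const unsigned carry = (c < threshold) ? 1u : 0u; // CF after "cmp c, threshold"
    return acc + 1u - carry;                          // what "sbb acc, -1" leaves behind
}
static_assert(add_if_at_least(1, U'\u1234', 0x800) == 2);   // 0x1234 >= 0x800   -> adds 1
static_assert(add_if_at_least(2, U'\u1234', 0x10000) == 2); // 0x1234 <  0x10000 -> adds 0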
However, the code using either if statements or ternary ? expressions produces conditional branches on GCC 13 or Clang 16. ICX 2022 compiles them into conditional moves. The version that adds relational expressions produces branchless code, although it still depends on the flags. So it should be slightly better than if/else on ICX, but significantly better on GCC or Clang.
In practice, you are likely to be working with a lot of codepoints in the same range, so the branch predictor would probably figure out quickly that (for example) most codepoints it’s seeing are ASCII, a few are in Latin Extended, and almost none are outside the BMP.
Benchmark Performance in Loops
The only way this function will ever be performance-critical is if you use it in a loop, which is how it would normally be used: you run functions like this on strings, not isolated characters.
Here is a more realistic example of how you would be using this: a template function to calculate the length of the UTF-8 conversion of any container holding a type compatible with char32_t. In other words, it works on anything wide-stringy: a std::u32string, a std::vector<uint32_t>, or, for that matter, an unsigned char[BUF_SIZE] holding ISO-8859-1 data.
#include <algorithm>
#include <concepts>   // convertible_to
#include <functional> // plus
#include <numeric>    // transform_reduce
#include <ranges>
#include <stdint.h>
#include <stddef.h>
#include <string>
char utf8_char_size(uint32_t ch) {
    if (ch < 0x80) return 1;
    if (ch < 0x800) return 2;
    if (ch < 0x10000) return 3;
    return 4;
}

// Adaptation of Toby Speight's code:
constexpr uint8_t utf8_unit_size(const char32_t c) noexcept
{
    return 1U
         + (c >= 0x80)
         + (c >= 0x800)
         + (c >= 0x10000);
}

template<class T>
    requires std::ranges::input_range<T> &&
             std::convertible_to<std::ranges::range_value_t<T>, char32_t>
constexpr size_t utf8_str_size(const T& s) noexcept
{
    return std::transform_reduce(
        std::ranges::begin(s),
        std::ranges::end(s),
        (size_t)0,
        std::plus<size_t>{},
        utf8_unit_size
    );
}
Using either your helper function or Toby Speight’s, modern compilers turn this into highly-optimized, vectorized code. You could even, with little additional effort, add a parallel execution policy. Here is the inner loop of the version calling your function, exactly as you wrote it, on Clang 16 with -std=c++20 -O3 -march=x86-64-v4:
.LBB1_8: # =>This Inner Loop Header: Depth=1
vmovdqu ymm15, ymmword ptr [r9 - 192]
vmovdqu64 ymm22, ymmword ptr [r9 - 160]
vmovdqu ymm0, ymmword ptr [r9 - 128]
vmovdqu64 ymm23, ymmword ptr [r9 - 96]
vmovdqa64 ymm31, ymm15
vmovdqa xmm1, xmmword ptr [rip + .LCPI1_0] # xmm1 = [0,4,8,12]
vmovdqa64 ymm21, ymm1
vpermt2d ymm31, ymm1, ymm22
vmovdqa64 ymm16, ymm0
vpermt2d ymm16, ymm1, ymm23
vmovdqu ymm1, ymmword ptr [r9 - 32]
vmovdqu ymm13, ymmword ptr [r9 - 64]
vmovdqa64 ymm29, ymm13
vmovdqu ymm2, ymmword ptr [r9 + 32]
vmovdqu ymm14, ymmword ptr [r9]
vpermt2d ymm29, ymm21, ymm1
vmovdqa64 ymm30, ymm14
vpermt2d ymm30, ymm21, ymm2
vmovdqa64 ymm25, ymm15
vmovdqa64 xmm21, xmmword ptr [rip + .LCPI1_1] # xmm21 = [1,5,9,13]
vpermt2d ymm25, ymm21, ymm22
vmovdqa64 ymm26, ymm0
vpermt2d ymm26, ymm21, ymm23
vmovdqa64 ymm27, ymm13
vpermt2d ymm27, ymm21, ymm1
vmovdqa64 ymm28, ymm14
vpermt2d ymm28, ymm21, ymm2
vmovdqa64 ymm21, ymm15
vmovdqa xmm3, xmmword ptr [rip + .LCPI1_2] # xmm3 = [2,6,10,14]
vpermt2d ymm21, ymm3, ymm22
vmovdqa xmm12, xmmword ptr [rip + .LCPI1_3] # xmm12 = [3,7,11,15]
vpermt2d ymm15, ymm12, ymm22
vmovdqa64 ymm24, ymm0
vpermt2d ymm24, ymm3, ymm23
vpermt2d ymm0, ymm12, ymm23
vmovdqa64 ymm22, ymm13
vpermt2d ymm22, ymm3, ymm1
vpermt2d ymm13, ymm12, ymm1
vmovdqa64 ymm23, ymm14
vpermt2d ymm23, ymm3, ymm2
vpermt2d ymm14, ymm12, ymm2
vpcmpltud k2, xmm31, xmm4
vpcmpltud k1, xmm16, xmm4
kmovw word ptr [rsp - 26], k1 # 2-byte Spill
vpcmpltud k4, xmm31, xmm5
vpcmpltud k3, xmm16, xmm5
vpaddd xmm1, xmm31, xmm8
vpcmpltud k5, xmm1, xmm9
vpblendmq ymm1 {k4}, ymm7, ymm6
vpaddd xmm2, xmm16, xmm8
vpcmpltud k4, xmm2, xmm9
vpblendmq ymm2 {k5}, ymm11, ymm10
vmovdqa64 ymm1 {k2}, ymm2
vpcmpltud k2, xmm29, xmm4
vpcmpltud k5, xmm30, xmm4
vpcmpltud k6, xmm29, xmm5
vpcmpltud k7, xmm30, xmm5
vpaddd xmm2, xmm29, xmm8
vpcmpltud k1, xmm2, xmm9
vpblendmq ymm2 {k3}, ymm7, ymm6
vpaddd xmm16, xmm30, xmm8
vpcmpltud k3, xmm16, xmm9
vpaddq ymm30, ymm1, ymm17
vpblendmq ymm1 {k4}, ymm11, ymm10
kmovw k4, word ptr [rsp - 26] # 2-byte Reload
vmovdqa64 ymm2 {k4}, ymm1
vpblendmq ymm1 {k6}, ymm7, ymm6
vpaddq ymm29, ymm2, ymm18
vpblendmq ymm2 {k1}, ymm11, ymm10
vmovdqa64 ymm1 {k2}, ymm2
vpblendmq ymm2 {k7}, ymm7, ymm6
vpaddq ymm18, ymm1, ymm19
vpblendmq ymm1 {k3}, ymm11, ymm10
vmovdqa64 ymm2 {k5}, ymm1
vpaddq ymm17, ymm2, ymm20
vpcmpltud k4, xmm25, xmm4
vpcmpltud k3, xmm26, xmm4
vpcmpltud k2, xmm27, xmm4
vpcmpltud k1, xmm28, xmm4
kmovw word ptr [rsp - 26], k1 # 2-byte Spill
vpcmpltud k5, xmm25, xmm5
vpcmpltud k6, xmm26, xmm5
vpcmpltud k7, xmm27, xmm5
vpcmpltud k1, xmm28, xmm5
vpblendmq ymm1 {k5}, ymm7, ymm6
vpaddd xmm2, xmm25, xmm8
vpcmpltud k5, xmm2, xmm9
vpblendmq ymm2 {k6}, ymm7, ymm6
vpaddd xmm16, xmm26, xmm8
vpcmpltud k6, xmm16, xmm9
vpblendmq ymm16 {k7}, ymm7, ymm6
vpaddd xmm19, xmm27, xmm8
vpcmpltud k7, xmm19, xmm9
vpblendmq ymm19 {k1}, ymm7, ymm6
vpaddd xmm20, xmm28, xmm8
vpcmpltud k1, xmm20, xmm9
vpblendmq ymm20 {k5}, ymm11, ymm10
vmovdqa64 ymm1 {k4}, ymm20
vpblendmq ymm20 {k6}, ymm11, ymm10
vmovdqa64 ymm2 {k3}, ymm20
vpblendmq ymm20 {k7}, ymm11, ymm10
vmovdqa64 ymm16 {k2}, ymm20
vpblendmq ymm20 {k1}, ymm11, ymm10
kmovw k1, word ptr [rsp - 26] # 2-byte Reload
vmovdqa64 ymm19 {k1}, ymm20
vpcmpltud k2, xmm21, xmm4
vpcmpltud k1, xmm24, xmm4
kmovw word ptr [rsp - 26], k1 # 2-byte Spill
vpcmpltud k4, xmm21, xmm5
vpcmpltud k3, xmm24, xmm5
vpaddd xmm20, xmm21, xmm8
vpcmpltud k5, xmm20, xmm9
vpblendmq ymm20 {k4}, ymm7, ymm6
vpaddd xmm21, xmm24, xmm8
vpcmpltud k4, xmm21, xmm9
vpblendmq ymm21 {k5}, ymm11, ymm10
vmovdqa64 ymm20 {k2}, ymm21
vpcmpltud k2, xmm22, xmm4
vpcmpltud k5, xmm23, xmm4
vpcmpltud k6, xmm22, xmm5
vpcmpltud k7, xmm23, xmm5
vpaddd xmm21, xmm22, xmm8
vpcmpltud k1, xmm21, xmm9
vpblendmq ymm21 {k3}, ymm7, ymm6
vpaddd xmm22, xmm23, xmm8
vpcmpltud k3, xmm22, xmm9
vpaddq ymm1, ymm1, ymm20
vpaddq ymm1, ymm30, ymm1
vpblendmq ymm20 {k4}, ymm11, ymm10
kmovw k4, word ptr [rsp - 26] # 2-byte Reload
vmovdqa64 ymm21 {k4}, ymm20
vpaddq ymm2, ymm2, ymm21
vpblendmq ymm20 {k6}, ymm7, ymm6
vpaddq ymm2, ymm29, ymm2
vpblendmq ymm21 {k1}, ymm11, ymm10
vmovdqa64 ymm20 {k2}, ymm21
vpaddq ymm16, ymm16, ymm20
vpblendmq ymm20 {k7}, ymm7, ymm6
vpaddq ymm16, ymm18, ymm16
vpblendmq ymm18 {k3}, ymm11, ymm10
vmovdqa64 ymm20 {k5}, ymm18
vpaddq ymm18, ymm19, ymm20
vpaddq ymm20, ymm17, ymm18
vmovdqa xmm3, xmmword ptr [rsp - 24] # 16-byte Reload
vpcmpnleud k2, xmm15, xmm3
vpcmpnleud k1, xmm0, xmm3
kmovw word ptr [rsp - 26], k1 # 2-byte Spill
vpcmpltud k4, xmm15, xmm5
vpcmpltud k3, xmm0, xmm5
vpaddd xmm15, xmm15, xmm8
vpcmpltud k5, xmm15, xmm9
vpblendmq ymm15 {k4}, ymm7, ymm6
vpaddd xmm0, xmm8, xmm0
vpcmpltud k4, xmm0, xmm9
vpblendmq ymm0 {k5}, ymm11, ymm10
vmovdqa64 ymm0 {k2}, ymm15
vpcmpnleud k2, xmm13, xmm3
vpcmpnleud k5, xmm14, xmm3
vpcmpltud k6, xmm13, xmm5
vpcmpltud k7, xmm14, xmm5
vpaddd xmm13, xmm13, xmm8
vpcmpltud k1, xmm13, xmm9
vpblendmq ymm13 {k3}, ymm7, ymm6
vpaddd xmm14, xmm14, xmm8
vpcmpltud k3, xmm14, xmm9
vpaddq ymm17, ymm1, ymm0
vpblendmq ymm0 {k4}, ymm11, ymm10
kmovw k4, word ptr [rsp - 26] # 2-byte Reload
vmovdqa64 ymm0 {k4}, ymm13
vpblendmq ymm1 {k6}, ymm7, ymm6
vpaddq ymm18, ymm2, ymm0
vpblendmq ymm0 {k1}, ymm11, ymm10
vmovdqa64 ymm0 {k2}, ymm1
vpblendmq ymm1 {k7}, ymm7, ymm6
vpaddq ymm19, ymm16, ymm0
vpblendmq ymm0 {k3}, ymm11, ymm10
vmovdqa64 ymm0 {k5}, ymm1
vpaddq ymm20, ymm20, ymm0
add r9, 256
add r10, -16
jne .LBB1_8
This isn’t too bad: the compiler was able to turn utf8_str_size<u32string> into branchless vectorized code using the full-width registers. You can see that sometimes it has to fall back on 128-bit instead of 256-bit instructions, and sometimes it runs out of registers and needs to spill.
Using Toby Speight’s version as the transformation function gives us this output instead:
.LBB1_8: # =>This Inner Loop Header: Depth=1
vmovdqu ymm12, ymmword ptr [rax - 64]
vmovdqu ymm13, ymmword ptr [rax - 32]
vmovdqu ymm11, ymmword ptr [rax]
vmovdqu ymm14, ymmword ptr [rax + 32]
vmovdqa64 ymm17, ymm12
vmovdqa xmm0, xmmword ptr [rip + .LCPI1_0] # xmm0 = [0,4,8,12]
vpermt2d ymm17, ymm0, ymm13
vmovdqa64 ymm18, ymm11
vpermt2d ymm18, ymm0, ymm14
vmovdqa64 ymm19, ymm12
vmovdqa xmm0, xmmword ptr [rip + .LCPI1_1] # xmm0 = [1,5,9,13]
vpermt2d ymm19, ymm0, ymm13
vmovdqa64 ymm20, ymm11
vpermt2d ymm20, ymm0, ymm14
vmovdqa ymm15, ymm12
vmovdqa xmm0, xmmword ptr [rip + .LCPI1_2] # xmm0 = [2,6,10,14]
vpermt2d ymm15, ymm0, ymm13
vmovdqa64 ymm16, ymm11
vpermt2d ymm16, ymm0, ymm14
vpermt2d ymm12, ymm3, ymm13
vpermt2d ymm11, ymm3, ymm14
vpcmpnleud k1, xmm17, xmm4
vpblendmq ymm13 {k1}, ymm6, ymm5
vpcmpnleud k1, xmm18, xmm4
vpblendmq ymm14 {k1}, ymm6, ymm5
vpcmpnleud k1, xmm19, xmm4
vpblendmq ymm21 {k1}, ymm6, ymm5
vpcmpnleud k1, xmm20, xmm4
vpblendmq ymm22 {k1}, ymm6, ymm5
vpcmpnleud k1, xmm15, xmm4
vpblendmq ymm23 {k1}, ymm6, ymm5
vpcmpnleud k1, xmm16, xmm4
vpblendmq ymm24 {k1}, ymm6, ymm5
vpcmpnleud k1, xmm12, xmm4
vpblendmq ymm25 {k1}, ymm6, ymm5
vpcmpnleud k1, xmm11, xmm4
vpblendmq ymm26 {k1}, ymm6, ymm5
vpcmpnleud k0, xmm17, xmm7
vpmovm2q ymm27, k0
vpcmpnleud k0, xmm18, xmm7
vpmovm2q ymm28, k0
vpcmpnleud k0, xmm17, xmm8
vpmovm2q ymm17, k0
vpcmpnleud k0, xmm18, xmm8
vpmovm2q ymm18, k0
vpcmpnleud k0, xmm19, xmm7
vpmovm2q ymm29, k0
vpcmpnleud k0, xmm20, xmm7
vpmovm2q ymm30, k0
vpcmpnleud k0, xmm19, xmm8
vpmovm2q ymm19, k0
vpcmpnleud k0, xmm20, xmm8
vpmovm2q ymm20, k0
vpcmpnleud k0, xmm15, xmm7
vpmovm2q ymm31, k0
vpcmpnleud k0, xmm16, xmm7
vpmovm2q ymm0, k0
vpcmpnleud k0, xmm15, xmm8
vpmovm2q ymm15, k0
vpcmpnleud k0, xmm16, xmm8
vpmovm2q ymm16, k0
vpcmpnleud k0, xmm12, xmm7
vpmovm2q ymm1, k0
vpcmpnleud k0, xmm11, xmm7
vpmovm2q ymm2, k0
vpcmpnleud k0, xmm12, xmm8
vpmovm2q ymm12, k0
vpcmpnleud k0, xmm11, xmm8
vpmovm2q ymm11, k0
vpsubq ymm9, ymm9, ymm27
vpaddq ymm9, ymm9, ymm13
vpsubq ymm10, ymm10, ymm28
vpaddq ymm10, ymm10, ymm14
vpsubq ymm9, ymm9, ymm17
vpsubq ymm10, ymm10, ymm18
vpsubq ymm9, ymm9, ymm29
vpaddq ymm9, ymm9, ymm21
vpsubq ymm10, ymm10, ymm30
vpaddq ymm10, ymm10, ymm22
vpsubq ymm9, ymm9, ymm19
vpsubq ymm10, ymm10, ymm20
vpsubq ymm9, ymm9, ymm31
vpaddq ymm9, ymm9, ymm23
vpsubq ymm0, ymm10, ymm0
vpaddq ymm0, ymm0, ymm24
vpsubq ymm9, ymm9, ymm15
vpsubq ymm0, ymm0, ymm16
vpsubq ymm1, ymm9, ymm1
vpaddq ymm1, ymm1, ymm25
vpsubq ymm0, ymm0, ymm2
vpaddq ymm0, ymm0, ymm26
vpsubq ymm9, ymm1, ymm12
vpsubq ymm10, ymm0, ymm11
sub rax, -128
add r10, -8
jne .LBB1_8
This version needs fewer instructions inside the same loop (unrolled to operate on 32 words per iteration). Because it uses registers more efficiently, it does not need to spill any onto the stack. The additions are again transformed into subtractions, but this time, it’s because the vector comparisons turn true into the bitmask 0xFFFFFFFF, which is -1L.
Finally, a link to the code on Godbolt.
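For completeness, a rough sketch of the parallel execution policy mentioned above (illustrative only: it reuses the utf8_unit_size helper from the listing, here with std::execution::par_unseq so the library may both multithread and vectorize; with libstdc++ the parallel policies also need TBB at link time):

#include <cstddef>    // std::size_t
#include <execution>  // std::execution::par_unseq
#include <functional> // std::plus
#include <numeric>    // std::transform_reduce
#include <string>

std::size_t utf8_str_size_par(const std::u32string& s)
{
    // Same reduction as utf8_str_size, with an execution policy added.
    return std::transform_reduce(
        std::execution::par_unseq,
        s.begin(), s.end(),
        std::size_t{0},
        std::plus<std::size_t>{},
        utf8_unit_size            // the branchless helper from the listing above
    );
}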
-
\$\begingroup\$ Very informative, but it seems you did not want to explicitly time things (understandably). Maybe a bit more story would have made it even nicer. \$\endgroup\$ – Joop Eggen, May 24, 2023 at 21:03
-
\$\begingroup\$ @JoopEggen Find some extremely large multilingual text? And by extremely large, I mean vectorized code like this will breeze through megabytes in milliseconds, so time a loop that runs them a thousand or a million times. \$\endgroup\$ – Davislor, May 24, 2023 at 22:03
-
\$\begingroup\$ @JoopEggen And if you want it to be really fast, add an execution policy of std::execution::par, or rewrite as a loop with #pragma omp parallel for simd, to enable both vectorization and multithreading. For best results, use ICX with -fiopenmp. \$\endgroup\$ – Davislor, May 24, 2023 at 22:05
-
\$\begingroup\$ Probably better to use <cstdint> and <cstddef> rather than the C versions of these headers (despite the implications of the code presented for review). The only valid use for the C headers is in header files to be included into programs written in both languages. \$\endgroup\$ – Toby Speight, May 25, 2023 at 6:08
-
\$\begingroup\$ Thanks greatly for providing such detailed analysis, especially of my answer that I wrote really quickly. \$\endgroup\$ – Toby Speight, May 25, 2023 at 6:10
-
return 1 + (ch >= 0x80) + (ch >= 0x800) + (ch >= 0x10000) would be faster... \$\endgroup\$
-
return ffs(ch)["\1\1\1\1\1\1\1\2\2\2\2\2\2\2\2\3\3\3\3\3\3\3\3\3\4\4\4\4\4\4\4\4"]; or perhaps: return 1 + ((0xffffaaaa95554000 >> 2 * ffs(ch)) & 2) \$\endgroup\$
-
ffs() is POSIX, and indeed not in the standard library. With C++20 you could use std::countl_zero() to similar effect. \$\endgroup\$