
A codepoint is represented in UTF-8 by one to four bytes. Given a codepoint, I need to determine the length (in bytes) of its UTF-8 representation. For this, I've written the following function:

#include <stdint.h>

char utf8_char_size(uint32_t ch) {
    if (ch < 0x80) return 1;    /* U+0000..U+007F */
    if (ch < 0x800) return 2;   /* U+0080..U+07FF */
    if (ch < 0x10000) return 3; /* U+0800..U+FFFF */
    return 4;                   /* U+10000..U+10FFFF */
}

It works, and it's pretty simple to understand. However, I'm afraid that branching three times per codepoint could lead to a large number of branch mispredictions, which would slow my code down tremendously. Is this code fine?

asked Dec 7, 2020 at 14:19
  • "slow my code down tremendously" - why? Sure, branch mispredictions can be bad, but there are many other things that can blow you out of the water. Why do you think you need to optimize this tiny function first? Commented Dec 7, 2020 at 14:48
  • @Dannnno So you'd explicitly look at each bit separately? Then I suspect return 1 + (ch >= 0x80) + (ch >= 0x800) + (ch >= 0x10000) would be faster... Commented Dec 7, 2020 at 15:27
  • Another possibility: return ffs(ch)["\1\1\1\1\1\1\1\2\2\2\2\2\2\2\2\3\3\3\3\3\3\3\3\3\4\4\4\4\4\4\4\4"]; or perhaps: return 1 + ((0xffffaaaa95554000 >> 2 * ffs(ch)) & 2) Commented Dec 7, 2020 at 16:02
  • ffs() is POSIX, and indeed not in the standard library. With C++20 you could use std::countl_zero() to similar effect. Commented Dec 7, 2020 at 16:09
  • @AryanParekh Please do create an answer containing the various alternative solutions and the benchmark results. Commented Dec 7, 2020 at 16:11

2 Answers


Since the C++ standard library declares these types in the std namespace, we should be using std::uint32_t here - or better still char32_t - rather than non-portably relying on the implementation also declaring uint32_t in the global namespace.

Returning a char seems suspect, as we'll be returning a small integer value rather than a meaningful character code.

Other than that, the implementation is clear enough. We might rephrase as

return 1
    + (ch >= 0x80)
    + (ch >= 0x800)
    + (ch >= 0x10000);

in the hope of reducing branches, but I'm not sure a decent optimising compiler would produce different code. You should inspect and profile the generated object code before choosing to make the code less clear.

An alternative on platforms with a quick std::countl_zero() would be to use that value in a switch or to index an array of resultant lengths. Again, a good optimiser should do this for you with the clear, readable source of the question.
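
For illustration, here's a minimal sketch of that table-lookup idea (my own, assuming C++20's <bit>; the helper name and the clamping of invalid widths are mine, not part of this answer):

#include <bit>      // std::countl_zero (C++20)
#include <cstdint>

constexpr std::uint8_t utf8_unit_size_clz(char32_t c) noexcept
{
    // Bit width of the codepoint: 32 minus the number of leading zeros.
    // Widths 0..7 need 1 byte, 8..11 need 2, 12..16 need 3, 17..21 need 4.
    constexpr std::uint8_t len[33] = {
        1, 1, 1, 1, 1, 1, 1, 1,          // widths 0..7
        2, 2, 2, 2,                      // widths 8..11
        3, 3, 3, 3, 3,                   // widths 12..16
        4, 4, 4, 4, 4,                   // widths 17..21
        4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, // widths 22..32: invalid, clamped to 4
    };
    return len[32 - std::countl_zero(static_cast<std::uint32_t>(c))];
}

static_assert(utf8_unit_size_clz(U'A') == 1);
static_assert(utf8_unit_size_clz(0xE9) == 2);    // é
static_assert(utf8_unit_size_clz(0x20AC) == 3);  // €
static_assert(utf8_unit_size_clz(0x1F600) == 4); // 😀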

answered Dec 21, 2022 at 22:41
  • Unlike std::uint32_t, char32_t is a built-in type, not a typedef defined in the std namespace. Commented Nov 30, 2023 at 7:39

Use Appropriate Types

The type of a UCS-4 codepoint should be char32_t. For a small unsigned value, I would use uint8_t—especially if you’re also going to be working with a UTF-8 string of 8-bit characters.

In practice, this doesn’t matter much, because the types implicitly convert to each other when you use this function in an expression.

Use constexpr and noexcept

Modern compilers seem to be able to optimize a function this simple without being told, but declaring it constexpr and noexcept will protect maintainers from making changes that would slow things down.

Avoid if

Toby Speight already gave you a good branchless implementation. The approach I use more often for code that I want to auto-vectorize, because it's more general, is this:

constexpr uint8_t utf8_unit_size(const char32_t c) noexcept
{
    return (c < 0x80)    ? 1 :
           (c < 0x800)   ? 2 :
           (c < 0x10000) ? 3 :
                           4;
}

In a case like this, it will produce the same code as yours, but its great advantage is that the compiler will only let you use constexpr expressions of a common type in each branch. And modern compilers are very good at optimizing code that follows those rules.
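Since it's constexpr, you can spot-check it at compile time (my own checks, not part of the answer, using the utf8_unit_size defined just above):

static_assert(utf8_unit_size(U'$') == 1);
static_assert(utf8_unit_size(0xA3) == 2);    // £
static_assert(utf8_unit_size(0x20AC) == 3);  // €
static_assert(utf8_unit_size(0x10348) == 4); // 𐍈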

In practice, Toby Speight's version does optimize better on modern optimizing compilers. Using the flags -std=c++20 -O3 -march=x86-64-v4, Clang 16 compiles the version using addition to

 mov edx, dword ptr [rax]
 xor esi, esi
 cmp edx, 127
 seta sil
 cmp edx, 2048
 sbb rbx, -1
 add rbx, rsi
 cmp edx, 65536
 sbb rbx, -1
 inc rbx

The optimizer is turning the additions into subtract-with-borrow instructions (sbb rbx, -1), which subtract the sum of the second operand and the carry flag. When the cmp instruction sets the carry, sbb therefore subtracts (-1 + 1), the same as adding 0. When the cmp instruction clears the carry, it subtracts (-1 + 0), the same as adding 1.
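
Spelled out as a compile-time check (my own illustration, not from the answer):

#include <cstdint>

// What `sbb rbx, -1` computes: dest = dest - (src + CF), with src = -1,
// which simplifies to dest + 1 - CF.
constexpr std::int64_t sbb_minus_one(std::int64_t dest, bool carry) noexcept
{
    return dest - (-1 + static_cast<std::int64_t>(carry));
}

static_assert(sbb_minus_one(5, true)  == 5); // carry set:   adds 0
static_assert(sbb_minus_one(5, false) == 6); // carry clear: adds 1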

However, the code using either if statements or ternary ? expressions produces conditional branches on GCC 13 or Clang 16. ICX 2022 compiles them into conditional moves. The version that adds relational expressions produces branchless code, although it still depends on the flags. So it should be slightly better than if/else on ICX, but significantly better on GCC or Clang.

In practice, you are likely to be working with a lot of codepoints in the same range, so the branch predictor would probably figure out quickly that (for example) most codepoints it’s seeing are ASCII, a few are in Latin Extended, and almost none are outside the BMP.

Benchmark Performance in Loops

The only way this function will ever be performance-critical is if you use it in a loop, which is how it would normally be used: you run functions like this on strings, not on isolated characters.

Here is a more realistic example of how you would use it: a template function that calculates the length of the UTF-8 conversion of any container holding a type compatible with char32_t. In other words, it works on anything wide-stringy: a std::u32string, a std::vector<uint32_t>, or, for that matter, an unsigned char[BUF_SIZE] holding ISO-8859-1 data.

#include <concepts>   // convertible_to
#include <functional> // plus
#include <numeric>    // transform_reduce
#include <ranges>
#include <stdint.h>
#include <stddef.h>
#include <string>

char utf8_char_size(uint32_t ch) {
    if (ch < 0x80) return 1;
    if (ch < 0x800) return 2;
    if (ch < 0x10000) return 3;
    return 4;
}

// Adaptation of Toby Speight's code:
constexpr uint8_t utf8_unit_size(const char32_t c) noexcept
{
    return 1U
         + (c >= 0x80)
         + (c >= 0x800)
         + (c >= 0x10000);
}

template<class T>
    requires std::ranges::input_range<T> &&
             std::convertible_to<std::ranges::range_value_t<T>, char32_t>
constexpr size_t utf8_str_size(const T& s) noexcept
{
    // Sum the UTF-8 length of each codepoint in the range.
    return std::transform_reduce(
        std::ranges::begin(s),
        std::ranges::end(s),
        (size_t)0,
        std::plus<size_t>{},
        utf8_unit_size
    );
}
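
For reference, a quick usage sketch (my own example, not part of the answer):

#include <iostream>
#include <string>

int main()
{
    // 'ï' (U+00EF) takes two UTF-8 bytes; the other four letters take one each.
    const std::u32string s = U"naïve";
    std::cout << utf8_str_size(s) << '\n'; // prints 6
}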

Using either your helper function or Toby Speight's, modern compilers turn this into highly optimized, vectorized code. You could even, with little additional effort, add a parallel execution policy. Here is the inner loop of the version calling your function, exactly as you wrote it, on Clang 16 with -std=c++20 -O3 -march=x86-64-v4:

.LBB1_8: # =>This Inner Loop Header: Depth=1
 vmovdqu ymm15, ymmword ptr [r9 - 192]
 vmovdqu64 ymm22, ymmword ptr [r9 - 160]
 vmovdqu ymm0, ymmword ptr [r9 - 128]
 vmovdqu64 ymm23, ymmword ptr [r9 - 96]
 vmovdqa64 ymm31, ymm15
 vmovdqa xmm1, xmmword ptr [rip + .LCPI1_0] # xmm1 = [0,4,8,12]
 vmovdqa64 ymm21, ymm1
 vpermt2d ymm31, ymm1, ymm22
 vmovdqa64 ymm16, ymm0
 vpermt2d ymm16, ymm1, ymm23
 vmovdqu ymm1, ymmword ptr [r9 - 32]
 vmovdqu ymm13, ymmword ptr [r9 - 64]
 vmovdqa64 ymm29, ymm13
 vmovdqu ymm2, ymmword ptr [r9 + 32]
 vmovdqu ymm14, ymmword ptr [r9]
 vpermt2d ymm29, ymm21, ymm1
 vmovdqa64 ymm30, ymm14
 vpermt2d ymm30, ymm21, ymm2
 vmovdqa64 ymm25, ymm15
 vmovdqa64 xmm21, xmmword ptr [rip + .LCPI1_1] # xmm21 = [1,5,9,13]
 vpermt2d ymm25, ymm21, ymm22
 vmovdqa64 ymm26, ymm0
 vpermt2d ymm26, ymm21, ymm23
 vmovdqa64 ymm27, ymm13
 vpermt2d ymm27, ymm21, ymm1
 vmovdqa64 ymm28, ymm14
 vpermt2d ymm28, ymm21, ymm2
 vmovdqa64 ymm21, ymm15
 vmovdqa xmm3, xmmword ptr [rip + .LCPI1_2] # xmm3 = [2,6,10,14]
 vpermt2d ymm21, ymm3, ymm22
 vmovdqa xmm12, xmmword ptr [rip + .LCPI1_3] # xmm12 = [3,7,11,15]
 vpermt2d ymm15, ymm12, ymm22
 vmovdqa64 ymm24, ymm0
 vpermt2d ymm24, ymm3, ymm23
 vpermt2d ymm0, ymm12, ymm23
 vmovdqa64 ymm22, ymm13
 vpermt2d ymm22, ymm3, ymm1
 vpermt2d ymm13, ymm12, ymm1
 vmovdqa64 ymm23, ymm14
 vpermt2d ymm23, ymm3, ymm2
 vpermt2d ymm14, ymm12, ymm2
 vpcmpltud k2, xmm31, xmm4
 vpcmpltud k1, xmm16, xmm4
 kmovw word ptr [rsp - 26], k1 # 2-byte Spill
 vpcmpltud k4, xmm31, xmm5
 vpcmpltud k3, xmm16, xmm5
 vpaddd xmm1, xmm31, xmm8
 vpcmpltud k5, xmm1, xmm9
 vpblendmq ymm1 {k4}, ymm7, ymm6
 vpaddd xmm2, xmm16, xmm8
 vpcmpltud k4, xmm2, xmm9
 vpblendmq ymm2 {k5}, ymm11, ymm10
 vmovdqa64 ymm1 {k2}, ymm2
 vpcmpltud k2, xmm29, xmm4
 vpcmpltud k5, xmm30, xmm4
 vpcmpltud k6, xmm29, xmm5
 vpcmpltud k7, xmm30, xmm5
 vpaddd xmm2, xmm29, xmm8
 vpcmpltud k1, xmm2, xmm9
 vpblendmq ymm2 {k3}, ymm7, ymm6
 vpaddd xmm16, xmm30, xmm8
 vpcmpltud k3, xmm16, xmm9
 vpaddq ymm30, ymm1, ymm17
 vpblendmq ymm1 {k4}, ymm11, ymm10
 kmovw k4, word ptr [rsp - 26] # 2-byte Reload
 vmovdqa64 ymm2 {k4}, ymm1
 vpblendmq ymm1 {k6}, ymm7, ymm6
 vpaddq ymm29, ymm2, ymm18
 vpblendmq ymm2 {k1}, ymm11, ymm10
 vmovdqa64 ymm1 {k2}, ymm2
 vpblendmq ymm2 {k7}, ymm7, ymm6
 vpaddq ymm18, ymm1, ymm19
 vpblendmq ymm1 {k3}, ymm11, ymm10
 vmovdqa64 ymm2 {k5}, ymm1
 vpaddq ymm17, ymm2, ymm20
 vpcmpltud k4, xmm25, xmm4
 vpcmpltud k3, xmm26, xmm4
 vpcmpltud k2, xmm27, xmm4
 vpcmpltud k1, xmm28, xmm4
 kmovw word ptr [rsp - 26], k1 # 2-byte Spill
 vpcmpltud k5, xmm25, xmm5
 vpcmpltud k6, xmm26, xmm5
 vpcmpltud k7, xmm27, xmm5
 vpcmpltud k1, xmm28, xmm5
 vpblendmq ymm1 {k5}, ymm7, ymm6
 vpaddd xmm2, xmm25, xmm8
 vpcmpltud k5, xmm2, xmm9
 vpblendmq ymm2 {k6}, ymm7, ymm6
 vpaddd xmm16, xmm26, xmm8
 vpcmpltud k6, xmm16, xmm9
 vpblendmq ymm16 {k7}, ymm7, ymm6
 vpaddd xmm19, xmm27, xmm8
 vpcmpltud k7, xmm19, xmm9
 vpblendmq ymm19 {k1}, ymm7, ymm6
 vpaddd xmm20, xmm28, xmm8
 vpcmpltud k1, xmm20, xmm9
 vpblendmq ymm20 {k5}, ymm11, ymm10
 vmovdqa64 ymm1 {k4}, ymm20
 vpblendmq ymm20 {k6}, ymm11, ymm10
 vmovdqa64 ymm2 {k3}, ymm20
 vpblendmq ymm20 {k7}, ymm11, ymm10
 vmovdqa64 ymm16 {k2}, ymm20
 vpblendmq ymm20 {k1}, ymm11, ymm10
 kmovw k1, word ptr [rsp - 26] # 2-byte Reload
 vmovdqa64 ymm19 {k1}, ymm20
 vpcmpltud k2, xmm21, xmm4
 vpcmpltud k1, xmm24, xmm4
 kmovw word ptr [rsp - 26], k1 # 2-byte Spill
 vpcmpltud k4, xmm21, xmm5
 vpcmpltud k3, xmm24, xmm5
 vpaddd xmm20, xmm21, xmm8
 vpcmpltud k5, xmm20, xmm9
 vpblendmq ymm20 {k4}, ymm7, ymm6
 vpaddd xmm21, xmm24, xmm8
 vpcmpltud k4, xmm21, xmm9
 vpblendmq ymm21 {k5}, ymm11, ymm10
 vmovdqa64 ymm20 {k2}, ymm21
 vpcmpltud k2, xmm22, xmm4
 vpcmpltud k5, xmm23, xmm4
 vpcmpltud k6, xmm22, xmm5
 vpcmpltud k7, xmm23, xmm5
 vpaddd xmm21, xmm22, xmm8
 vpcmpltud k1, xmm21, xmm9
 vpblendmq ymm21 {k3}, ymm7, ymm6
 vpaddd xmm22, xmm23, xmm8
 vpcmpltud k3, xmm22, xmm9
 vpaddq ymm1, ymm1, ymm20
 vpaddq ymm1, ymm30, ymm1
 vpblendmq ymm20 {k4}, ymm11, ymm10
 kmovw k4, word ptr [rsp - 26] # 2-byte Reload
 vmovdqa64 ymm21 {k4}, ymm20
 vpaddq ymm2, ymm2, ymm21
 vpblendmq ymm20 {k6}, ymm7, ymm6
 vpaddq ymm2, ymm29, ymm2
 vpblendmq ymm21 {k1}, ymm11, ymm10
 vmovdqa64 ymm20 {k2}, ymm21
 vpaddq ymm16, ymm16, ymm20
 vpblendmq ymm20 {k7}, ymm7, ymm6
 vpaddq ymm16, ymm18, ymm16
 vpblendmq ymm18 {k3}, ymm11, ymm10
 vmovdqa64 ymm20 {k5}, ymm18
 vpaddq ymm18, ymm19, ymm20
 vpaddq ymm20, ymm17, ymm18
 vmovdqa xmm3, xmmword ptr [rsp - 24] # 16-byte Reload
 vpcmpnleud k2, xmm15, xmm3
 vpcmpnleud k1, xmm0, xmm3
 kmovw word ptr [rsp - 26], k1 # 2-byte Spill
 vpcmpltud k4, xmm15, xmm5
 vpcmpltud k3, xmm0, xmm5
 vpaddd xmm15, xmm15, xmm8
 vpcmpltud k5, xmm15, xmm9
 vpblendmq ymm15 {k4}, ymm7, ymm6
 vpaddd xmm0, xmm8, xmm0
 vpcmpltud k4, xmm0, xmm9
 vpblendmq ymm0 {k5}, ymm11, ymm10
 vmovdqa64 ymm0 {k2}, ymm15
 vpcmpnleud k2, xmm13, xmm3
 vpcmpnleud k5, xmm14, xmm3
 vpcmpltud k6, xmm13, xmm5
 vpcmpltud k7, xmm14, xmm5
 vpaddd xmm13, xmm13, xmm8
 vpcmpltud k1, xmm13, xmm9
 vpblendmq ymm13 {k3}, ymm7, ymm6
 vpaddd xmm14, xmm14, xmm8
 vpcmpltud k3, xmm14, xmm9
 vpaddq ymm17, ymm1, ymm0
 vpblendmq ymm0 {k4}, ymm11, ymm10
 kmovw k4, word ptr [rsp - 26] # 2-byte Reload
 vmovdqa64 ymm0 {k4}, ymm13
 vpblendmq ymm1 {k6}, ymm7, ymm6
 vpaddq ymm18, ymm2, ymm0
 vpblendmq ymm0 {k1}, ymm11, ymm10
 vmovdqa64 ymm0 {k2}, ymm1
 vpblendmq ymm1 {k7}, ymm7, ymm6
 vpaddq ymm19, ymm16, ymm0
 vpblendmq ymm0 {k3}, ymm11, ymm10
 vmovdqa64 ymm0 {k5}, ymm1
 vpaddq ymm20, ymm20, ymm0
 add r9, 256
 add r10, -16
 jne .LBB1_8

This isn’t too bad: the compiler was able to turn utf8_str_size<u32string> into branchless vectorized code using the full-width registers. You can see that sometimes it has to fall back on 128-bit instead of 256-bit instructions, and sometimes it runs out of registers and needs to spill.

Using Toby Speight’s version as the transformation function gives us this output instead:

.LBB1_8: # =>This Inner Loop Header: Depth=1
 vmovdqu ymm12, ymmword ptr [rax - 64]
 vmovdqu ymm13, ymmword ptr [rax - 32]
 vmovdqu ymm11, ymmword ptr [rax]
 vmovdqu ymm14, ymmword ptr [rax + 32]
 vmovdqa64 ymm17, ymm12
 vmovdqa xmm0, xmmword ptr [rip + .LCPI1_0] # xmm0 = [0,4,8,12]
 vpermt2d ymm17, ymm0, ymm13
 vmovdqa64 ymm18, ymm11
 vpermt2d ymm18, ymm0, ymm14
 vmovdqa64 ymm19, ymm12
 vmovdqa xmm0, xmmword ptr [rip + .LCPI1_1] # xmm0 = [1,5,9,13]
 vpermt2d ymm19, ymm0, ymm13
 vmovdqa64 ymm20, ymm11
 vpermt2d ymm20, ymm0, ymm14
 vmovdqa ymm15, ymm12
 vmovdqa xmm0, xmmword ptr [rip + .LCPI1_2] # xmm0 = [2,6,10,14]
 vpermt2d ymm15, ymm0, ymm13
 vmovdqa64 ymm16, ymm11
 vpermt2d ymm16, ymm0, ymm14
 vpermt2d ymm12, ymm3, ymm13
 vpermt2d ymm11, ymm3, ymm14
 vpcmpnleud k1, xmm17, xmm4
 vpblendmq ymm13 {k1}, ymm6, ymm5
 vpcmpnleud k1, xmm18, xmm4
 vpblendmq ymm14 {k1}, ymm6, ymm5
 vpcmpnleud k1, xmm19, xmm4
 vpblendmq ymm21 {k1}, ymm6, ymm5
 vpcmpnleud k1, xmm20, xmm4
 vpblendmq ymm22 {k1}, ymm6, ymm5
 vpcmpnleud k1, xmm15, xmm4
 vpblendmq ymm23 {k1}, ymm6, ymm5
 vpcmpnleud k1, xmm16, xmm4
 vpblendmq ymm24 {k1}, ymm6, ymm5
 vpcmpnleud k1, xmm12, xmm4
 vpblendmq ymm25 {k1}, ymm6, ymm5
 vpcmpnleud k1, xmm11, xmm4
 vpblendmq ymm26 {k1}, ymm6, ymm5
 vpcmpnleud k0, xmm17, xmm7
 vpmovm2q ymm27, k0
 vpcmpnleud k0, xmm18, xmm7
 vpmovm2q ymm28, k0
 vpcmpnleud k0, xmm17, xmm8
 vpmovm2q ymm17, k0
 vpcmpnleud k0, xmm18, xmm8
 vpmovm2q ymm18, k0
 vpcmpnleud k0, xmm19, xmm7
 vpmovm2q ymm29, k0
 vpcmpnleud k0, xmm20, xmm7
 vpmovm2q ymm30, k0
 vpcmpnleud k0, xmm19, xmm8
 vpmovm2q ymm19, k0
 vpcmpnleud k0, xmm20, xmm8
 vpmovm2q ymm20, k0
 vpcmpnleud k0, xmm15, xmm7
 vpmovm2q ymm31, k0
 vpcmpnleud k0, xmm16, xmm7
 vpmovm2q ymm0, k0
 vpcmpnleud k0, xmm15, xmm8
 vpmovm2q ymm15, k0
 vpcmpnleud k0, xmm16, xmm8
 vpmovm2q ymm16, k0
 vpcmpnleud k0, xmm12, xmm7
 vpmovm2q ymm1, k0
 vpcmpnleud k0, xmm11, xmm7
 vpmovm2q ymm2, k0
 vpcmpnleud k0, xmm12, xmm8
 vpmovm2q ymm12, k0
 vpcmpnleud k0, xmm11, xmm8
 vpmovm2q ymm11, k0
 vpsubq ymm9, ymm9, ymm27
 vpaddq ymm9, ymm9, ymm13
 vpsubq ymm10, ymm10, ymm28
 vpaddq ymm10, ymm10, ymm14
 vpsubq ymm9, ymm9, ymm17
 vpsubq ymm10, ymm10, ymm18
 vpsubq ymm9, ymm9, ymm29
 vpaddq ymm9, ymm9, ymm21
 vpsubq ymm10, ymm10, ymm30
 vpaddq ymm10, ymm10, ymm22
 vpsubq ymm9, ymm9, ymm19
 vpsubq ymm10, ymm10, ymm20
 vpsubq ymm9, ymm9, ymm31
 vpaddq ymm9, ymm9, ymm23
 vpsubq ymm0, ymm10, ymm0
 vpaddq ymm0, ymm0, ymm24
 vpsubq ymm9, ymm9, ymm15
 vpsubq ymm0, ymm0, ymm16
 vpsubq ymm1, ymm9, ymm1
 vpaddq ymm1, ymm1, ymm25
 vpsubq ymm0, ymm0, ymm2
 vpaddq ymm0, ymm0, ymm26
 vpsubq ymm9, ymm1, ymm12
 vpsubq ymm10, ymm0, ymm11
 sub rax, -128
 add r10, -8
 jne .LBB1_8

This version needs fewer instructions in its inner loop (unrolled to operate on 32 words per iteration). Because it uses registers more efficiently, it does not need to spill any onto the stack. The additions are again transformed into subtractions, but this time it's because the vector comparisons turn true into an all-ones bitmask, which as a signed integer is -1, and subtracting -1 adds 1.

Finally, a link to the code on Godbolt.

answered May 23, 2023 at 19:22
  • Very informative, but it seems you did not want to explicitly time things (understandably). Maybe a bit more story would have made it even nicer. Commented May 24, 2023 at 21:03
  • @JoopEggen Find some extremely large multilingual text? And by extremely large, I mean vectorized code like this will breeze through megabytes in milliseconds, so time a loop that runs them a thousand or a million times. Commented May 24, 2023 at 22:03
  • @JoopEggen And if you want it to be really fast, add an execution policy of std::execution::par, or rewrite as a loop with #pragma omp parallel for simd, to enable both vectorization and multithreading. For best results, use ICX with -fiopenmp. (A sketch of this follows these comments.) Commented May 24, 2023 at 22:05
  • Probably better to use <cstdint> and <cstddef> rather than the C versions of these headers (despite the implications of the code presented for review). The only valid use for the C headers is in header files to be included into programs written in both languages. Commented May 25, 2023 at 6:08
  • Thanks greatly for providing such detailed analysis, especially of my answer that I wrote really quickly. Commented May 25, 2023 at 6:10
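
A rough sketch of that execution-policy suggestion (my own, untested; it reuses utf8_unit_size from the answer above, and I've used par_unseq rather than par to allow vectorization as well as multithreading; note the policy overload of std::transform_reduce is not constexpr and, with libstdc++, needs TBB to actually run in parallel):

#include <concepts>   // convertible_to
#include <execution>  // execution::par_unseq
#include <functional> // plus
#include <numeric>    // transform_reduce
#include <ranges>
#include <stddef.h>

template<class T>
    requires std::ranges::forward_range<T> &&
             std::convertible_to<std::ranges::range_value_t<T>, char32_t>
size_t utf8_str_size_par(const T& s)
{
    // Each element's UTF-8 length is computed independently,
    // so a parallel + vectorized reduction is safe here.
    return std::transform_reduce(
        std::execution::par_unseq,
        std::ranges::begin(s),
        std::ranges::end(s),
        (size_t)0,
        std::plus<size_t>{},
        utf8_unit_size
    );
}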
