xxUTF is a C library that implements Unicode text transformation algorithms at speed using SIMD. Current algorithms supported:
- Normalization (NFD, NFC, NFKD, NFKC)
- Case folding
All algorithms are compatible with UTF-8, UTF-16LE, and UTF-16BE. Further helper functions are defined for efficient and correct streaming versions of these algorithms. See the API for details.
xxUTF never allocates memory, does not depend on libc, cannot fail, and has the fastest open source implementations of the listed algorithms available. All functions are comprehensively tested using both the available Unicode test suites and a fuzzer.
xxUTF supports Unicode 16.0.0 and below.
xxUTF is distributed as an amalgamation with a single header file, available at the release page. This is similar to what SQLite does.
Example C program:
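    #include <xxutf.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *s = "Ȁ character that needs to be decomposed";
        size_t length = strlen(s);
        printf("Old length: %zu\n", length);

        // NFD can grow the text, so the output buffer leaves some headroom.
        uint8_t out[64];
        // The API takes byte pointers, so the C string is cast explicitly.
        size_t out_len = xxutf_normalize_utf8_nfd((const uint8_t *)s, length, out);
        printf("New length: %zu\n", out_len);
        return 0;
    }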
One major goal of xxUTF is to have the simplest, most predictable API surface possible. As such, one function call usually suffices for the core functionality.
xxUTF's core API follows this pattern:
    /// Normalize the Unicode text bytes in the given normalization form, returning the
    /// length of the output. All lengths are measured in bytes. The input is expected
    /// to be valid under the specified encoding.
    ///
    /// It is assumed that the output buffer is large enough to fit the full normalized
    /// form of the input. The encoding of the output will match the encoding of the
    /// input. Note that, regardless of whether the input is encoded in UTF-16 or UTF-8,
    /// the input is still a byte pointer. xxUTF does not require the input to be aligned,
    /// and the performance difference is marginal even if it is.
    size_t xxutf_normalize_ENCODING_FORM(const uint8_t *input, size_t length, uint8_t *out);

    /// Check if the input is already in the specified form, returning `true` if so.
    /// Additionally, the `out_length` out parameter is set to the size of an output
    /// buffer needed to hold the normalized form of the input. The input is expected
    /// to be valid under the specified encoding.
    ///
    /// Note that for NFC and NFKC, `out_length` is actually an upper bound calculated
    /// from the input, not the exact size. This is the case for speed reasons. All lengths
    /// are measured in bytes.
    bool xxutf_normalize_ENCODING_FORM_check(const uint8_t *input, size_t length, size_t *out_length);

    /// Case fold the Unicode text bytes, returning the length of the output. All
    /// lengths are measured in bytes. The input is expected to be valid under the
    /// specified encoding.
    ///
    /// It is assumed that the output buffer is large enough to fit the full case folded
    /// form of the input. The encoding of the output will match the encoding of the input.
    /// Note that, regardless of whether the input is encoded in UTF-16 or UTF-8, the input
    /// is still a byte pointer. xxUTF does not require the input to be aligned, and the
    /// performance difference is marginal even if it is.
    size_t xxutf_casefold_ENCODING(const uint8_t *input, size_t length, uint8_t *out);

    /// Check if the input is already case folded, returning `true` if so. Additionally,
    /// the `out_length` out parameter is set to the size of an output buffer needed to
    /// hold the case folded form of the input. The input is expected to be valid under the
    /// specified encoding.
    bool xxutf_casefold_ENCODING_check(const uint8_t *input, size_t length, size_t *out_length);
The valid encodings are:
- utf8
- utf16le
- utf16be
The valid normalization forms are:
- nfd
- nfc
- nfkd
- nfkc
For example:
    size_t out_length_bound;
    bool check = xxutf_normalize_utf16le_nfkc_check(input, length, &out_length_bound);
    if (check) {
        printf("Already NFKC normalized!\n");
    } else {
        // Allocate according to `out_length_bound`
        uint8_t *out = malloc(out_length_bound);
        size_t out_length = xxutf_normalize_utf16le_nfkc(input, length, out);
        // UTF-16 output contains zero bytes, so it cannot be printed with %s;
        // report its length in bytes instead.
        printf("NFKC normalized to %zu bytes\n", out_length);
        free(out);
    }
Like many Unicode processing libraries, xxUTF supports a two-pass pattern:
- Get the expected length of the output without writing.
- Actually run the algorithm on a properly sized output buffer.
You might wonder why we need the length functions. After all, finding a
universal upper bound that depends only on the size of the input is not hard.
But using such a bound is often wasteful. For example, as of Unicode 16.0, the
largest compatibility decomposition (i.e. from NFKD or NFKC) in the
Basic Multilingual Plane
is that of the code point U+FDFA. This character is three bytes wide in
UTF-8 pre-decomposition, but expands to an enormous 33 bytes post-decomposition.
The best upper bound for NFKD and NFKC in UTF-8 would thus be some number around
n * 11, where n is the size of the input in bytes.
So, unless a lot of prior information is known about the incoming text, use the length functions to make sure buffer overflows don't happen.
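To make the cost of the universal bound concrete, here is a minimal sketch of what computing it would look like. The helper name `worst_case_nfkd_utf8_bound` and the macro are hypothetical and not part of xxUTF; the factor of 11 is taken from the U+FDFA example above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Expansion factor from the U+FDFA example: 3 bytes in, 33 bytes out. */
#define WORST_CASE_NFKD_UTF8_FACTOR 11u

/* Hypothetical helper (not an xxUTF function): computes input_len * 11
 * and reports overflow instead of silently wrapping around. */
static bool worst_case_nfkd_utf8_bound(size_t input_len, size_t *out_bound) {
    assert(out_bound != NULL);
    if (input_len > SIZE_MAX / WORST_CASE_NFKD_UTF8_FACTOR) {
        return false; /* the bound does not fit in a size_t */
    }
    *out_bound = input_len * WORST_CASE_NFKD_UTF8_FACTOR;
    return true;
}
```

Under this bound, a 1 MiB input would reserve 11 MiB of output space no matter what the text contains, which is why asking the `_check` functions for a size derived from the actual input is usually the better choice.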
The streaming versions of some Unicode algorithms can usually be implemented naively (such as with case folding). However, not all algorithms have such nice properties.
Mainly, normalizing text in a streaming manner requires some care. The problem is that normalization forms are not closed under string concatenation. In other words:
normalize(x) + normalize(y) = normalize(x + y)
does not hold for all Unicode strings x and y. Read the
Unicode normalization specification
for more details.
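A small counterexample may help here. The sketch below does not call xxUTF; the byte strings are normalized by hand, and the helper `concat_equals` is hypothetical. It takes x = "a" (U+0061) and y = U+0301 (combining acute accent), each of which is already in NFC on its own, and compares their concatenation with the NFC form of the combined text, which is the single code point U+00E1.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* "a" (U+0061): already in NFC. */
static const uint8_t X[] = { 0x61 };
/* U+0301 COMBINING ACUTE ACCENT on its own: also already in NFC. */
static const uint8_t Y[] = { 0xCC, 0x81 };
/* NFC of U+0061 U+0301 is U+00E1, encoded in UTF-8 as C3 A1. */
static const uint8_t NFC_OF_XY[] = { 0xC3, 0xA1 };

/* Hypothetical helper: concatenates x and y into a scratch buffer and
 * compares the result byte for byte with `expected`. */
static bool concat_equals(const uint8_t *x, size_t x_len,
                          const uint8_t *y, size_t y_len,
                          const uint8_t *expected, size_t expected_len) {
    uint8_t buf[16];
    assert(x_len + y_len <= sizeof buf);
    memcpy(buf, x, x_len);
    memcpy(buf + x_len, y, y_len);
    return x_len + y_len == expected_len
        && memcmp(buf, expected, expected_len) == 0;
}
```

Calling `concat_equals(X, sizeof X, Y, sizeof Y, NFC_OF_XY, sizeof NFC_OF_XY)` returns `false`: the naive concatenation is three bytes long, while the NFC form of the combined text is two.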
xxUTF thus has special APIs so that streaming normalization can be implemented in a non-allocating, efficient way. To see these APIs being used to implement streaming normalization, read the xxu program source code.
The APIs are of the form:
    size_t xxutf_find_last_stable_ENCODING_FORM(const char *input, size_t length);
    size_t xxutf_find_first_stable_ENCODING_FORM(const char *input, size_t length);
Here is a simple visual that describes these functions' purposes:
[.........*.........] ++ [....*.........]
Where ++ represents concatenation, and * denotes "stable" code points (the
actual definition of "stable" is somewhat involved). To properly concatenate
these normalized buffers, we must:
1. Naively concatenate the two buffers
2. Form a range using the last stable code point of the first buffer and the first stable code point in the second buffer (use streaming APIs for this)
3. Re-normalize the range found from step 2, in place
Performing these steps without copying large buffers around can be complicated to figure out, so you are encouraged to read the xxu source code.
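As a rough illustration of how the range to re-normalize might be computed, here is a sketch of the offset arithmetic only. It assumes that the two streaming functions return the byte offset of the stable code point within their own input, which this document does not confirm, and that the range is half-open. The `byte_range` type and `renormalize_range` helper are hypothetical and not part of xxUTF.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open byte range [start, end) within the concatenated buffer. */
typedef struct {
    size_t start;
    size_t end;
} byte_range;

/* Hypothetical helper: given the length of the first buffer and the two
 * offsets (assumed to be byte offsets relative to each buffer's start),
 * returns the span of the concatenated buffer that needs re-normalizing. */
static byte_range renormalize_range(size_t first_len,
                                    size_t last_stable_in_first,
                                    size_t first_stable_in_second) {
    assert(last_stable_in_first <= first_len);
    byte_range r;
    r.start = last_stable_in_first;
    /* The second buffer begins at first_len after concatenation. */
    r.end = first_len + first_stable_in_second;
    return r;
}
```

For the visual above, treating each character as one byte, the first buffer is 19 bytes long with its last stable code point at offset 9, and the second buffer has its first stable code point at offset 4, so the range to re-normalize would be [9, 23).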
xxUTF also provides the xxu tool, which puts the speed of the xxUTF library
onto the command line. You can download the tool from the
releases page, or
build it from source.
Example xxu usage:
xxu -x casefold file.txt
xxUTF is benchmarked using a variety of large real-world inputs from multiple languages. As there are many factors to consider during benchmarking, curious users are encouraged to run the benchmark suite (or write their own benchmarks) on their machines.
These are the results for running NFD normalization on UTF-8 on a machine supporting ARM NEON. Inputs vary in size and complexity, so cross-input comparison is not meaningful here.
Benchmarks are compared against the ICU4C library, as ICU4C currently has the next-fastest open source implementations of these algorithms.
xxUTF is built with the Zig build system, version 0.16.0.
To build the project in release mode, run zig build -Doptimize=ReleaseFast.
The following artifacts will be created:
- zig-out/lib/libxxutf.a: a static library defining the xxUTF functions
- zig-out/include/xxutf.h: the xxUTF header file
- zig-out/bin/xxu: the xxu tool
To create the amalgamation file, run zig build amalgamate. It will be put in
zig-out/amalgamation.c.
Running the benchmarks is as simple as running zig build bench. But note that
ICU4C must be installed on your system. Make sure that the installed ICU4C version
matches the Unicode version that xxUTF is built for.
For all available build options, run zig build --help.
xxUTF is fuzz tested to improve correctness and safety. To look into the exact fuzzing setup, see the relevant README file.
xxUTF is ready for use as a library, but as it is in an alpha state, there is much to be done, such as:
- Support more SIMD instruction sets. Although not many instruction sets are supported right now, the good news is that xxUTF is still the fastest implementation even when in scalar mode.
- Double check possible security vulnerabilities
- Add a few helper functions to the API (functions to check if strings are in a certain normalization form, check string equality in normalized forms, Turkic casefold, etc.)
xxUTF is licensed under the MIT license.