
xxUTF

xxUTF is a C library that implements Unicode text transformation algorithms at speed using SIMD. Current algorithms supported:

Normalization (NFD, NFC, NFKD, NFKC)
Case folding

All algorithms are compatible with UTF-8, UTF-16LE, and UTF-16BE. Further helper functions are defined for efficient and correct streaming versions of these algorithms. See the API for details.

xxUTF never allocates memory, does not depend on libc, cannot fail, and has the fastest open source implementations of the listed algorithms available. All functions are comprehensively tested using both the available Unicode test suites and a fuzzer.

xxUTF supports Unicode 16.0.0 and below.

Usage

xxUTF is distributed as an amalgamation with a single header file, available on the releases page. This is similar to how SQLite is distributed.

Example C program:

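#include <xxutf.h>
#include <string.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
int main(void) {
 const char *s = "Ȁ character that needs to be decomposed";
 size_t length = strlen(s);
 printf("Old length: %zu\n", length);
 // The decomposed form of this short input fits comfortably in 64 bytes
 uint8_t out[64];
 // The API takes byte pointers, so cast the C string accordingly
 size_t out_len = xxutf_normalize_utf8_nfd((const uint8_t *)s, length, out);
 printf("New length: %zu\n", out_len);
 return 0;
}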

One major goal of xxUTF is to have the simplest, most predictable API surface possible. As such, one function call usually suffices for the core functionality.

API

xxUTF's core API follows this pattern:

/// Normalize the Unicode text bytes to the given normalization form, returning the
/// length of the output. All lengths are measured in bytes. The input is expected
/// to be valid under the specified encoding.
///
/// It is assumed that the output buffer is large enough to fit the full normalized
/// form of the input. The encoding of the output will match the encoding of the
/// input. Note that, regardless of whether the input is encoded in UTF-16 or UTF-8, the
/// input is still a byte pointer. xxUTF does not require the input to be aligned,
/// and the performance difference is marginal even if it is.
size_t xxutf_normalize_ENCODING_FORM(const uint8_t *input, size_t length, uint8_t *out);
/// Check if the input is already in the specified form, returning `true` if so.
/// Additionally, the `out_length` out parameter is set to the size of an output
/// buffer needed to hold the normalized form of the input. The input is expected
/// to be valid under the specified encoding.
///
/// Note that for NFC and NFKC, `out_length` is actually an upper bound calculated
/// from the input, not the exact size. This is the case for speed reasons. All lengths
/// are measured in bytes.
bool xxutf_normalize_ENCODING_FORM_check(const uint8_t *input, size_t length, size_t *out_length);
/// Case fold the Unicode text bytes, returning the length of the output. All
/// lengths are measured in bytes. The input is expected to be valid under the
/// specified encoding.
///
/// It is assumed that the output buffer is large enough to fit the full case folded
/// form of the input. The encoding of the output will match the encoding of the input.
/// Note that, regardless of whether the input is encoded in UTF-16 or UTF-8, the input is
/// still a byte pointer. xxUTF does not require the input to be aligned, and the
/// performance difference is marginal even if it is.
size_t xxutf_casefold_ENCODING(const uint8_t *input, size_t length, uint8_t *out);
/// Check if the input is already case folded, returning `true` if so. Additionally,
/// the `out_length` out parameter is set to the size of an output buffer needed to
/// hold the case folded form of the input. The input is expected to be valid under the
/// specified encoding.
bool xxutf_casefold_ENCODING_check(const uint8_t *input, size_t length, size_t *out_length);

The valid encodings are:

utf8
utf16le
utf16be
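
Because the UTF-16 variants also take a byte pointer, text held as 16-bit code units has to be presented as bytes in the matching byte order. The following is a minimal sketch of one way to do that for utf16le. The helper name `utf16_units_to_le_bytes` is hypothetical and not part of xxUTF.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of xxUTF: serialize UTF-16 code units into
 * little-endian bytes regardless of the host's byte order.
 * `out` must have room for 2 * count bytes. Returns the bytes written. */
size_t utf16_units_to_le_bytes(const uint16_t *units, size_t count, uint8_t *out) {
    for (size_t i = 0; i < count; i++) {
        out[2 * i]     = (uint8_t)(units[i] & 0xFF); /* low byte first */
        out[2 * i + 1] = (uint8_t)(units[i] >> 8);   /* then the high byte */
    }
    return 2 * count;
}
```

The resulting byte buffer and its byte length are what a utf16le function would receive. For utf16be, the two byte stores would be swapped.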

The valid normalization forms are:

nfd
nfc
nfkd
nfkc

For example:

size_t out_length_bound;
bool check = xxutf_normalize_utf8_nfkc_check(input, length, &out_length_bound);
if (check) {
 printf("Already NFKC normalized!\n");
} else {
 // Allocate according to `out_length_bound`, plus one byte for the terminator
 uint8_t *out = malloc(out_length_bound + 1);
 size_t out_length = xxutf_normalize_utf8_nfkc(input, length, out);
 out[out_length] = '\0';
 // UTF-8 is used here so that the result can be printed as a C string
 printf("NFKC normalized to: '%s'\n", (const char *)out);
 free(out);
}

Like many Unicode processing libraries, xxUTF supports a two-pass pattern:

1. Get the expected length of the output without writing.
2. Actually run the algorithm on a properly sized output buffer.

You might wonder why we need the length functions. After all, finding a universal upper bound that depends only on the size of the input is not hard. But using such a bound is often wasteful. For example, as of Unicode 16.0, the largest compatibility decomposition (i.e. from NFKD or NFKC) in the Basic Multilingual Plane is that of the character U+FDFA. This character is three bytes wide in UTF-8 before decomposition, but expands to 33 bytes afterwards. The best upper bound for NFKD and NFKC in UTF-8 would thus be some number around n * 11, where n is the size of the input in bytes.

So, unless a lot of prior information is known about the incoming text, use the length functions to make sure buffer overflows don't happen.
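
To make the cost of the universal bound concrete, here is a minimal sketch that computes it with an overflow check. The factor of 11 is taken from the U+FDFA example above and is an assumption of this sketch, not a constant exported by xxUTF. The name `worst_case_bound` is likewise hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed expansion factor, from the U+FDFA example: 3 bytes -> 33 bytes. */
#define WORST_CASE_FACTOR 11

/* Hypothetical helper, not part of xxUTF: store input_len * 11 in *bound.
 * Returns false (leaving *bound untouched) if the product would overflow. */
bool worst_case_bound(size_t input_len, size_t *bound) {
    if (input_len > SIZE_MAX / WORST_CASE_FACTOR) {
        return false;
    }
    *bound = input_len * WORST_CASE_FACTOR;
    return true;
}
```

Under this assumption, a 1 MiB input would reserve 11 MiB of output space, even though most real-world text grows only slightly or not at all when normalized.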

Streaming

Streaming versions of some Unicode algorithms, such as case folding, can be implemented naively. However, not all algorithms have such nice properties.

Mainly, normalizing text in a streaming manner requires some care. The problem is that normalization forms are not closed under string concatenation. In other words:

normalize(x) + normalize(y) = normalize(x + y)

does not hold for all Unicode strings x and y. Read the Unicode normalization specification for more details.
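
A small counterexample may help. Take x = "a" (U+0061) and y = U+0301 (combining acute accent). Each is already in NFC on its own, but the NFC form of x + y is the single precomposed code point U+00E1. The sketch below hard-codes the UTF-8 bytes for these strings and compares them. It does not call xxUTF, and the function name `concat_breaks_nfc` is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* x = "a" (U+0061) and y = U+0301; each is already in NFC by itself. */
static const unsigned char X_NFC[] = {0x61};
static const unsigned char Y_NFC[] = {0xCC, 0x81};
/* NFC of the concatenation x + y: the precomposed U+00E1. */
static const unsigned char XY_NFC[] = {0xC3, 0xA1};

/* Returns true when normalize(x) + normalize(y) differs from normalize(x + y). */
bool concat_breaks_nfc(void) {
    unsigned char joined[sizeof X_NFC + sizeof Y_NFC];
    memcpy(joined, X_NFC, sizeof X_NFC);
    memcpy(joined + sizeof X_NFC, Y_NFC, sizeof Y_NFC);
    return sizeof joined != sizeof XY_NFC ||
           memcmp(joined, XY_NFC, sizeof XY_NFC) != 0;
}
```

Here the naive concatenation is three bytes long while the correctly normalized result is two, so a streaming normalizer that splits its input between the two code points must revisit the seam.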

xxUTF thus has special APIs so that streaming normalization can be implemented in a non-allocating, efficient way. To see these APIs being used to implement streaming normalization, read the xxu program source code.

The APIs are of the form:

size_t xxutf_find_last_stable_ENCODING_FORM(const char *input, size_t length);
size_t xxutf_find_first_stable_ENCODING_FORM(const char *input, size_t length);

Here is a simple visual that describes these functions' purposes:

[.........*.........] ++ [....*.........]

Where ++ represents concatenation, and * denotes "stable" code points (the actual definition of "stable" is somewhat involved). To properly concatenate these normalized buffers, we must:

1. Naively concatenate the two buffers
2. Form a range using the last stable code point of the first buffer and the first stable code point in the second buffer (use streaming APIs for this)
3. Re-normalize the range found from step 2, in place
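
The index arithmetic in step 2 can be sketched as follows. This assumes that the streaming functions return byte offsets into their own buffers, which the text above does not confirm, and that the range runs from the last stable code point of the first buffer up to, but not including, the first stable code point of the second. The type `seam_range` and the function `find_seam` are hypothetical names and not part of xxUTF.

```c
#include <stddef.h>

/* Half-open byte range [start, end) within the concatenated buffer. */
typedef struct {
    size_t start;
    size_t end;
} seam_range;

/* Hypothetical helper, not part of xxUTF.
 * len_a:          byte length of the first buffer
 * last_stable_a:  assumed byte offset of the last stable code point in the first buffer
 * first_stable_b: assumed byte offset of the first stable code point in the second buffer */
seam_range find_seam(size_t len_a, size_t last_stable_a, size_t first_stable_b) {
    seam_range r;
    r.start = last_stable_a;         /* already an offset in the joined buffer */
    r.end = len_a + first_stable_b;  /* shift by the first buffer's length */
    return r;
}
```

Only the bytes inside this range need to be re-normalized. Since normalization can change their length, the bytes after `end` may have to be shifted, which is part of what makes the in-place version tricky.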

Performing these steps without copying large buffers around can be complicated to figure out, so you are encouraged to read the xxu source code.

xxu

xxUTF also provides the xxu tool, which puts the speed of the xxUTF library onto the command line. You can download the tool from the releases page, or build it from source.

Example xxu usage:

xxu -x casefold file.txt

Benchmarks

xxUTF is benchmarked using a variety of large real-world inputs from multiple languages. As there are many factors to consider during benchmarking, curious users are encouraged to run the benchmark suite (or write their own benchmarks) on their machines.

These are the results for running NFD normalization on UTF-8 on a machine supporting ARM NEON. Inputs vary in size and complexity, so cross-input comparison is not meaningful here.

Benchmarks are compared against the ICU4C library, as ICU4C currently has the next-fastest open source implementations of these algorithms.

Building

xxUTF is built with the Zig build system, version 0.16.0.

To build the project in release mode, run zig build -Doptimize=ReleaseFast. The following artifacts will be created:

zig-out/lib/libxxutf.a: a static library defining the xxUTF functions
zig-out/include/xxutf.h: the xxUTF header file
zig-out/bin/xxu: the xxu tool

To create the amalgamation file, run zig build amalgamate. It will be put in zig-out/amalgamation.c.

Running the benchmarks is as simple as running zig build bench. Note, however, that ICU4C must be installed on your system. Make sure that the installed ICU4C version matches the Unicode version that xxUTF is built for.

For all available build options, run zig build --help.

Fuzzing

xxUTF is fuzz tested to improve correctness and safety. To look into the exact fuzzing setup, see the relevant README file.

State of the Project

xxUTF is ready for use as a library, but as it is in an alpha state, there is much to be done, such as:

Support more SIMD instruction sets. Although not many instruction sets are supported right now, the good news is that xxUTF is still the fastest implementation even when in scalar mode.
Double check possible security vulnerabilities
Add a few helper functions to the API (functions to check if strings are in a certain normalization form, check string equality in normalized forms, Turkic casefold, etc.)

License

xxUTF is licensed under the MIT license.
