The benchmarks in this repository don't aim to cover every topic entirely, but they help form a mindset and intuition for performance-oriented software design. The repository also provides an example of using some non-STL but de facto standard libraries in C++, importing them via CMake and compiling from source. For higher-level abstractions and languages, check out less_slow.rs and less_slow.py.
I needed many of these measurements to reconsider my own coding habits, but hopefully they're helpful to others as well. Most of the code is organized in very long, ordered, and nested #pragma sections, which is not necessarily the preferred form for everyone.
Much of modern code suffers from common pitfalls: bugs, security vulnerabilities, and performance bottlenecks. University curricula and coding bootcamps tend to stick to traditional coding styles and standard features, rarely exposing the more fun, unusual, and potentially efficient design opportunities. This repository explores just that.
The code leverages C++20 and CUDA features and is designed primarily for GCC, Clang, and NVCC compilers on Linux, though it may work on other platforms. The topics range from basic micro-kernels executing in a few nanoseconds to more complex constructs involving parallel algorithms, coroutines, and polymorphism. Some of the highlights include:
- 100x cheaper random inputs?! Discover how input generation sometimes costs more than the algorithm.
- 1% error in trigonometry at 1/40 cost: Approximate STL functions like std::sin in just 3 lines of code.
- 4x faster lazy-logic with custom std::ranges and iterators!
- Compiler optimizations beyond -O3: Learn about less obvious flags and techniques for another 2x speedup.
- Multiplying matrices? Check how a 3x3x3 GEMM can be 70% slower than 4x4x4, despite 60% fewer ops.
- Scaling AI? Measure the gap between theoretical ALU throughput and your BLAS.
- How many if conditions are too many? Test your CPU's branch predictor with just 10 lines of code.
- Prefer recursion to iteration? Measure the depth at which your algorithm will SEGFAULT.
- Why avoid exceptions? Take std::error_code or std::variant-like ADTs?
- Scaling to many cores? Learn how to use OpenMP, Intel's oneTBB, or your custom thread pool.
- How to handle JSON while avoiding memory allocations? Is it easier with C++20 or old-school C99 tools?
- How to properly use STL's associative containers with custom keys and transparent comparators?
- How to beat a hand-written parser with consteval RegEx engines?
- Is the pointer size really 64 bits and how to exploit pointer-tagging?
- How many packets is UDP dropping and how to serve web requests in io_uring from user-space?
- Scatter and Gather for 50% faster vectorized disjoint memory operations.
- Intel's oneAPI vs Nvidia's CCCL? What's so special about <thrust> and <cub>?
- CUDA C++, PTX Intermediate Representations, and SASS, and how do they differ from CPU code?
- How to choose between intrinsics, inline asm, and separate .S files for your performance-critical code?
- Tensor Cores & Memory differences on CPUs, and Volta, Ampere, Hopper, and Blackwell GPUs!
- How coding FPGA differs from GPU and what is High-Level Synthesis, Verilog, and VHDL? (#36)
- What are Encrypted Enclaves and what's the latency of Intel SGX, AMD SEV, and ARM Realm? (#31)
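As an illustration of the trigonometry highlight above, one plausible reading of "3 lines of code" is a truncated Maclaurin series. The sketch below is an assumption about the general technique, not code copied from less_slow.cpp; the names `sin_maclaurin` and `max_abs_error` are hypothetical, and it only targets inputs in [-pi/2, pi/2], where three terms stay within roughly half a percent of std::sin. It says nothing about speed; measure that on your own hardware.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Three-term Maclaurin series: sin(x) ~ x - x^3/3! + x^5/5!.
// Accuracy degrades quickly away from zero; inputs are assumed
// to lie in [-pi/2, pi/2] (hypothetical helper, not from the repository).
inline double sin_maclaurin(double x) noexcept {
    double const x3 = x * x * x;
    double const x5 = x3 * x * x;
    return x - x3 / 6.0 + x5 / 120.0;
}

// Sweeps [-pi/2, pi/2] in equal steps and returns the largest
// absolute deviation from std::sin seen along the way.
inline double max_abs_error(int steps = 1000) {
    assert(steps > 0);
    double const half_pi = std::acos(0.0);
    double worst = 0.0;
    for (int i = 0; i <= steps; ++i) {
        double const x = -half_pi + 2.0 * half_pi * i / steps;
        worst = std::max(worst, std::abs(sin_maclaurin(x) - std::sin(x)));
    }
    return worst;
}
```

Calling `max_abs_error()` before swapping such an approximation into real code is a cheap way to check that the error budget holds for the input range you actually care about.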
To read, jump to the less_slow.cpp source file and read the code snippets and comments.
Keep in mind that most modern IDEs have a navigation bar to help you view and jump between #pragma region sections.
Follow the instructions below to run the code in your environment and compare it to the comments as you read through the source.
The project aims to be compatible with GCC, Clang, and MSVC compilers on Linux, MacOS, and Windows. That said, to cover the broadest functionality, using GCC on Linux is recommended:
- If you are on Windows, it's recommended that you set up a Linux environment using WSL.
- If you are on MacOS, consider using the non-native distribution of Clang from Homebrew or MacPorts.
- If you are on Linux, make sure to install CMake and a recent version of GCC or Clang compilers to support C++20 features.
If you are familiar with C++ and want to review code and measurements as you read, you can clone the repository and execute the following commands.
git clone https://github.com/ashvardanian/less_slow.cpp.git # Clone the repository
cd less_slow.cpp                                            # Change the directory
pip install cmake --upgrade                                 # PyPI has a newer version of CMake
sudo apt-get install -y build-essential g++                 # Install default build tools
sudo apt-get install -y pkg-config liburing-dev             # Install liburing for kernel-bypass
sudo apt-get install -y libopenblas-base                    # Install numerics libraries
cmake -B build_release -D CMAKE_BUILD_TYPE=Release          # Generate the build files
cmake --build build_release --config Release                # Build the project
build_release/less_slow                                     # Run the benchmarks
The build will pull and compile several third-party dependencies from the source:
- Google's Benchmark is used for profiling.
- Intel's oneTBB is used as the Parallel STL backend.
- Meta's libunifex is used for senders & executors.
- Eric Niebler's range-v3 replaces std::ranges.
- Victor Zverovich's fmt replaces std::format.
- Ash Vardanian's StringZilla replaces std::string.
- Hana Dusíková's CTRE replaces std::regex.
- Niels Lohmann's json is used for JSON deserialization.
- Yaoyuan Guo's yyjson for faster JSON processing.
- Google's Abseil replaces STL's associative containers.
- Lewis Baker's cppcoro implements C++20 coroutines.
- Jens Axboe's liburing to simplify Linux kernel-bypass.
- Chris Kohlhoff's ASIO as a networking TS extension.
- Nvidia's CCCL for GPU-accelerated algorithms.
- Nvidia's CUTLASS for GPU-accelerated Linear Algebra.
To build without Parallel STL, Intel TBB, BLAS, and CUDA:
cmake -B build_release -D CMAKE_BUILD_TYPE=Release -D USE_INTEL_TBB=OFF -D USE_NVIDIA_CCCL=OFF -D USE_BLAS=OFF
cmake --build build_release --config Release
To build on MacOS, pulling key dependencies from Homebrew:
brew install openblas
cmake -B build_release \
    -D CMAKE_BUILD_TYPE=Release \
    -D CMAKE_C_FLAGS="-I$(brew --prefix openblas)/include" \
    -D CMAKE_CXX_FLAGS="-I$(brew --prefix openblas)/include" \
    -D CMAKE_EXE_LINKER_FLAGS="-L$(brew --prefix openblas)/lib"
cmake --build build_release --config Release
To control the output or run specific benchmarks, use the following flags:
build_release/less_slow --benchmark_format=json      # Output in JSON format
build_release/less_slow --benchmark_out=results.json # Save the results to a file instead of `stdout`
build_release/less_slow --benchmark_filter=std_sort  # Run only benchmarks containing `std_sort` in their name
To enhance stability and reproducibility, disable Simultaneous Multi-Threading (SMT) on your CPU and use the --benchmark_enable_random_interleaving=true flag, which shuffles and interleaves benchmarks as described in Google Benchmark's documentation on random interleaving.
build_release/less_slow --benchmark_enable_random_interleaving=true
Google Benchmark supports User-Requested Performance Counters through libpfm.
Note that collecting these may require sudo privileges.
sudo build_release/less_slow --benchmark_enable_random_interleaving=true --benchmark_format=json --benchmark_perf_counters="CYCLES,INSTRUCTIONS"
Alternatively, use the Linux perf tool for performance counter collection:
sudo perf stat taskset 0xEFFFEFFFEFFFEFFFEFFFEFFFEFFFEFFF build_release/less_slow --benchmark_enable_random_interleaving=true --benchmark_filter=super_sort
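The long hexadecimal argument to taskset is an affinity bitmap, and it can be hard to read at a glance. The sketch below decodes such a mask under the usual taskset convention that the lowest-order bit corresponds to the first logical CPU. The helper name `excluded_cpus` is hypothetical and exists only for this illustration.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Returns the indices of logical CPUs whose bit is CLEAR in a hexadecimal
// affinity mask. The rightmost hex digit covers CPUs 0..3, and the lowest
// bit of that digit maps to CPU 0. An optional "0x" prefix is accepted.
inline std::vector<int> excluded_cpus(std::string mask) {
    if (mask.rfind("0x", 0) == 0 || mask.rfind("0X", 0) == 0) mask.erase(0, 2);
    std::vector<int> excluded;
    int const digits = static_cast<int>(mask.size());
    for (int i = 0; i < digits; ++i) {
        // Walk from the least significant digit towards the most significant.
        unsigned char const c = static_cast<unsigned char>(mask[digits - 1 - i]);
        assert(std::isxdigit(c));
        int const nibble = std::isdigit(c) ? c - '0' : std::tolower(c) - 'a' + 10;
        for (int bit = 0; bit < 4; ++bit)
            if (!(nibble & (1 << bit))) excluded.push_back(i * 4 + bit);
    }
    return excluded;
}
```

For the mask used above, each 16-bit group is 0xEFFF, so bit 12 of every group is clear. On a machine with 128 logical CPUs that would keep the benchmark off CPUs 12, 28, 44, 60, 76, 92, 108, and 124; adapt the mask to your own topology.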
The primary file of this repository is clearly the less_slow.cpp C++ file with CPU-side code.
Several other files cover different hardware-specific optimizations:
$ tree .
.
├── CMakeLists.txt      # Build & assembly instructions for all files
├── less_slow.cpp       # Primary CPU-side benchmarking code with the majority of examples
├── less_slow_amd64.S   # Hand-written Assembly kernels for 64-bit x86 CPUs
├── less_slow_aarch64.S # Hand-written Assembly kernels for 64-bit Arm CPUs
├── less_slow.cu        # CUDA C++ examples for parallel algorithms for Nvidia GPUs
├── less_slow_sm70.ptx  # Hand-written PTX IR kernels for Nvidia Volta GPUs
└── less_slow_sm90a.ptx # Hand-written PTX IR kernels for Nvidia Hopper GPUs
Educational content without memes?! Come on!
This benchmark suite uses most of the features provided by Google Benchmark. If you write a lot of benchmarks and want to avoid digging through the full User Guide, here is a condensed list of the most useful features:
- ->Args({x, y}) - Pass multiple arguments to parameterized benchmarks
- BENCHMARK() - Register a basic benchmark function
- BENCHMARK_CAPTURE() - Create variants of benchmarks with different captured values
- Counter::kAvgThreads - Specify thread-averaged counters
- DoNotOptimize() - Prevent compiler from optimizing away operations
- ClobberMemory() - Force memory synchronization
- ->Complexity(oNLogN) - Specify and validate algorithmic complexity
- ->SetComplexityN(n) - Set input size for complexity calculations
- ->ComputeStatistics("max", ...) - Calculate custom statistics across runs
- ->Iterations(n) - Control exact number of iterations
- ->MinTime(n) - Set minimum benchmark duration
- ->MinWarmUpTime(n) - To warm up the data caches
- ->Name("...") - Assign custom benchmark names
- ->Range(start, end) - Profile for a range of input sizes
- ->RangeMultiplier(n) - Set multiplier between range values
- ->ReportAggregatesOnly() - Show only aggregated statistics
- state.counters["name"] - Create custom performance counters
- state.PauseTiming(), ResumeTiming() - Control timing measurement
- state.SetBytesProcessed(n) - Record number of bytes processed
- state.SkipWithError() - Skip benchmark with error message
- ->Threads(n) - Run benchmark with specified number of threads
- ->Unit(kMicrosecond) - Set time unit for reporting
- ->UseRealTime() - Measure real time instead of CPU time
- ->UseManualTime() - To feed custom timings for GPU and IO benchmarks
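To make the purpose of a few entries above more concrete, here is a standard-library-only sketch of what a harness does conceptually behind a fixed iteration count, an optimization barrier, and wall-clock timing. It is not Google Benchmark's implementation or API: `keep_alive`, `time_it`, and `timing_result` are hypothetical names, and the volatile sink is just one simple and portable way to stop the compiler from discarding an unused result.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical stand-in for an optimization barrier: a store to a volatile
// object is an observable side effect, so the value must really be computed.
template <typename value_type>
inline void keep_alive(value_type const &value) {
    static volatile value_type sink;
    sink = value;
}

// What a single benchmark run reports in this sketch.
struct timing_result {
    std::size_t iterations;
    double nanoseconds_per_iteration;
};

// Calls `callable` exactly `iterations` times and measures wall-clock time
// with a monotonic clock, similar in spirit to fixing the iteration count
// and asking for real time instead of CPU time.
template <typename callable_type>
timing_result time_it(callable_type &&callable, std::size_t iterations) {
    assert(iterations > 0);
    using clock = std::chrono::steady_clock;
    auto const start = clock::now();
    for (std::size_t i = 0; i != iterations; ++i) keep_alive(callable());
    auto const stop = clock::now();
    std::chrono::duration<double, std::nano> const elapsed = stop - start;
    return {iterations, elapsed.count() / static_cast<double>(iterations)};
}
```

A call such as `time_it([] { return 2 + 2; }, 1000000)` returns the average cost per call in nanoseconds. A real harness adds warm-up, statistics, and calibration of the iteration count, which is why the library features listed above are worth learning.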