
News: HipKittens unleashes a tile-based programming abstraction layer for AMD GPUs


marees

AMD GPUs offer state-of-the-art compute and memory bandwidth; however, peak performance AMD kernels are written in raw assembly.

To address the difficulty of mapping AI algorithms to hardware, recent work proposes C++-embedded, PyTorch-inspired domain-specific languages like ThunderKittens (TK) to simplify high-performance AI kernel development on NVIDIA hardware.


We explore the extent to which such primitives — for explicit tile-based programming with optimized memory accesses and fine-grained asynchronous execution across workers — are NVIDIA-specific or general.

We provide the first detailed study of the programming primitives that lead to performant AMD AI kernels, and we encapsulate these insights in the HipKittens (HK) programming framework. We find that the tile-based abstractions used in prior DSLs generalize to AMD GPUs; however, we need to rethink the algorithms that instantiate these abstractions for AMD.

We validate the HK primitives across CDNA3 and CDNA4 AMD platforms. In evaluations, HK kernels compete with AMD’s hand-optimized assembly kernels for GEMMs and attention, and consistently outperform compiler baselines.

Moreover, assembly is difficult to scale to the breadth of AI workloads; reflecting this, in some settings HK outperforms all available kernel baselines by 1.2-2.4× (e.g., d = 64 attention, GQA backwards, memory-bound kernels).

These findings help pave the way for a single, tile-based software layer for high-performance AI kernels that translates across GPU vendors.

HipKittens is released at: https://github.com/HazyResearch/HipKittens.


marees

In May 2024, we shared ThunderKittens (TK), and have been excited to see its ideas used in a wave of frameworks this year, like the CuTe DSL in Sept 2025, tinygrad's "tinykittens", TileLang in April 2025, and Gluon in June 2025. It’s been fun to see TK used in companies like Together AI, Jump Trading, and Cursor, and in academic research.

So then we were curious whether entirely new programming primitives are needed to simplify AMD kernel development, or whether existing primitives suffice. It wasn't obvious to us where this exploration would end up; most modern kernels are designed around NVIDIA-specific hardware features. AMD hardware differs meaningfully (no wgmma/tcgen05 pipelined async matmuls, no TMA, no mbarriers, no register reallocation, smaller shared memory, chiplets instead of a monolithic die, etc.), and we weren't sure where performance would end up or how different the primitives might look compared to NVIDIA frameworks.

Our exploration resulted in HipKittens, a minimal, opinionated collection of C++ embedded programming primitives for fast AMD kernels.


The key insight is separating the interface (tiles and operations on tiles) from the implementation (how tiles map to hardware), allowing the same high-level programming model to target diverse GPU architectures.
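To make that separation concrete, here is a tiny CPU-only C++ sketch of the idea. It is illustrative only: the names (tile, zero, mma_AB) echo the ThunderKittens/HipKittens style but are not the actual HK API, and the loop bodies stand in for the architecture-specific code a real backend would emit.

```cpp
// Conceptual sketch, not the HipKittens API: user code sees only a tile type and
// bulk operations over tiles; the bodies of those operations are the piece a
// backend would specialize (MFMA intrinsics on CDNA, tensor-core MMA on NVIDIA).
#include <array>
#include <cstdio>

template <typename T, int M, int N>
struct tile {                                  // the interface: a fixed-size 2D block
    std::array<T, M * N> data{};
    T&       at(int r, int c)       { return data[r * N + c]; }
    const T& at(int r, int c) const { return data[r * N + c]; }
};

// Bulk operators over tiles; here they are plain loops so the example runs anywhere.
template <typename T, int M, int N>
void zero(tile<T, M, N>& t) { t.data.fill(T{0}); }

template <typename T, int M, int K, int N>
void mma_AB(tile<T, M, N>& c, const tile<T, M, K>& a, const tile<T, K, N>& b) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < K; ++k)
                c.at(i, j) += a.at(i, k) * b.at(k, j);
}

int main() {
    tile<float, 16, 16> a, b, c;
    for (int i = 0; i < 16; ++i)
        for (int j = 0; j < 16; ++j) { a.at(i, j) = 1.0f; b.at(i, j) = 2.0f; }
    zero(c);
    mma_AB(c, a, b);                           // C += A @ B, expressed at tile granularity
    std::printf("c[0][0] = %.1f\n", c.at(0, 0));
    return 0;
}
```

The kernel author only writes the last few lines of logic; how a 16x16 tile is distributed across a wave's registers is the backend's problem, which is what lets the same surface syntax target different vendors.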



We find:
  1. The tile abstraction generalizes across architectures. The core tile-based primitives we identified as effective on NVIDIA GPUs—including tile types, PyTorch-like bulk compute operators over tiles, and composable load/store interfaces—translate naturally to AMD.
  2. Backend implementations are architecture-specific. The underlying memory access patterns (e.g., swizzling schemes, register scheduling) that realize the tile interface differ between AMD and NVIDIA due to hardware differences (a generic swizzling sketch appears below).
  3. Scheduling strategies adapt to hardware constraints. The scheduling patterns both within a processor and across processors differ on AMD compared to NVIDIA, reflecting fundamental architectural distinctions. Wave specialization underperforms on CDNA3 and CDNA4. However, reasoning about schedules at the tile granularity—rather than at the level of individual registers or memory transactions—continues to simplify development, maintain code readability, and enable peak performance.
Ultimately, we see that tile-based abstractions remain general across architectures, providing evidence that a unified, performant programming model for AI accelerators is achievable.
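As a generic illustration of what "architecture-specific memory access patterns" means in point 2 above, here is a textbook XOR swizzle for laying tile rows into shared memory so that column-wise accesses spread across banks. This is a standard technique sketched under simple assumptions (8-element chunks, 64-element rows); it is not necessarily the exact swizzle HipKittens uses on CDNA3/CDNA4.

```cpp
// Generic XOR swizzle sketch: the chunk index of each 8-element group is XOR-ed
// with the low bits of the row, so consecutive rows place "column 0" in different
// bank groups and column-wise reads do not serialize on a single bank.
#include <cstdio>

// Map a (row, col) element of a 64-element-wide tile row to a shared-memory offset.
int swizzled_offset(int row, int col, int row_stride_elems = 64) {
    int chunk    = col / 8;                  // which 8-element (16-byte bf16) chunk
    int within   = col % 8;                  // element inside the chunk
    int swizzled = chunk ^ (row % 8);        // rotate chunks across banks per row
    return row * row_stride_elems + swizzled * 8 + within;
}

int main() {
    // Without swizzling, column 0 of every row would land in the same chunk slot;
    // with it, the first 8 rows each land in a different chunk.
    for (int r = 0; r < 8; ++r)
        std::printf("row %d, col 0 -> offset %3d (chunk %d)\n",
                    r, swizzled_offset(r, 0), swizzled_offset(r, 0) % 64 / 8);
    return 0;
}
```

Which chunk width and XOR mask are right depends on bank width, vector load size, and the matrix-core fragment layout, which is precisely the part that differs between CDNA and NVIDIA backends.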


https://hazyresearch.stanford.edu/blog/2025-11-09-hk

marees

Wave specialization struggles on AMD. Wave specialization is the dominant paradigm for achieving high occupancy on modern NVIDIA GPUs: producer waves focus on memory movement while consumer waves focus on computation. This strategy underpins today’s state-of-the-art AI kernels—including FlashAttention-3, COMET for MoE models, and high-performance GEMMs—as well as kernel DSLs such as ThunderKittens and TileLang.

But, we show that wave specialization underperforms on AMD due to the lack of register reallocation. On the MI355X, registers are statically divided across all waves. Producer waves that only need a few registers for address calculation are allocated more registers than they need; consumer waves cannot recoup those registers and must either spill registers to scratch memory or run at a lower arithmetic intensity. Both are disastrous for performance. Wave specialization limits the output tile size and makes our kernels more memory bound. For GEMMs, data loaded from memory is O(MK + NK) while compute is O(MNK). Decreasing the M or N in our per-thread-block output tile size lowers arithmetic intensity.[2]

# P / # C | MFMA Shape | Output  | TFLOPS
HK 4 / 8  | ×ばつ32   | ×ばつ256  | 893
HK 4 / 12 | ×ばつ32   | ×ばつ256  | 1278
HK 0 / 8  | ×ばつ32   | ×ばつ256 | 1281
HK 0 / 8  | ×ばつ32   | ×ばつ256 | 1605
TK        | ×ばつ16   | ×ばつ256 | 1538
CUTLASS   | ×ばつ16   | ×ばつ256 | 1570
Figure: Wave specialization underperforms on AMD GPUs. We benchmark AMD GEMMs on the MI355X using different numbers of producers (P) and consumer (C) waves. We report the matrix core intrinsic shape, output tile size computed per thread block, and TFLOPS (500 iterations warmup / 100 iterations measured). The CUTLASS GEMM is selected and tuned using the CUTLASS profiler tool on a B200 GPU.
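To make the arithmetic-intensity argument concrete, here is a small self-contained calculation. The tile shapes are taken from the table above; the accounting (2MNK FLOPs against (MK + NK) loaded elements per output tile) is the standard GEMM roofline argument, not anything HK-specific.

```cpp
// Arithmetic intensity of a GEMM output tile: 2*M*N*K FLOPs are computed while
// (M*K + N*K) elements of A and B are loaded, so intensity scales with M*N/(M+N).
// Shrinking M (as wave specialization forces here) lowers FLOPs per byte loaded
// and pushes the kernel toward the memory-bound regime.
#include <cstdio>

double arithmetic_intensity(long M, long N, long K, long bytes_per_elem = 2 /* bf16 */) {
    double flops = 2.0 * M * N * K;                          // multiply-adds
    double bytes = double(M * K + N * K) * bytes_per_elem;   // A slab + B slab
    return flops / bytes;
}

int main() {
    const long K = 64;  // depth of one pipeline stage; it cancels in the ratio
    std::printf(" 64x256 output tile: %5.1f FLOPs/byte\n", arithmetic_intensity(64, 256, K));
    std::printf("128x256 output tile: %5.1f FLOPs/byte\n", arithmetic_intensity(128, 256, K));
    return 0;
}
```

Since K cancels, the intensity is proportional to MN/(M + N): the ×ばつ256 tile delivers roughly 1.7x the FLOPs per byte of the ×ばつ256 tile that the producer-consumer split forces.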

As an aside, it might be surprising that AMD matches NVIDIA GEMM performance without all the bells and whistles of wgmma/tcgen05, TMA, producer-consumer specialization, mbarriers, large shared memory for deep multi-stage pipelining, etc. But... AMD has a 2x larger register file, and AMD’s smaller tensor core shapes (e.g., ×ばつ32) provide an alternative path to establish deep pipelines by using finer-granularity load and compute stages.

Scheduling patterns for AMD. Our attempt to use wave specialization, a strategy that works well on NVIDIA GPUs, did not yield the expected speedups on AMD hardware. All is not lost! We found two scheduling patterns that consistently yield high occupancy on AMD GPUs, while using tile programming primitives (no raw assembly)!

  1. 8-wave ping-pong: We assign two waves per SIMD; at any given time, one executes a cluster of memory instructions while the other executes a cluster of compute instructions. The waves swap at the end of each cluster. With this approach, the developer can use large HK tiles since a thread issues many of the same instructions at once! (A toy sketch of this ping-pong pattern appears below.)
  2. 4-wave interleave: We assign one wave per SIMD and threads in this wave finely switch between issuing memory and compute operations. Here, the developer uses small HK tiles (essentially matching the size of the matrix core instruction shape) to achieve the fine-grained schedule.
These two patterns trade off programmability and performance: 8-wave and its large tile primitives lead to compact code, while 4-wave fine-grained interleaving expands code size. Surprisingly, the 8-wave schedule is sufficient to achieve SoTA-level performance on GEMMs and attention forwards. For GQA non-causal attention backwards, 8-wave also outperforms all AMD baselines by 1.8x, and our HK 4-wave kernel further outperforms them by 2.3x.
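To illustrate the ping-pong idea without GPU code, here is a toy CPU analogue (a conceptual sketch only, not the HipKittens implementation; it needs C++20 for std::barrier). Two workers stand in for the two waves sharing a SIMD: each iteration, one fills the staging buffer the other will consume next, the other computes from the buffer filled last iteration, and they swap roles at a barrier.

```cpp
// Toy CPU analogue of the 8-wave ping-pong schedule described above.
#include <barrier>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    constexpr int stages     = 6;     // pipeline iterations
    constexpr int tile_elems = 4;     // tiny stand-in for a shared-memory tile
    std::vector<double> buf[2] = {std::vector<double>(tile_elems),
                                  std::vector<double>(tile_elems)};
    double acc[2] = {0.0, 0.0};       // per-"wave" accumulators
    std::barrier<> swap_point(2);     // the role-swap point between the two waves

    auto wave = [&](int id) {
        for (int s = 0; s < stages; ++s) {
            if ((s + id) % 2 == 0) {
                // "memory cluster": fill the buffer consumed in the next stage
                for (int i = 0; i < tile_elems; ++i) buf[(s + 1) % 2][i] = s + i;
            } else {
                // "compute cluster": reduce over the buffer filled last stage
                for (int i = 0; i < tile_elems; ++i) acc[id] += buf[s % 2][i];
            }
            swap_point.arrive_and_wait();   // both waves swap roles here
        }
    };

    std::thread t0(wave, 0), t1(wave, 1);
    t0.join(); t1.join();
    std::printf("acc[0] = %.1f, acc[1] = %.1f\n", acc[0], acc[1]);
    return 0;
}
```

The 4-wave interleave is the same pipeline collapsed into one worker per SIMD that alternates small load and MFMA bursts itself, which is why it needs finer-grained, instruction-shaped tiles and more code.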


https://hazyresearch.stanford.edu/blog/2025-11-09-amd-brr


Links: arXiv | Code


Footnotes:
  1. We believe that HIPCC register scheduling is one of the most important areas for improvement in AMD's kernel software stack.
  2. We hope these findings lead to hardware changes that support wave specialization, or otherwise guide AMD kernel development; for instance, Mojo currently provides a warp-specialized matmul kernel as of 11/06/2025 even though AMD CDNA doesn’t have register reallocation.
