
News: HipKittens unleashes a tile-based programming abstraction layer for AMD GPUs


marees

AMD GPUs offer state-of-the-art compute and memory bandwidth; however, peak performance AMD kernels are written in raw assembly.

To address the difficulty of mapping AI algorithms to hardware, recent work proposes C++-embedded, PyTorch-inspired domain-specific languages like ThunderKittens (TK) to simplify high-performance AI kernel development on NVIDIA hardware.


We explore the extent to which such primitives — for explicit tile-based programming with optimized memory accesses and fine-grained asynchronous execution across workers — are NVIDIA-specific or general.

We provide the first detailed study of the programming primitives that lead to performant AMD AI kernels, and we encapsulate these insights in the HipKittens (HK) programming framework. We find that the tile-based abstractions used in prior DSLs generalize to AMD GPUs; however, we need to rethink the algorithms that instantiate these abstractions for AMD.

We validate the HK primitives across CDNA3 and CDNA4 AMD platforms. In evaluations, HK kernels compete with AMD’s hand-optimized assembly kernels for GEMMs and attention, and consistently outperform compiler baselines.

Moreover, assembly is difficult to scale to the breadth of AI workloads; reflecting this, in some settings HK outperforms all available kernel baselines by 1.2-2.4× (e.g., d = 64 attention, GQA backwards, memory-bound kernels).

These findings help pave the way for a single, tile-based software layer for high-performance AI kernels that translates across GPU vendors.

HipKittens is released at: https://github.com/HazyResearch/HipKittens.


marees

In May 2024, we shared ThunderKittens (TK), and have been excited to see its ideas used in a wave of frameworks this year, like the CuTe DSL in Sept 2025, tinygrad's "tinykittens", TileLang in April 2025, and Gluon in June 2025. It’s been fun to see TK used in companies like Together AI, Jump Trading, and Cursor, and in academic research.

So then we were curious whether entirely new programming primitives are needed to simplify AMD kernel development, or whether existing primitives suffice. It wasn't obvious to us where this exploration would end up; most modern kernels are designed around NVIDIA-specific hardware features. AMD hardware differs meaningfully (no wgmma/tcgen05 pipelined async matmuls, no TMA, no mbarriers, no register reallocation, smaller shared memory, chiplets instead of a monolithic die, etc.), and we weren't sure where performance would end up or how different the primitives might look compared to NVIDIA frameworks.

Our exploration resulted in HipKittens, a minimal, opinionated collection of C++ embedded programming primitives for fast AMD kernels.


The key insight is separating the interface (tiles and operations on tiles) from the implementation (how tiles map to hardware), allowing the same high-level programming model to target diverse GPU architectures.
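To make that separation concrete, here is a tiny CPU-only C++ sketch of the idea. It is illustrative only: the names (tile, zero, mma_AB) echo the ThunderKittens/HipKittens style but are not the actual HK API, and the loop bodies stand in for the architecture-specific code a real backend would emit.

```cpp
// Conceptual sketch, not the HipKittens API: user code sees only a tile type and
// bulk operations over tiles; the bodies of those operations are the piece a
// backend would specialize (MFMA intrinsics on CDNA, tensor-core MMA on NVIDIA).
#include <array>
#include <cstdio>

template <typename T, int M, int N>
struct tile {                                  // the interface: a fixed-size 2D block
    std::array<T, M * N> data{};
    T&       at(int r, int c)       { return data[r * N + c]; }
    const T& at(int r, int c) const { return data[r * N + c]; }
};

// Bulk operators over tiles; here they are plain loops so the example runs anywhere.
template <typename T, int M, int N>
void zero(tile<T, M, N>& t) { t.data.fill(T{0}); }

template <typename T, int M, int K, int N>
void mma_AB(tile<T, M, N>& c, const tile<T, M, K>& a, const tile<T, K, N>& b) {
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j)
            for (int k = 0; k < K; ++k)
                c.at(i, j) += a.at(i, k) * b.at(k, j);
}

int main() {
    tile<float, 16, 16> a, b, c;
    for (int i = 0; i < 16; ++i)
        for (int j = 0; j < 16; ++j) { a.at(i, j) = 1.0f; b.at(i, j) = 2.0f; }
    zero(c);
    mma_AB(c, a, b);                           // C += A @ B, expressed at tile granularity
    std::printf("c[0][0] = %.1f\n", c.at(0, 0));
    return 0;
}
```

The kernel author only writes the last few lines of logic; how a 16x16 tile is distributed across a wave's registers is the backend's problem, which is what lets the same surface syntax target different vendors.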



We find:
  1. The tile abstraction generalizes across architectures. The core tile-based primitives we identified as effective on NVIDIA GPUs—including tile types, PyTorch-like bulk compute operators over tiles, and composable load/store interfaces—translate naturally to AMD.
  2. Backend implementations are architecture-specific. The underlying memory access patterns (e.g., swizzling schemes, register scheduling) that realize the tile interface differ between AMD and NVIDIA due to hardware differences (a generic swizzling sketch appears below).
  3. Scheduling strategies adapt to hardware constraints. The scheduling patterns both within a processor and across processors differ on AMD compared to NVIDIA, reflecting fundamental architectural distinctions. Wave specialization underperforms on CDNA3 and CDNA4. However, reasoning about schedules at the tile granularity—rather than at the level of individual registers or memory transactions—continues to simplify development, maintain code readability, and enable peak performance.
Ultimately, we see that tile-based abstractions remain general across architectures, providing evidence that a unified, performant programming model for AI accelerators is achievable.
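As a generic illustration of what "architecture-specific memory access patterns" means in point 2 above, here is a textbook XOR swizzle for laying tile rows into shared memory so that column-wise accesses spread across banks. This is a standard technique sketched under simple assumptions (8-element chunks, 64-element rows); it is not necessarily the exact swizzle HipKittens uses on CDNA3/CDNA4.

```cpp
// Generic XOR swizzle sketch: the chunk index of each 8-element group is XOR-ed
// with the low bits of the row, so consecutive rows place "column 0" in different
// bank groups and column-wise reads do not serialize on a single bank.
#include <cstdio>

// Map a (row, col) element of a 64-element-wide tile row to a shared-memory offset.
int swizzled_offset(int row, int col, int row_stride_elems = 64) {
    int chunk    = col / 8;                  // which 8-element (16-byte bf16) chunk
    int within   = col % 8;                  // element inside the chunk
    int swizzled = chunk ^ (row % 8);        // rotate chunks across banks per row
    return row * row_stride_elems + swizzled * 8 + within;
}

int main() {
    // Without swizzling, column 0 of every row would land in the same chunk slot;
    // with it, the first 8 rows each land in a different chunk.
    for (int r = 0; r < 8; ++r)
        std::printf("row %d, col 0 -> offset %3d (chunk %d)\n",
                    r, swizzled_offset(r, 0), swizzled_offset(r, 0) % 64 / 8);
    return 0;
}
```

Which chunk width and XOR mask are right depends on bank width, vector load size, and the matrix-core fragment layout, which is precisely the part that differs between CDNA and NVIDIA backends.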


https://hazyresearch.stanford.edu/blog/2025-11-09-hk

marees

Wave specialization struggles on AMD. Wave specialization is the dominant paradigm for achieving high occupancy on modern NVIDIA GPUs: producer waves focus on memory movement while consumer waves focus on computation. This strategy underpins today’s state-of-the-art AI kernels—including FlashAttention-3, COMET for MoE models, and high-performance GEMMs—as well as kernel DSLs such as ThunderKittens and TileLang.

But, we show that wave specialization underperforms on AMD due to the lack of register reallocation. On the MI355X, registers are statically divided across all waves. Producer waves that only need a few registers for address calculation are allocated more registers than they need; consumer waves cannot recoup those registers and must either spill registers to scratch memory or run at a lower arithmetic intensity. Both are disastrous for performance. Wave specialization limits the output tile size and makes our kernels more memory bound. For GEMMs, data loaded from memory is O(MK + NK) while compute is O(MNK). Decreasing the M or N in our per-thread-block output tile size lowers arithmetic intensity.[2]

# P / # C | MFMA Shape | Output  | TFLOPS
HK 4 / 8  | ×ばつ32   | ×ばつ256  | 893
HK 4 / 12 | ×ばつ32   | ×ばつ256  | 1278
HK 0 / 8  | ×ばつ32   | ×ばつ256 | 1281
HK 0 / 8  | ×ばつ32   | ×ばつ256 | 1605
TK        | ×ばつ16   | ×ばつ256 | 1538
CUTLASS   | ×ばつ16   | ×ばつ256 | 1570
Figure: Wave specialization underperforms on AMD GPUs. We benchmark AMD GEMMs on the MI355X using different numbers of producers (P) and consumer (C) waves. We report the matrix core intrinsic shape, output tile size computed per thread block, and TFLOPS (500 iterations warmup / 100 iterations measured). The CUTLASS GEMM is selected and tuned using the CUTLASS profiler tool on a B200 GPU.
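To make the arithmetic-intensity argument concrete, here is a small self-contained calculation. The tile shapes are taken from the table above; the accounting (2MNK FLOPs against (MK + NK) loaded elements per output tile) is the standard GEMM roofline argument, not anything HK-specific.

```cpp
// Arithmetic intensity of a GEMM output tile: 2*M*N*K FLOPs are computed while
// (M*K + N*K) elements of A and B are loaded, so intensity scales with M*N/(M+N).
// Shrinking M (as wave specialization forces here) lowers FLOPs per byte loaded
// and pushes the kernel toward the memory-bound regime.
#include <cstdio>

double arithmetic_intensity(long M, long N, long K, long bytes_per_elem = 2 /* bf16 */) {
    double flops = 2.0 * M * N * K;                          // multiply-adds
    double bytes = double(M * K + N * K) * bytes_per_elem;   // A slab + B slab
    return flops / bytes;
}

int main() {
    const long K = 64;  // depth of one pipeline stage; it cancels in the ratio
    std::printf(" 64x256 output tile: %5.1f FLOPs/byte\n", arithmetic_intensity(64, 256, K));
    std::printf("128x256 output tile: %5.1f FLOPs/byte\n", arithmetic_intensity(128, 256, K));
    return 0;
}
```

Since K cancels, the intensity is proportional to MN/(M + N): the ×ばつ256 tile delivers roughly 1.7x the FLOPs per byte of the ×ばつ256 tile that the producer-consumer split forces.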

As an aside, it might be surprising that AMD matches NVIDIA GEMM performance without all the bells and whistles of wgmma/tcgen05, TMA, producer-consumer specialization, mbarriers, large shared memory for deep multi-stage pipelining, etc. But... AMD has a 2x larger register file, and AMD’s smaller tensor core shapes (e.g., ×ばつ32) provide an alternative path to establish deep pipelines by using finer-granularity load and compute stages.

Scheduling patterns for AMD. Our attempt to use wave specialization, a strategy that works well on NVIDIA GPUs, did not yield the expected speedups on AMD hardware. All is not lost! We found two scheduling patterns that consistently yield high occupancy on AMD GPUs, while using tile programming primitives (no raw assembly)!

  1. 8-wave ping-pong: We assign two waves per SIMD; at any given time, one executes a cluster of memory instructions while the other executes a cluster of compute instructions. The waves swap at the end of each cluster. With this approach, the developer can use large HK tiles since a thread issues many of the same instructions at once! (A toy sketch of this ping-pong pattern appears below.)
  2. 4-wave interleave: We assign one wave per SIMD and threads in this wave finely switch between issuing memory and compute operations. Here, the developer uses small HK tiles (essentially matching the size of the matrix core instruction shape) to achieve the fine-grained schedule.
These two patterns trade off programmability and performance: 8-wave and its large tile primitives lead to compact code, while 4-wave fine-grained interleaving expands code size. Surprisingly, the 8-wave schedule is sufficient to achieve SoTA-level performance on GEMMs and attention forwards. For GQA non-causal attention backwards, 8-wave also outperforms all AMD baselines by 1.8x, and our HK 4-wave kernel further outperforms them by 2.3x.
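To illustrate the ping-pong idea without GPU code, here is a toy CPU analogue (a conceptual sketch only, not the HipKittens implementation; it needs C++20 for std::barrier). Two workers stand in for the two waves sharing a SIMD: each iteration, one fills the staging buffer the other will consume next, the other computes from the buffer filled last iteration, and they swap roles at a barrier.

```cpp
// Toy CPU analogue of the 8-wave ping-pong schedule described above.
#include <barrier>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    constexpr int stages     = 6;     // pipeline iterations
    constexpr int tile_elems = 4;     // tiny stand-in for a shared-memory tile
    std::vector<double> buf[2] = {std::vector<double>(tile_elems),
                                  std::vector<double>(tile_elems)};
    double acc[2] = {0.0, 0.0};       // per-"wave" accumulators
    std::barrier<> swap_point(2);     // the role-swap point between the two waves

    auto wave = [&](int id) {
        for (int s = 0; s < stages; ++s) {
            if ((s + id) % 2 == 0) {
                // "memory cluster": fill the buffer consumed in the next stage
                for (int i = 0; i < tile_elems; ++i) buf[(s + 1) % 2][i] = s + i;
            } else {
                // "compute cluster": reduce over the buffer filled last stage
                for (int i = 0; i < tile_elems; ++i) acc[id] += buf[s % 2][i];
            }
            swap_point.arrive_and_wait();   // both waves swap roles here
        }
    };

    std::thread t0(wave, 0), t1(wave, 1);
    t0.join(); t1.join();
    std::printf("acc[0] = %.1f, acc[1] = %.1f\n", acc[0], acc[1]);
    return 0;
}
```

The 4-wave interleave is the same pipeline collapsed into one worker per SIMD that alternates small load and MFMA bursts itself, which is why it needs finer-grained, instruction-shaped tiles and more code.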


https://hazyresearch.stanford.edu/blog/2025-11-09-amd-brr


Links: arXiv | Code


Footnotes:
  1. We believe that HIPCC register scheduling is one of the most important areas for improvement in AMD's kernel software stack.
  2. We hope these findings lead to hardware changes that support wave specialization, or otherwise guide AMD kernel development; for instance, Mojo currently provides a warp-specialized matmul kernel as of 11/06/2025 even though AMD CDNA doesn’t have register reallocation.
