Minimal FlashAttention in CUDA C++/CuTe: readable WMMA/CuTe kernels, no NxN workspace, up to 4.5x faster than naive PyTorch
-
Updated
Jun 2, 2026 - Cuda
Minimal FlashAttention in CUDA C++/CuTe: readable WMMA/CuTe kernels, no NxN workspace, up to 4.5x faster than naive PyTorch
FastCuda is a handwritten CUDA operator library featuring progressive GEMM and Reduce kernels, cuBLAS benchmarking, and C/C++/Python interfaces for learning, profiling, and performance optimization.
Bilingual CUDA SGEMM optimization tutorial and reference implementation, from naive kernels to Tensor Core WMMA | 双语 CUDA SGEMM 优化教程与参考实现,从朴素内核到 Tensor Core WMMA
10,000-image LeNet-5 forward pass in ~28 ms on a single A40 via fused convolution and Tensor Cores (TF32).
WMMA FP16-->FP32 Tensor Core GEMM with shared-memory tiling and cp.async-style pipelining, benchmarked against cuBLAS.
Add a description, image, and links to the wmma topic page so that developers can more easily learn about it.
To associate your repository with the wmma topic, visit your repo's landing page and select "manage topics."