# wmma

Here are 5 public repositories matching this topic...

lavawolfiee / mini-flash-attention

Minimal FlashAttention in CUDA C++/CuTe: readable WMMA/CuTe kernels, no NxN workspace, up to 4.5x faster than naive PyTorch

cuda attention cutlass cute gpu-kernels pytorch-extension tensor-cores llm flash-attention flashattention wmma

Updated Jun 2, 2026
Cuda

loveSunning / FastCuda

FastCuda is a handwritten CUDA operator library featuring progressive GEMM and Reduce kernels, cuBLAS benchmarking, and C/C++/Python interfaces for learning, profiling, and performance optimization.

reduce spmv sgemm spmm cudac sgemv tensor-core hgemm flash-attention wmma

Updated Mar 18, 2026
Cuda

LessUp / sgemm-optimization

Bilingual (English/Chinese) CUDA SGEMM optimization tutorial and reference implementation, from naive kernels to Tensor Core WMMA

tutorial cuda matrix-multiplication high-performance-computing cuda-kernels shared-memory gemm sgemm gpu-optimization bank-conflict tensor-cores wmma

Updated May 28, 2026
Cuda

Yoonkyu-Lee / batched-lenet-cuda

10,000-image LeNet-5 forward pass in ~28 ms on a single A40 via fused convolution and Tensor Cores (TF32).

parallel-computing cuda inference cnn matrix-multiplication lenet convolution gpu-computing ampere gpu-programming lenet-5 im2col cuda-programming tensor-cores kernel-optimization tf32 wmma

Updated Apr 26, 2026
Cuda

Pupking / 02_mixed_precision_gemm

WMMA FP16-->FP32 Tensor Core GEMM with shared-memory tiling and cp.async-style pipelining, benchmarked against cuBLAS.

cuda cublas gpu-performance mixed-precision tensor-cores nsight-compute wmma

Updated Apr 20, 2026
Cuda
