FastCuda is a handwritten CUDA operator library featuring progressive GEMM and Reduce kernels, cuBLAS benchmarking, and C/C++/Python interfaces for learning, profiling, and performance optimization.
Updated Mar 18, 2026 · Cuda
Fast SpMM implementation on GPUs for GNN (IPDPS'23)
Code for DTC-SpMM (ASPLOS'24)
📚 Coursework for "Introduction to High Performance Computing" (30240192)
Proof-of-concept implementation of a search engine that uses sparse matrix multiplication to identify the best peptide candidates for a given mass spectrum.
RA-SpMM: Regime-Aware Sparse Matrix Multiplication for GNN Workloads on GPUs. 8-rule router, 6 preprocessing-free kernels, 3.25x over cuSPARSE (FGCS 2026).
Searching for peptide candidates using sparse matrix + matrix/vector multiplication.
Reproducible Instruction Roofline analysis of cuSPARSE and Ginkgo SpMM on RTX 4090 using Nsight Compute metrics.