| ✔️ nsys/ncu(timeline/ptx/sass) | / | / | link | ⭐️ |
| ✔️ elementwise_f32 | f32 | / | link | ⭐️ |
| ✔️ elementwise_f32x4 | f32 | / | link | ⭐️ |
| ✔️ elementwise_f16 | f16 | / | link | ⭐️ |
| ✔️ elementwise_f16x2 | f16 | / | link | ⭐️ |
| ✔️ elementwise_f16x8 | f16 | / | link | ⭐️ |
| ✔️ elementwise_f16x8_pack | f16 | / | link | ⭐️⭐️ |
| ✔️ histogram_i32 | i32 | / | link | ⭐️ |
| ✔️ histogram_i32x4 | i32 | / | link | ⭐️ |
| ✔️ sigmoid_f32 | f32 | / | link | ⭐️ |
| ✔️ sigmoid_f32x4 | f32 | / | link | ⭐️ |
| ✔️ sigmoid_f16 | f16 | / | link | ⭐️ |
| ✔️ sigmoid_f16x2 | f16 | / | link | ⭐️ |
| ✔️ sigmoid_f16x8 | f16 | / | link | ⭐️ |
| ✔️ sigmoid_f16x8_pack | f16 | / | link | ⭐️⭐️ |
| ✔️ relu_f32 | f32 | / | link | ⭐️ |
| ✔️ relu_f32x4 | f32 | / | link | ⭐️ |
| ✔️ relu_f16 | f16 | / | link | ⭐️ |
| ✔️ relu_f16x2 | f16 | / | link | ⭐️ |
| ✔️ relu_f16x8 | f16 | / | link | ⭐️ |
| ✔️ relu_f16x8_pack | f16 | / | link | ⭐️⭐️ |
| ✔️ gelu_f32 | f32 | / | link | ⭐️ |
| ✔️ gelu_f32x4 | f32 | / | link | ⭐️ |
| ✔️ gelu_f16 | f16 | / | link | ⭐️ |
| ✔️ gelu_f16x2 | f16 | / | link | ⭐️ |
| ✔️ gelu_f16x8 | f16 | / | link | ⭐️ |
| ✔️ gelu_f16x8_pack | f16 | / | link | ⭐️⭐️ |
| ✔️ swish_f32 | f32 | / | link | ⭐️ |
| ✔️ swish_f32x4 | f32 | / | link | ⭐️ |
| ✔️ swish_f16 | f16 | / | link | ⭐️ |
| ✔️ swish_f16x2 | f16 | / | link | ⭐️ |
| ✔️ swish_f16x8 | f16 | / | link | ⭐️ |
| ✔️ swish_f16x8_pack | f16 | / | link | ⭐️⭐️ |
| ✔️ embedding_f32 | f32 | / | link | ⭐️ |
| ✔️ embedding_f32x4 | f32 | / | link | ⭐️ |
| ✔️ embedding_f32x4_pack | f32 | / | link | ⭐️ |
| ✔️ embedding_f16 | f16 | / | link | ⭐️ |
| ✔️ embedding_f16x2 | f16 | / | link | ⭐️ |
| ✔️ embedding_f16x8 | f16 | / | link | ⭐️ |
| ✔️ embedding_f16x8_pack | f16 | / | link | ⭐️⭐️ |
| ✔️ mat_trans_f32_col2row{2d} | f32 | / | link | ⭐️ |
| ✔️ mat_trans_f32_row2col{2d} | f32 | / | link | ⭐️ |
| ✔️ mat_trans_f32_diagonal2d | f32 | / | link | ⭐️⭐️ |
| ✔️ mat_trans_f32x4_col2row{2d} | f32 | / | link | ⭐️⭐️ |
| ✔️ mat_trans_f32x4_row2col{2d} | f32 | / | link | ⭐️⭐️ |
| ✔️ warp_reduce_[all] | all | all | link | ⭐️⭐️ |
| ✔️ reduce_f32_f32 | f32 | f32 | link | ⭐️⭐️ |
| ✔️ reduce_f32x4_f32 | f32 | f32 | link | ⭐️⭐️ |
| ✔️ reduce_f16_f16 | f16 | f16 | link | ⭐️⭐️ |
| ✔️ reduce_f16_f32 | f16 | f32 | link | ⭐️⭐️ |
| ✔️ reduce_f16x2_f16 | f16 | f16 | link | ⭐️⭐️ |
| ✔️ reduce_f16x2_f32 | f16 | f32 | link | ⭐️⭐️ |
| ✔️ reduce_f16x8_pack_f16 | f16 | f16 | link | ⭐️⭐️ |
| ✔️ reduce_f16x8_pack_f32 | f16 | f32 | link | ⭐️⭐️ |
| ✔️ reduce_bf16_bf16 | bf16 | bf16 | link | ⭐️⭐️ |
| ✔️ reduce_bf16_f32 | bf16 | f32 | link | ⭐️⭐️ |
| ✔️ reduce_bf16x2_bf16 | bf16 | bf16 | link | ⭐️⭐️ |
| ✔️ reduce_bf16x2_f32 | bf16 | f32 | link | ⭐️⭐️ |
| ✔️ reduce_bf16x8_pack_bf16 | bf16 | bf16 | link | ⭐️⭐️ |
| ✔️ reduce_bf16x8_pack_f32 | bf16 | f32 | link | ⭐️⭐️ |
| ✔️ reduce_fp8_e4m3_f16 | fp8_e4m3 | f16 | link | ⭐️⭐️ |
| ✔️ reduce_fp8_e5m2_f16 | fp8_e5m2 | f16 | link | ⭐️⭐️ |
| ✔️ reduce_fp8_e4m3x16_pack_f16 | fp8_e4m3 | f16 | link | ⭐️⭐️ |
| ✔️ reduce_fp8_e5m2x16_pack_f16 | fp8_e5m2 | f16 | link | ⭐️⭐️ |
| ✔️ reduce_i8_i32 | i8 | i32 | link | ⭐️⭐️ |
| ✔️ reduce_i8x16_pack_i32 | i8 | i32 | link | ⭐️⭐️ |
| ✔️ dot_product_f32 | f32 | f32 | link | ⭐️⭐️ |
| ✔️ dot_product_f32x4 | f32 | f32 | link | ⭐️⭐️ |
| ✔️ dot_product_f16_f32 | f16 | f32 | link | ⭐️⭐️ |
| ✔️ dot_product_f16x2_f32 | f16 | f32 | link | ⭐️⭐️ |
| ✔️ dot_product_f16x8_pack_f32 | f16 | f32 | link | ⭐️⭐️ |
| ✔️ softmax_f32(fence) | f32 | f32 | link | ⭐️⭐️ |
| ✔️ softmax_f32x4(fence) | f32 | f32 | link | ⭐️⭐️ |
| ✔️ softmax_f32 | f32 | f32 | link | ⭐️⭐️ |
| ✔️ softmax_f32x4 | f32 | f32 | link | ⭐️⭐️ |
| ✔️ safe_softmax_f32 | f32 | f32 | link | ⭐️⭐️ |
| ✔️ safe_softmax_f32x4 | f32 | f32 | link | ⭐️⭐️ |
| ✔️ safe_softmax_f16_f32 | f16 | f32 | link | ⭐️⭐️ |
| ✔️ safe_softmax_f16x2_f32 | f16 | f32 | link | ⭐️⭐️ |
| ✔️ safe_softmax_f16x8_pack_f32 | f16 | f32 | link | ⭐️⭐️ |
| ✔️ online_safe_softmax_f32 | f32 | f32 | link | ⭐️⭐️ |
| ✔️ online_safe_softmax_f32x4_pack | f32 | f32 | link | ⭐️⭐️ |
| ✔️ rope_f32 | f32 | f32 | link | ⭐️⭐️ |
| ✔️ rope_f32x4_pack | f32 | f32 | link | ⭐️⭐️ |
| ✔️ layer_norm_f32 | f32 | f32 | link | ⭐️⭐️ |
| ✔️ layer_norm_f32x4 | f32 | f32 | link | ⭐️⭐️ |
| ✔️ layer_norm_f16_f16 | f16 | f16 | link | ⭐️⭐️ |
| ✔️ layer_norm_f16x2_f16 | f16 | f16 | link | ⭐️⭐️ |
| ✔️ layer_norm_f16x8_f16 | f16 | f16 | link | ⭐️⭐️ |
| ✔️ layer_norm_f16x8_pack_f16 | f16 | f16 | link | ⭐️⭐️ |
| ✔️ layer_norm_f16x8_pack_f32 | f16 | f32 | link | ⭐️⭐️ |
| ✔️ layer_norm_f16_f32 | f16 | f32 | link | ⭐️⭐️ |
| ✔️ rms_norm_f32 | f32 | f32 | link | ⭐️⭐️ |
| ✔️ rms_norm_f32x4 | f32 | f32 | link | ⭐️⭐️ |
| ✔️ rms_norm_f16_f16 | f16 | f16 | link | ⭐️⭐️ |
| ✔️ rms_norm_f16x2_f16 | f16 | f16 | link | ⭐️⭐️ |
| ✔️ rms_norm_f16x8_f16 | f16 | f16 | link | ⭐️⭐️ |
| ✔️ rms_norm_f16x8_f32 | f16 | f32 | link | ⭐️⭐️ |
| ✔️ rms_norm_f16x8_pack_f16 | f16 | f16 | link | ⭐️⭐️ |
| ✔️ rms_norm_f16x8_pack_f32 | f16 | f32 | link | ⭐️⭐️ |
| ✔️ rms_norm_f16_f32 | f16 | f32 | link | ⭐️⭐️ |
| ✔️ sgemm_naive_f32 | f32 | f32 | link | ⭐️⭐️ |
| ✔️ sgemm_sliced_k_f32 | f32 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ sgemm_t_8x8_sliced_k_f32x4 | f32 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ sgemm_t_8x8_sliced_k...bcf | f32 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ sgemm_t_8x8_sliced_k...dbuf | f32 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ sgemm_t_8x8_sliced_k16...dbuf | f32 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ sgemm_t_8x8_sliced_k16...async | f32 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ sgemm_wmma_m16n16k8...stages* | tf32 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ sgemm_wmma_m16n16k8...swizzle* | tf32 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_naive_f16 | f16 | f16 | link | ⭐️⭐️ |
| ✔️ hgemm_sliced_k_f16 | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_t_8x8_sliced_k_f16x4 | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_t_8x8_sliced_k_f16x4_pack | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_t_8x8_sliced_k_f16x8_pack | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_t_8x8_sliced_k...dbuf | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_t_8/16x8...k16/32...dbuf | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_t_8/16x8...k16/32...async | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_wmma_m16n16k16...naive* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_wmma_m16n16k16...mma4x2* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_wmma_m16n16k16...mma4x4* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_wmma_m16n16k16...dbuf* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_wmma_m32n8k16....dbuf* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_wmma_m16n16k16...stages* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_wmma_m16n16k16...swizzle* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_mma_m16n8k16...naive* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_mma_m16n8k16...mma2x4* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_mma_m16n8k16...stages* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemm_mma_m16n8k16...swizzle* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ sgemv_k32_f32 | f32 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ sgemv_k128_f32x4 | f32 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ sgemv_k16_f32 | f32 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ hgemv_k32_f16 | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemv_k128_f16x4 | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ hgemv_k16_f16 | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ flash_attn_1_fwd_f32 | f32 | f32 | link | ⭐️⭐️⭐️ |
| ✔️ flash_attn_2_fwd_f16_m16n8k16* | f16 | f16 | link | ⭐️⭐️⭐️ |
| ✔️ nms_f32 | f32 | / | link | ⭐️⭐️ |
| ✔️ notes v1(deprecated) | f32 | f32 | / | ⭐️ |