Releases: Zaneham/BarraCUDA

BarraCUDA 0.5

29 May 04:30
@Zaneham Zaneham
cdbbf67
This commit was created on GitHub.com and signed with GitHub’s verified signature.
GPG key ID: B5690EEEBB952194

The first tagged release. The headline is that you can write a Triton kernel, matmul and all, and run it on a CPU with
no GPU. The --cpu backend lowers BIR straight to x86-64 with the SIMT model collapsed into a thread loop, and the
rank-2 tile path materialises and unrolls so tl.dot plus a K-loop sweeps an arbitrary contraction.
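To make "the SIMT model collapsed into a thread loop" concrete, here is a minimal hand-written sketch in C. It is not BarraCUDA's generated code; `thread_ctx`, `vec_add_body` and `vec_add_launch` are hypothetical names chosen for illustration. The idea is that the kernel body is written for one logical thread, and the launch turns the grid and block dimensions into ordinary nested loops.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the indices a SIMT thread reads
   (blockIdx.x, threadIdx.x, blockDim.x in CUDA terms). */
typedef struct {
    unsigned block_idx;
    unsigned thread_idx;
    unsigned block_dim;
} thread_ctx;

/* Kernel body for one logical thread: c[i] = a[i] + b[i]. */
void vec_add_body(const thread_ctx *t, const float *a,
                  const float *b, float *c, size_t n)
{
    size_t i = (size_t)t->block_idx * t->block_dim + t->thread_idx;
    if (i < n)              /* same bounds guard a GPU kernel would carry */
        c[i] = a[i] + b[i];
}

/* The "launch": grid and block dimensions become two nested loops. */
void vec_add_launch(unsigned grid_dim, unsigned block_dim,
                    const float *a, const float *b, float *c, size_t n)
{
    assert(block_dim > 0);
    for (unsigned blk = 0; blk < grid_dim; blk++) {
        for (unsigned thr = 0; thr < block_dim; thr++) {
            thread_ctx ctx = { blk, thr, block_dim };
            vec_add_body(&ctx, a, b, c, n);
        }
    }
}
```

A plain loop like this only covers kernels whose threads never synchronise with each other; a kernel with block-level barriers would need the loop split at each barrier, which this sketch does not attempt.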

New in this cycle

  • CPU backend (--cpu). CUDA and Triton kernels compile to a host object and run natively. Headline demo:
    examples/cpu_launch_matmul.c.
  • RISC-V backend (--rv64). Same idea, RV64IMFD objects that run under qemu.
  • Cross-backend differential testing (tests/diff/). Same BIR through two backends, diff the output buffers, CPU
    is the oracle. Every case runs --inject so a green result actually means something.
  • Triton scalar math intrinsics. exp, log, sin, cos, tan, tanh, sqrt, rsqrt, abs, floor,
    ceil, maximum, minimum, fdiv. Thanks to @shivam2931120 for the PR, radians-to-turns convention done right.
  • Triton constexpr ABI compaction. tl.constexpr params with defaults fold to literals and drop out of the
    runtime signature.
  • CUDA fixes. --cpu / --rv64 on their own now run sema and don't trip the parse-dump fallback. Typedef-struct
    kernels compile through --cpu, and --parse no longer segfaults on synthetic anon names.
  • --version flag.
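The differential-testing bullet above can be sketched as a buffer comparison plus a self-check. This is an illustration only, not the code in tests/diff/: `first_mismatch` and `detects_injected_fault` are hypothetical names, the tolerance-based comparison is an assumption, and the self-check reads --inject as a deliberate fault that a working harness must flag, which is also an assumption about that flag.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Compare an oracle buffer against a candidate buffer.
   Returns the index of the first element that differs by more than tol
   (NaN differences count as mismatches), or -1 if the buffers agree. */
long first_mismatch(const float *ref, const float *got, size_t n, float tol)
{
    assert(tol >= 0.0f);
    for (size_t i = 0; i < n; i++) {
        if (ref[i] == got[i])
            continue;                 /* exact match, including equal infinities */
        float d = ref[i] - got[i];
        if (d < 0.0f)
            d = -d;
        if (!(d <= tol))              /* written this way so NaN is a mismatch */
            return (long)i;
    }
    return -1;
}

/* Hypothetical self-check: corrupt one element of a copy of the oracle and
   require the comparison to report exactly that element. Assumes values of
   modest magnitude, so that adding 1.0f actually changes the element.
   Returns 1 if the fault was caught, 0 otherwise. */
int detects_injected_fault(const float *ref, float *scratch, size_t n, float tol)
{
    assert(n > 0);
    size_t k = n / 2;
    memcpy(scratch, ref, n * sizeof *ref);
    scratch[k] = ref[k] + 1.0f + 2.0f * tol;
    return first_mismatch(ref, scratch, n, tol) == (long)k;
}
```

The point of the second function is that a comparison which can never fail proves nothing; a harness that passes the clean case and also catches the corrupted case gives a green result some weight.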

Targets

  • AMD GPU. --amdgpu-bin for ELF code objects (.hsaco), --amdgpu for assembly. CDNA 2 (gfx90a, MI250), CDNA
    3 (gfx942, MI300X), RDNA 2 (gfx1030), RDNA 3 (gfx1100), RDNA 4 (gfx1200).

  • NVIDIA PTX. --nvidia-ptx, defaults to sm_89. JIT-loaded via the CUDA Driver API.

  • Tenstorrent Metalium C++. --tensix, SFPU compute.

  • Tenstorrent baby cores. --rv-elf, native RV32IM ELF via the TDF layer. Integer kernels for now; a soft-float
    runtime exists but is not yet wired in.

  • x86-64 host object. --cpu, links and runs on Linux.

  • RV64IMFD. --rv64, runs under qemu-riscv64.

  • Apple Metal MSL. --metal, stub backend, hardware validation pending.

  • Intel SPIR-V for Xe. --intel-spirv, stub backend.

Frontends

  • CUDA C. The same .cu files you'd feed nvcc. Real preprocessor (#include, function-like macros,
    #ifdef/#if/#elif).

  • HIP. --hip or .hip files, CUDA's AMD sibling with the same kernel-language shape.

  • Triton. @triton.jit Python source through a from-scratch lex/parse/sema/lower. Rank-1 and rank-2 tiles,
    tl.dot, K-loop tiling, the math intrinsics above.
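For readers unfamiliar with the tl.dot plus K-loop pattern mentioned in the Triton bullet, the sketch below shows the same shape in plain C: one accumulator tile per output block, with a loop over K that feeds successive tiles of A and B into a small tile product. The tile sizes `BM`, `BN`, `BK` and the function names are hypothetical, and this is a hand-written illustration rather than what the --cpu backend emits.

```c
#include <assert.h>
#include <stddef.h>

enum { BM = 2, BN = 2, BK = 2 };   /* hypothetical tile sizes */

/* acc += a_tile (BM x BK) * b_tile (BK x BN): the role tl.dot plays.
   lda and ldb are the row strides of the full A and B matrices. */
void tile_dot(const float *a, size_t lda,
              const float *b, size_t ldb, float acc[BM][BN])
{
    for (size_t i = 0; i < BM; i++)
        for (size_t j = 0; j < BN; j++)
            for (size_t k = 0; k < BK; k++)
                acc[i][j] += a[i * lda + k] * b[k * ldb + j];
}

/* C (M x N) = A (M x K) * B (K x N), all row-major.
   This sketch requires M, N and K to be multiples of the tile sizes. */
void matmul_tiled(const float *A, const float *B, float *C,
                  size_t M, size_t N, size_t K)
{
    assert(M % BM == 0 && N % BN == 0 && K % BK == 0);
    for (size_t m0 = 0; m0 < M; m0 += BM) {
        for (size_t n0 = 0; n0 < N; n0 += BN) {
            float acc[BM][BN] = {{0.0f}};          /* one output tile */
            for (size_t k0 = 0; k0 < K; k0 += BK)  /* the K-loop */
                tile_dot(&A[m0 * K + k0], K, &B[k0 * N + n0], N, acc);
            for (size_t i = 0; i < BM; i++)
                for (size_t j = 0; j < BN; j++)
                    C[(m0 + i) * N + (n0 + j)] = acc[i][j];
        }
    }
}
```

Each (m0, n0) iteration corresponds to the work one Triton program instance would do for its output tile; the contraction length K can be any multiple of `BK` because the accumulator carries across K-loop steps.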

Mainframe curios

  • ABEND dumps (src/runtime/bc_abend.*). GPU faults become IBM-style G0Cx completion codes, correlated against
    tracked allocations with a dispatch snapshot. Fires automatically on the HSA path.

  • SNAP (--snap). Per-kernel parameter dump on entry. AMD only for now.

  • SYSPRINT. Class-tagged structured kernel output, pattern-routed sinks on the host. Demo:
    examples/sysprint_kernel.cu + examples/launch_sysprint.c.

  • TDF (Tile DataFlow). The layer above BIR for dataflow targets: regions, channels, NoC arcs, L1 placement,
    multi-core fission. Dump with --tdf.

Runtime + tooling

  • HSA runtime launcher (src/runtime/bc_runtime.h). Loads libhsa-runtime64.so at runtime via dlopen, so the
    launcher itself has no compile-time ROCm dependency.

  • Bilingual errors (--lang <file>). Te reo Māori translation included; the format is data, so any language with
    a translation file works.

  • Optimisation passes. mem2reg, constant folding, dead code elimination. Each one is skippable (--no-mem2reg /
    --no-cfold / --no-dce) for bisection.

  • Differential testing harness (tests/diff/), as above.
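The dlopen approach described for the HSA launcher can be sketched as follows. This is a generic POSIX illustration, not the contents of src/runtime/bc_runtime.h: `load_entry` and `entry_fn` are hypothetical names, and the `int (*)(void)` signature is an assumption standing in for an entry point such as the HSA runtime's `hsa_init`, whose real return type is `hsa_status_t`.

```c
#include <assert.h>
#include <dlfcn.h>
#include <stddef.h>

/* Assumed shape of a no-argument entry point returning a status code. */
typedef int (*entry_fn)(void);

/* Open a shared library by name and look up one symbol in it.
   Returns 0 on success, -1 if the library cannot be opened,
   -2 if the symbol is missing. On failure *lib_out and *fn_out are NULL.
   On success the caller owns *lib_out and releases it with dlclose. */
int load_entry(const char *soname, const char *symbol,
               void **lib_out, entry_fn *fn_out)
{
    assert(soname != NULL && symbol != NULL);
    *lib_out = NULL;
    *fn_out = NULL;

    void *lib = dlopen(soname, RTLD_NOW | RTLD_LOCAL);
    if (lib == NULL)
        return -1;          /* e.g. ROCm not installed on this machine */

    entry_fn fn = NULL;
    /* POSIX-recommended way to store dlsym's void * into a function pointer. */
    *(void **)(&fn) = dlsym(lib, symbol);
    if (fn == NULL) {
        dlclose(lib);
        return -2;
    }

    *lib_out = lib;
    *fn_out = fn;
    return 0;
}
```

A caller might try `load_entry("libhsa-runtime64.so", "hsa_init", &lib, &init)` and fall back gracefully on a non-zero result. Because the library is resolved at run time, the binary builds on machines without ROCm headers or libraries; on older glibc versions the program may need to link with -ldl.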

Validated on real silicon

  • AMD MI300X (CDNA 3, gfx942). 8/8 test kernels passing. Moa Monte Carlo neutron transport produces correct
    physics (k_eff = 0.995 vs reference 1.000).

  • AMD RDNA 3 (gfx1100). Full test suite via the tinygrad mockgpu emulator in CI.

  • NVIDIA RTX 4060 Ti. Moa transport benchmark produces correct results with a 3.8x speedup over single-thread CPU.
    No NVCC anywhere in the pipeline.

  • Tenstorrent Blackhole. Compiles to valid Metalium C++.

Because this is the first tagged release, these notes summarise the project as it stands rather than list every change. See CHANGELOG.txt for the full prose.

Thanks to everyone who has contributed, especially @nataliakokoromyti and @shivam2931120, and to those who have sent in tips and tricks and raised issues.
