JustAResearcher/Latency-Based-GPU-Algorithm


GPUx — ASIC-resistant PoW for GPUs


Status: v0.1.5 community testing
Target: Replacement for Cuckaroo29 (C29) in Tari (XTM)
Goal: GPU-native, ASIC-resistant proof-of-work; low power; cheap verifier.


What this is

GPUx is a candidate proof-of-work algorithm designed to make GPU mining durable against ASIC takeover. It combines random per-epoch programs, a 2 GiB random-access DAG, and a per-thread scratchpad to force any would-be ASIC into looking like a GPU — at which point the ASIC has no cost advantage.

Three artifacts in this repo:

  1. Algorithm spec (ALGORITHM_SPEC.md) — formal definition.
  2. Reference C implementation (spec/) — the authoritative semantics.
  3. CUDA implementation + bench harness (cuda/, bench/) — what community testers run on their GPUs.

If you are a community tester, jump to COMMUNITY_TESTING.md.

If you are reviewing the algorithm, start with ALGORITHM_SPEC.md and then docs/DESIGN_RATIONALE.md.


Quick numbers (RTX 5090, unoptimized v0.1 kernel)

| Metric | Value |
| --- | --- |
| Hashrate | ~1.25 MH/s |
| DAG generation | 2 GiB in ~30 ms (~65 GB/s) |
| Per-share verify | ~0.5 ms (warm DAG) |
| GPU vs reference | bit-identical (5/5 KAT nonces) |

These are baseline numbers from a reference port. Optimized kernels (warp-cooperative DAG access, shared-memory scratchpad, instruction reordering) are expected to multiply throughput without changing consensus.
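As a rough sanity check on the figures above, the arithmetic can be sketched in C. The helper names here are hypothetical and not part of the repo. The inputs are the numbers quoted in the table, plus the 16 384 ops/nonce figure from the ASIC-resistance summary.

```c
#include <stdint.h>

/* Hypothetical helpers for back-of-envelope checks; not part of the GPUx sources. */

/* Bytes per second achieved when `bytes` are produced in `millis` milliseconds. */
static double bytes_per_second(uint64_t bytes, double millis) {
    return (double)bytes / (millis / 1000.0);
}

/* Program ops executed per second at a given hashrate. */
static double ops_per_second(double hashes_per_second, uint64_t ops_per_nonce) {
    return hashes_per_second * (double)ops_per_nonce;
}
```

Calling `bytes_per_second(2ull << 30, 30.0)` gives roughly 7.16e10 B/s, which is about 66.7 GiB/s. That is in the same range as the quoted ~65 GB/s, given that both figures are rounded. Calling `ops_per_second(1.25e6, 16384)` gives 2.048e10 program ops per second at the baseline hashrate.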


Why GPUx is hard for ASICs (one-screen summary)

ASICs win when the algorithm is small, homogeneous, and predictable. GPUx attacks each premise:

| ASIC-friendly property | GPUx mechanism |
| --- | --- |
| Predictable kernel | Random program regenerated every 1024 blocks |
| Small kernel | 256 ops × 64 iters = 16 384 ops/nonce, 12 distinct opcodes, 32 64-bit lanes |
| Cheap memory | 2 GiB DAG with random dependent access (forces GDDR/HBM) |
| No cache | 16 KiB per-thread scratchpad with R-M-W (forces L1-equivalent) |
| One datapath | Mix of 64-bit int ALU, MULHI, AES round, IEEE-754 FP32 FMA |
| Throughput parallel | Latency-bound dependent chains limit pipelining |

Long-form analysis with comparisons to Ethash, ProgPoW, RandomX, Cuckaroo, and X16R is in docs/DESIGN_RATIONALE.md.
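To illustrate two of the mechanisms in the summary, here is a toy C sketch. It is not the GPUx algorithm. It assumes the epoch index is the block height divided by 1024. It models "random dependent access" as a loop in which each DAG index is derived from the previously loaded word, so the next load cannot be issued until the current one returns. The mixing function, the DAG size, and all names are illustrative assumptions. The authoritative definition is in ALGORITHM_SPEC.md and spec/gpux.c.

```c
#include <stddef.h>
#include <stdint.h>

#define EPOCH_LENGTH_BLOCKS 1024u  /* program regenerated every 1024 blocks */
#define OPS_PER_ITER        256u
#define ITERS_PER_NONCE     64u    /* 256 * 64 = 16 384 ops per nonce */

/* Assumption: epochs are consecutive 1024-block windows starting at height 0. */
static uint64_t epoch_index(uint64_t block_height) {
    return block_height / EPOCH_LENGTH_BLOCKS;
}

/* Placeholder mixer (the splitmix64 finalizer). GPUx runs its own per-epoch
 * random program here; this stand-in only makes the sketch self-contained. */
static uint64_t toy_mix(uint64_t x) {
    x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27; x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}

/* Dependent chain: every index depends on the word loaded in the previous step,
 * so the loads are serialized and cannot be pipelined. */
static uint64_t toy_dependent_walk(const uint64_t *dag, size_t dag_words,
                                   uint64_t seed, unsigned steps) {
    uint64_t state = seed;
    for (unsigned i = 0; i < steps; i++) {
        size_t idx = (size_t)(state % dag_words); /* address known only now */
        state = toy_mix(state ^ dag[idx]);        /* load feeds the next address */
    }
    return state;
}
```

In this toy model the cost per step is dominated by memory latency rather than arithmetic, which is the property the "Throughput parallel" row targets: adding more ALUs does not shorten a chain whose next address is unknown until the previous load completes.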


Repo layout

gpux/
├── ALGORITHM_SPEC.md          formal algorithm spec
├── COMMUNITY_TESTING.md       how to run tests and submit results
├── README.md                  this file
├── Makefile                   builds reference + tests (Linux/WSL/macOS)
├── spec/                      reference C implementation
│   ├── gpux.h / gpux.c        algorithm reference (the source of truth)
│   ├── blake2b.c+h            embedded BLAKE2b reference
│   ├── chacha20.c+h           embedded ChaCha20 reference
│   ├── aes_round.c+h          embedded AES single-round reference
│   └── test_vectors.h         frozen KAT (regenerate with `make gen-kat`)
├── tests/
│   ├── smoke.c                primitive correctness (BLAKE2b, ChaCha20, AES, KAT generators)
│   ├── kat.c                  full hash KAT (allocates 2 GiB)
│   └── gen_kat.c              regenerate test_vectors.h
├── cuda/                      CUDA implementation
│   ├── gpux_kernel.cu         the mining kernel
│   ├── gpux_device.cuh        device-side BLAKE2b/ChaCha20/AES
│   ├── gpux_miner.cu          host driver: verify, bench, info
│   ├── Makefile               Linux/WSL build
│   └── build.bat              Windows build (vcvars + nvcc)
├── bench/                     community testing
│   ├── run_bench.ps1          Windows harness
│   ├── run_bench.sh           Linux harness
│   └── results/               per-GPU JSON results (created on first run)
└── docs/
    └── DESIGN_RATIONALE.md    why each design choice; ASIC-resistance argument

Building

Linux / WSL / macOS (reference + tests)

make smoke # primitive tests, no DAG
make kat # full KAT (allocates 2 GiB)

Linux / WSL (CUDA)

cd cuda && make
./gpux_miner verify
./gpux_miner bench 30

Windows (CUDA)

Requires Visual Studio 2022 BuildTools + CUDA 13.x.

cd cuda
.\build.bat
.\gpux_miner.exe verify
.\gpux_miner.exe bench 30

Or use the testing wrapper:

.\bench\run_bench.ps1 -Seconds 60

Tari integration (proposed)

Tari's existing block header is hashed with BLAKE2b-256 to produce a 32-byte digest. To use GPUx as a PoW algorithm:

header_digest = BLAKE2b-256(serialized_block_header_excluding_nonce)
block_hash = GPUx(header_digest, nonce)

Difficulty target and Tari's multi-algo selection layer integrate at the consensus boundary. See ALGORITHM_SPEC.md §11.
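The consensus-side acceptance check that would follow those two lines can be sketched as below. This is a hedged illustration only. It assumes the 32-byte GPUx output is interpreted as a big-endian 256-bit integer and accepted when it is less than or equal to a 32-byte target. The byte order, the comparison direction, and the name `gpux_meets_target` are assumptions, not taken from ALGORITHM_SPEC.md §11.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define GPUX_HASH_BYTES 32

/* Hypothetical helper: returns true if block_hash <= target, with both values
 * read as big-endian 256-bit integers. memcmp compares bytes as unsigned char
 * starting from the first byte, which matches numeric order for equal-length
 * big-endian integers. */
static bool gpux_meets_target(const uint8_t block_hash[GPUX_HASH_BYTES],
                              const uint8_t target[GPUX_HASH_BYTES]) {
    return memcmp(block_hash, target, GPUX_HASH_BYTES) <= 0;
}
```

A miner would call this once per candidate nonce on the GPUx output, and a verifier would call it once on the recomputed hash. If the spec defines a different byte order or a difficulty-to-target mapping, only the comparison inside this helper would need to change.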


v0.1 status

  • Spec frozen for testing
  • Reference C impl, deterministic
  • KAT (1 epoch_seed, 5 nonces) with bit-exact reference output
  • CUDA impl matches reference
  • Baseline RTX 5090 hashrate (1.25 MH/s)
  • Cross-vendor FP32 determinism audit (NVIDIA Ada/Hopper/Blackwell vs AMD RDNA3/RDNA4 vs Intel)
  • Light-verifier Merkle DAG witness
  • Tari multi-algo selection integration
  • Optimized CUDA kernel (warp-coop DAG, shmem scratchpad)
  • OpenCL implementation for AMD/Intel

License

MIT — see LICENSE. Bundled reference primitives (BLAKE2b, ChaCha20, AES round, Argon2id) are public-domain or CC0/Apache-2.0 and remain so under MIT. The intent is full open-source auditability — fork it, break it, propose changes via PR, run your own bench results and submit them as JSON files in bench/results/.
