dloghin commented Nov 28, 2024 (edited)

Description

This PR integrates the zeknox GPU-acceleration library into gnark. Specifically, it targets GPU (NVIDIA CUDA) acceleration of the groth16 backend over BN254. In addition, it adds a new example that proves and verifies a batch of secp256r1 (P256) signatures. Our benchmarking shows a 1.54-1.57x speedup for CPU+GPU execution (with zeknox) compared to the default CPU-only execution.

In summary, we made the following additions:

accelerated groth16 over BN254 with zeknox under backend/groth16/bn254/zeknox folder.
timing in backend/groth16/bn254/prove.go printed in debug mode.
a code example of proving/verifying a batch of secp256r1 (P256) signatures under examples/p256.
instructions in README.md on how to run gnark with zeknox.
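
As a rough illustration of the debug-mode timing listed above, a step can be wrapped in a small helper that measures wall-clock time and prints it only when debugging is enabled. This is a minimal sketch, not the code in backend/groth16/bn254/prove.go: the helper name `timed`, the `debug` flag, and the log format are all hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// timed runs step, measures its wall-clock duration, and prints a
// debug line only when debug is true. The name and the output format
// are hypothetical and chosen for illustration only.
func timed(name string, debug bool, step func()) time.Duration {
	start := time.Now()
	step()
	elapsed := time.Since(start)
	if debug {
		fmt.Printf("DBG %s done took=%s\n", name, elapsed)
	}
	return elapsed
}

func main() {
	// A short sleep stands in for an expensive prover step such as an MSM.
	timed("example step", true, func() { time.Sleep(10 * time.Millisecond) })
}
```

Returning the duration as well as printing it lets a caller aggregate timings over several runs without parsing log output.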

Type of change

New feature (non-breaking change which adds functionality)
This change requires a documentation update

How has this been tested?

We wrote new tests under backend/groth16/bn254/zeknox and examples/p256. In addition, we ran the existing tests under backend/groth16/bn254.

Test A: backend/groth16/bn254

cd backend/groth16/bn254
go test
go test -tags zeknox

Test B: backend/groth16/bn254/zeknox

cd backend/groth16/bn254/zeknox
go test
go test -tags zeknox

Test C: examples/p256

cd examples/p256
go test
go test -tags zeknox

Test D: examples/mimc

cd examples/mimc
go test
go test -tags zeknox

How has this been benchmarked?

Benchmark A

We ran the P256 example to prove/verify a batch of 10 secp256r1 keys. The steps to run:

cd examples
go build -tags zeknox
./examples

Platform A: a Google Cloud Platform g2-standard-32 instance with 32 vCPUs (Intel Xeon), one NVIDIA L4 GPU, and 128 GB RAM.
Platform B: an x86-64 AMD Ryzen 9 5950X CPU with 16 cores (32 threads), one NVIDIA RTX 4080 GPU, and 128 GB RAM.

Results

The times below represent the proving time (in milliseconds) for 10 secp256r1 keys.

Platform	CPU-only	CPU+GPU (zeknox)	Speedup
Platform A	5840.96 ms	3792.48 ms	1.54 X
Platform B	4066.95 ms	2588.51 ms	1.57 X
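
The speedup column can be reproduced from the two timing columns as the ratio of CPU-only time to CPU+GPU time. The short sketch below does that arithmetic with the numbers from the table; the function name `speedup` is only illustrative.

```go
package main

import "fmt"

// speedup returns how many times faster the accelerated run is than
// the baseline, given both proving times in the same unit.
func speedup(baselineMs, acceleratedMs float64) float64 {
	return baselineMs / acceleratedMs
}

func main() {
	// Proving times (ms) copied from the results table.
	fmt.Printf("Platform A: %.2fx\n", speedup(5840.96, 3792.48)) // Platform A: 1.54x
	fmt.Printf("Platform B: %.2fx\n", speedup(4066.95, 2588.51)) // Platform B: 1.57x
}
```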

Checklist:

I have performed a self-review of my code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have added tests that prove my fix is effective or that my feature works
I did not modify files generated from templates
golangci-lint does not output errors locally
New and existing unit tests pass locally with my changes
Any dependent changes have been merged and published in downstream modules

doutv and others added 30 commits

October 10, 2024 12:00


 log msm g1 g2 time, and add comment

e0a9c5a


 log computeH time

48e8cc7


 init zeknox GPU acceleration

0a93689


 MSM G1 & G2 acclerating! with local cuda repo

e566588


 sequencial GPU MSM & refactor

ffd2781


 mimc test gpu acceleration

e26498a


 fix verify bug, delete channel

e6eb42e


 add p256 example for testing

31479f7


 parallel but verify fail in most cases

c481b0a


 fix msm cfg & add input check

4cb00d6


 fix parallel GPU proving, use errgroup

4d43c3e


 small fix

6895eb1


 add doc

f52ad4c


 set msm LargeBucketFactor config

18e871d


 update msmg1, msmg1 return affine

88237e2


 fix cuda int

26ba8db


 generate witness in every prove

d99fc59


 delete unused deviceInfo

6a13e9b


 deviceInfo each points store ArePointsInMont

59bccd9


 update cuda library, verify GPU proof success!

fd2ace4


 refactor msm, 1 msm func for both G1 and G2

a2694e0


 parallel msm, sometimes verify fail


 parallel + copy point every time

8c49c6c


 serial GPU msm, always success

8d84de2


 small improvement in zeknox prover

554277a


 improve p256 example

48475b8


 update zeknox to v1.0.0

48d719d


 init zeknox GPU acceleration

2e0ef67


 MSM G1 & G2 acclerating! with local cuda repo

2da1853


 sequencial GPU MSM & refactor

70d36b2

dloghin and others added 4 commits

November 28, 2024 12:04


 update readme

8fccf0c


 Merge branch 'Consensys:master' into master

28c52db


 merged with upstream master

aaa21ac


 clean source code

b1dc00f

@dloghin dloghin marked this pull request as ready for review

November 29, 2024 03:42

Contributor

doutv commented Dec 9, 2024

@ivokub need your help and review!

Collaborator

ivokub commented Dec 9, 2024

On it. Would it be possible to allow adding commits directly to the branch for easier review?

@ivokub ivokub self-requested a review

December 9, 2024 23:57

@ivokub ivokub added type: new feature type: consolidate labels

Dec 9, 2024

@ivokub ivokub added this to the v0.12.0 milestone

Dec 9, 2024

Contributor

doutv commented Dec 10, 2024

Sure, I've granted you push permission via https://github.com/okx/gnark/invitations

Let me delete those examples to keep the PR clean

doutv added 3 commits

December 10, 2024 18:41


 revert examples

b7da2ae


 delete comment

24c51c0


 add zeknox sha3 example

Collaborator

ivokub commented Dec 13, 2024

I'm not able to create a proof for now. In the debug logs, the last action I see is:

14:05:05 DBG Bs.MultiExp done MSMG2 5 took=0.86421 acceleration=zeknox backend=groth16 curve=bn254 nbConstraints=6

I guess there is probably a deadlock somewhere. Have you been able to run the prover end to end?

Author

dloghin commented Dec 16, 2024

Hi Ivo,

May I check: if you use the precompiled zeknox libraries, does your GPU have compute capability 8.6 or 8.9? (only these two are supported by our precompiled libraries).

On our systems, the end-to-end example (go run -tags=zeknox examples/zeknox/main.go) is working.

Collaborator

ivokub commented Dec 16, 2024

I'm using an AWS g4dn.xlarge instance, which according to the documentation has a T4 GPU, and that seems to be compute capability 7.5.

Should it work if I compile the libraries myself? I started compiling them, but it took quite a bit of time and I didn't let it finish. When I benchmarked previously, g4dn offered quite a good balance between performance and $-per-proof cost.

Contributor

doutv commented Dec 16, 2024

Yeah, compiling it yourself should work. Compiling BN254 MSM G2 takes ~5 minutes on our device, so expect a long compile time.
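
To summarise the exchange above: the precompiled zeknox libraries are said to cover compute capabilities 8.6 and 8.9 only, so a GPU reporting anything else (such as 7.5) needs a source build. A minimal sketch of that decision follows; the names `prebuiltCapabilities` and `needsSourceBuild` are hypothetical and not part of zeknox or gnark.

```go
package main

import "fmt"

// prebuiltCapabilities lists the CUDA compute capabilities that the
// precompiled zeknox libraries support according to this thread.
var prebuiltCapabilities = map[string]bool{"8.6": true, "8.9": true}

// needsSourceBuild reports whether a GPU with the given compute
// capability falls outside the precompiled set and therefore needs
// the libraries to be compiled locally. Hypothetical helper.
func needsSourceBuild(computeCap string) bool {
	return !prebuiltCapabilities[computeCap]
}

func main() {
	for _, c := range []string{"7.5", "8.6", "8.9"} {
		fmt.Printf("compute capability %s: build from source = %v\n", c, needsSourceBuild(c))
	}
}
```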

Contributor

doutv commented Dec 16, 2024

Use this script
https://github.com/okx/zeknox/blob/main/native/build-release-msm-bn254.sh

Collaborator

ivokub commented Dec 16, 2024

Indeed I got it working and the speedup is similar to the one claimed in the PR (1.6x). I also had to build libblst.

But now there seems to be an issue with the proof. I get an invalid proof:

panic: points in the proof are not in the correct subgroup

I could try looking into it, but it would probably take a bit of time to compare the computed values against CPU execution. Would it be possible to try it out with another GPU and see if you hit the same problem?

Contributor

doutv commented Dec 17, 2024 (edited)

This is an edge case. We found this bug, tried many methods to fix it, but it still happens...
I will look into it.

dloghin added 4 commits

January 9, 2025 15:52


 add warmup and multiple runs flag

ea139bc


 merged gnark master into this branch


 bump to zeknox v1.0.1

50a3d42


 fix GPU invalid proof using workaround

d7b7c95

Author

dloghin commented Feb 20, 2025

Hi @ivokub, my latest commit fixes (temporarily) the issue with invalid proof. We observe that this issue appears in multi-GPU environments with relatively low frequency but we did not find the reason. If the proof is invalid, we recompute only the invalid points on CPU. We still observe 25-50% speedup even when this issue appears. Please review. Thank you.
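
The workaround described above (validate each GPU result and recompute only the invalid ones on CPU) can be sketched generically as follows. This is an illustration under stated assumptions, not the code in the PR: `withCPUFallback` is a hypothetical helper, and the integer "points" and the `valid` predicate are toy stand-ins for proof elements and a subgroup check.

```go
package main

import (
	"errors"
	"fmt"
)

// withCPUFallback runs the accelerated computation first and accepts
// its result only if it passes the validity check; otherwise it
// recomputes on the CPU path. The boolean reports whether the
// fallback was taken. Hypothetical helper for illustration.
func withCPUFallback[T any](gpu, cpu func() (T, error), valid func(T) bool) (T, bool, error) {
	if res, err := gpu(); err == nil && valid(res) {
		return res, false, nil
	}
	var zero T
	res, err := cpu()
	if err != nil {
		return zero, true, err
	}
	if !valid(res) {
		return zero, true, errors.New("CPU result failed the validity check")
	}
	return res, true, nil
}

func main() {
	// Toy stand-ins: an "invalid point" is modelled as a negative integer.
	gpu := func() (int, error) { return -1, nil } // simulates a corrupted GPU result
	cpu := func() (int, error) { return 42, nil } // simulates the reference CPU path
	valid := func(p int) bool { return p >= 0 }   // stands in for a subgroup check

	res, usedCPU, err := withCPUFallback(gpu, cpu, valid)
	fmt.Println(res, usedCPU, err) // 42 true <nil>
}
```

Applying the helper to each proof element separately means a single bad GPU result costs one CPU recomputation rather than a full CPU proof, which is consistent with a speedup remaining when the issue appears.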