
Implement Fast CIOS for Montgomery modular multiplication#1009

Open
clementjuventin wants to merge 2 commits into ipsilon:master from clementjuventin:fast-montgomery-cios

Conversation

clementjuventin commented Sep 17, 2024 (edited)

This PR implements the improvement to the CIOS variant of the Montgomery multiplication algorithm described in "EdMSM: Multi-Scalar-Multiplication for SNARKs and Faster Montgomery multiplication" and proposed in #869.
The optimization applies when the most significant word of the modulus p is smaller than half the word base (for 64-bit words, when its top bit is unset); p is held in the variable mod in the current implementation.
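To make the condition concrete, here is a plain-Python sketch of the textbook CIOS Montgomery multiplication loop (after Koç et al.). This is not evmone's code, just an illustration of where the extra carry word lives; the EdMSM result is that when the top word of p is below half the word base, that carry is always zero and the work maintaining it can be dropped.

```python
W = 64          # word size in bits
B = 1 << W      # word base

def cios_mul(a, b, p, n0inv, nwords):
    """Return a*b*R^-1 mod p, where R = 2**(W*nwords) and
    n0inv = -p^-1 mod B. Textbook CIOS, for illustration only."""
    aw = [(a >> (W * k)) & (B - 1) for k in range(nwords)]
    bw = [(b >> (W * k)) & (B - 1) for k in range(nwords)]
    pw = [(p >> (W * k)) & (B - 1) for k in range(nwords)]
    t = [0] * (nwords + 2)
    for i in range(nwords):
        # Multiplication step: t += a * bw[i].
        c = 0
        for j in range(nwords):
            c, t[j] = divmod(t[j] + aw[j] * bw[i] + c, B)
        c, t[nwords] = divmod(t[nwords] + c, B)
        # Fast CIOS: when the most significant word of p is small enough,
        # this extra carry word is always zero and can be eliminated.
        t[nwords + 1] = c
        # Reduction step: zero t[0], then shift everything down one word.
        m = (t[0] * n0inv) % B
        c = (t[0] + m * pw[0]) >> W
        for j in range(1, nwords):
            c, t[j - 1] = divmod(t[j] + m * pw[j] + c, B)
        c, t[nwords - 1] = divmod(t[nwords] + c, B)
        t[nwords] = t[nwords + 1] + c
    r = sum(t[k] << (W * k) for k in range(nwords + 1))
    return r - p if r >= p else r
```

For example, with the bn254 base-field prime p, nwords = 4 and n0inv = -p^-1 mod 2^64, `cios_mul(x, y, p, n0inv, 4)` equals x·y·2^-256 mod p.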

Here are the benchmarks performed after applying these changes on top of commit 01eca77.

Build: `cmake --build build --parallel`

Benchmarks: `taskset -c 0 ./build/bin/evmone-bench-internal --benchmark_filter=evmmax* --benchmark_repetitions=10 --benchmark_format=json --benchmark_out=cios_classic.json`

Comparison:

```
user@user:~/sandbox/evmone$ python3 ../benchmark/tools/compare.py benchmarks cios_classic.json cios_improved.json
Comparing cios_classic.json to cios_improved.json
Benchmark                              Time     CPU  Time Old  Time New  CPU Old  CPU New
------------------------------------------------------------------------------------------
evmmax_mul<uint256, bn254>_pvalue      0.0002   0.0002   U Test, Repetitions: 10 vs 10
evmmax_mul<uint256, bn254>_mean       +7.8884  +7.8873        27       238       27      238
evmmax_mul<uint256, bn254>_median     +7.8595  +7.8584        27       237       27      237
evmmax_mul<uint256, bn254>_stddev    +17.5405 +17.5057         0         3        0        3
evmmax_mul<uint256, bn254>_cv         +1.0859  +1.0823         0         0        0        0
evmmax_mul<uint256, secp256k1>_pvalue  0.0002   0.0002   U Test, Repetitions: 10 vs 10
evmmax_mul<uint256, secp256k1>_mean   +7.6644  +7.6636        29       249       29      249
evmmax_mul<uint256, secp256k1>_median +7.7730  +7.7721        28       249       28      249
evmmax_mul<uint256, secp256k1>_stddev +10.6559 +10.6522        1         7        1        7
evmmax_mul<uint256, secp256k1>_cv     +0.3453  +0.3450         0         0        0        0
OVERALL_GEOMEAN                       +1.0661  +1.0658         0         0        0        0
```

As you can see, the results do not indicate a performance improvement.

clementjuventin (author) commented Sep 19, 2024 (edited)

After further investigation, I found another way of ordering branches (second commit) and obtained what we were looking for (~15% gain on evmmax_mul<uint256, bn254>)!

```
evmmax_mul<uint256, bn254>_median     -0.1529 -0.1529        27        23       27       23
evmmax_mul<uint256, secp256k1>_median +0.0059 +0.0058        28        28       28       28
```
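The curve-dependent outcome is consistent with the applicability condition from the PR description. Assuming the benchmarks use the curves' base-field primes, a quick check (a sketch, not evmone code) of each modulus's most significant 64-bit word shows that bn254 qualifies for fast CIOS while secp256k1 does not:

```python
# Fast-CIOS eligibility check for the two benchmarked moduli
# (the curves' base-field primes, assumed here as 4 x 64-bit words).
bn254_p = 21888242871839275222246405745257275088696311157297823662689037894645226208583
secp256k1_p = 2**256 - 2**32 - 977

def top_word(p, words=4, w=64):
    """Most significant w-bit word of a (words*w)-bit modulus."""
    return p >> (w * (words - 1))

print(top_word(bn254_p) < 2**63)       # bn254: top bit clear, fast path applies
print(top_word(secp256k1_p) < 2**63)   # secp256k1: top word is 2**64 - 1, it does not
```

So only the bn254 benchmark can be expected to improve, matching the medians above.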

I still don't fully understand why this implementation is faster, even assuming branch prediction makes the checks negligible.
I would also like to compare the generated assembly in case there is an important optimization under the hood, but I have never done that before, so let's see if I manage.
