Here I will keep a running list of improvements I've made since posting.
- Clean up Alg 2 to remove continue statements and early return (these should have no effect on the asm generated, but look a little cleaner)
- Optimize Fenwick tree to not call constructor every time in PhiBlock (no performance improvement but better to be explicit about it) AND use linear time instead of O(n log n) construction (maybe 10% speedup actually)
- Fix overflow issues, detected with GCC
-ftrapv
, at X = 1e15 caused bycube
. I can either rewrite to avoid overflow or use GCC's __int128. I wentended up with the latterformer even though __int128 is exact integer math and probably faster (128-bit result can be computed in a singlemul
instruction). - Fiddle with reducing integer sizes, mainly PhiBlock using 32-bit ints since the block size is less than 32-bit, to save memory and fit into cache.
- Refactor some functions to make more use of input arguments and return values instead of directly modifying struct members. Move Phi2 info back into Primecount
- Add defaults for block size (WIP2^20) and alpha (scales with log^3 x) for my machine
- Fix constants to work with small inputs
I don't have any grand optimization ideas, but cursory usage of perf suggests idiv
in the main calculation of y
is an expensive instruction (discussion in comments). By far the most time is spent in S2_iter
, but it is impossible to get rid of the y
divide because it's used to index into phi_block
. Maybe ceil_div
can be optimized but none run in the hot loop I believe.
Faster programs like Kim Walisch's primecount save more space and iterations in the phi block by skipping more multiples of primes like 3 and 5. But each comes with quickly diminishing returns. By only using odd values as suggested in the paper, I use only half the memory/iterations while not complicating the logic much. I found this sped up by roughly 1.5x. Skipping multiples of 3 improves from 1/2 to 2/6 the size (only using 1, 5 mod 6) but makes the indexing more fiddly. Skipping multiples of 5 uses phi(30)/30 = 8/30 the size, and so on. This is worth it if I really want to optimize, but at that point I would also start to incorporate all the further improvements from Gourdon and DB Staple.
The paper also mentions dynamically sizing intervals and potentially replacing the Fenwick tree with a linear array for smaller y. Apparently the tree is terrible for cache locality. I don't plan on implementing these.
Here I will keep a running list of improvements I've made since posting.
- Clean up Alg 2 to remove continue statements and early return (these should have no effect on the asm generated, but look a little cleaner)
- Optimize Fenwick tree to not call constructor every time in PhiBlock (no performance improvement but better to be explicit about it) AND use linear time instead of O(n log n) construction (maybe 10% speedup actually)
- Fix overflow issues, detected with GCC
-ftrapv
, at X = 1e15 caused bycube
. I can either rewrite to avoid overflow or use GCC's __int128. I went with the latter even though __int128 is exact integer math and probably faster (128-bit result can be computed in a singlemul
instruction). - Fiddle with reducing integer sizes, mainly PhiBlock using 32-bit ints since the block size is less than 32-bit, to save memory and fit into cache.
- Refactor some functions to make more use of input arguments and return values instead of directly modifying struct members (WIP)
I don't have any grand optimization ideas, but cursory usage of perf suggests idiv
in the main calculation of y
is an expensive instruction (discussion in comments). By far the most time is spent in S2_iter
, but it is impossible to get rid of the y
divide because it's used to index into phi_block
. Maybe ceil_div
can be optimized but none run in the hot loop I believe.
Faster programs like Kim Walisch's primecount save more space and iterations in the phi block by skipping more multiples of primes like 3 and 5. But each comes with quickly diminishing returns. By only using odd values as suggested in the paper, I use only half the memory/iterations while not complicating the logic much. I found this sped up by roughly 1.5x. Skipping multiples of 3 improves from 1/2 to 2/6 the size (only using 1, 5 mod 6) but makes the indexing more fiddly. Skipping multiples of 5 uses phi(30)/30 = 8/30 the size, and so on. This is worth it if I really want to optimize, but at that point I would also start to incorporate all the further improvements from Gourdon and DB Staple.
Here I will keep a running list of improvements I've made since posting.
- Clean up Alg 2 to remove continue statements and early return (these should have no effect on the asm generated, but look a little cleaner)
- Optimize Fenwick tree to not call constructor every time in PhiBlock (no performance improvement but better to be explicit about it) AND use linear time instead of O(n log n) construction (maybe 10% speedup actually)
- Fix overflow issues, detected with GCC
-ftrapv
, at X = 1e15 caused bycube
. I can either rewrite to avoid overflow or use GCC's __int128. I ended up with the former even though __int128 is exact integer math and probably faster (128-bit result can be computed in a singlemul
instruction). - Fiddle with reducing integer sizes, mainly PhiBlock using 32-bit ints since the block size is less than 32-bit, to save memory and fit into cache.
- Refactor some functions to make more use of input arguments and return values instead of directly modifying struct members. Move Phi2 info back into Primecount
- Add defaults for block size (2^20) and alpha (scales with log^3 x) for my machine
- Fix constants to work with small inputs
I don't have any grand optimization ideas, but cursory usage of perf suggests idiv
in the main calculation of y
is an expensive instruction (discussion in comments). By far the most time is spent in S2_iter
, but it is impossible to get rid of the y
divide because it's used to index into phi_block
. Maybe ceil_div
can be optimized but none run in the hot loop I believe.
Faster programs like Kim Walisch's primecount save more space and iterations in the phi block by skipping more multiples of primes like 3 and 5. But each comes with quickly diminishing returns. By only using odd values as suggested in the paper, I use only half the memory/iterations while not complicating the logic much. I found this sped up by roughly 1.5x. Skipping multiples of 3 improves from 1/2 to 2/6 the size (only using 1, 5 mod 6) but makes the indexing more fiddly. Skipping multiples of 5 uses phi(30)/30 = 8/30 the size, and so on. This is worth it if I really want to optimize, but at that point I would also start to incorporate all the further improvements from Gourdon and DB Staple.
The paper also mentions dynamically sizing intervals and potentially replacing the Fenwick tree with a linear array for smaller y. Apparently the tree is terrible for cache locality. I don't plan on implementing these.
Here I will keep a running list of improvements I've made since posting.
- Clean up Alg 2 to remove continue statements and early return (these should have no effect on the asm generated, but look a little cleaner)
- Optimize Fenwick tree to not call constructor every time in PhiBlock (no performance improvement but better to be explicit about it) AND use linear time instead of O(n log n) construction (maybe 10% speedup actually)
- Fix overflow issues, detected with GCC
-ftrapv
, at X = 1e15 caused bycube
. I can either rewrite to avoid overflow or use GCC's __int128. I went with the latter even though __int128 is exact integer math and probably faster (128-bit result can be computed in a singlemul
instruction). - Fiddle with reducing integer sizes, mainly PhiBlock using 32-bit ints since the block size is less than 32-bit, to save memory and fit into cache.
- Refactor some functions to make more use of input arguments and return values instead of directly modifying struct members (WIP)
I don't have any grand optimization ideas, but cursory usage of perf suggests idiv
in the main calculation of y
is an expensive instruction (discussion in comments). By far the most time is spent in S2_iter
, but it is impossible to get rid of the y
divide because it's used to index into phi_block
. Maybe ceil_div
can be optimized but none run in the hot loop I believe.
Faster programs like Kim Walisch's primecount save more space and iterations in the phi block by skipping more multiples of primes like 3 and 5. But each comes with quickly diminishing returns. By only using odd values as suggested in the paper, I use only half the memory/iterations while not complicating the logic much. I found this sped up by roughly 1.5x. Skipping multiples of 3 improves from 1/2 to 2/6 the size (only using 1, 5 mod 6) but makes the indexing more fiddly. Skipping multiples of 5 uses phi(30)/30 = 8/30 the size, and so on. This is worth it if I really want to optimize, but at that point I would also start to incorporate all the further improvements from Gourdon and DB Staple.
Here I will keep a running list of improvements I've made since posting.
- Clean up Alg 2 to remove continue statements and early return (these should have no effect on the asm generated, but look a little cleaner)
- Optimize Fenwick tree to not call constructor every time in PhiBlock (no performance improvement but better to be explicit about it) AND use linear time instead of O(n log n) construction (maybe 10% speedup actually)
- Fix overflow issues, detected with GCC
-ftrapv
, at X = 1e15 caused bycube
. I can either rewrite to avoid overflow or use GCC's __int128. I went with the latter even though __int128 is exact integer math and probably faster (128-bit result can be computed in a singlemul
instruction). - Fiddle with reducing integer sizes, mainly PhiBlock using 32-bit ints since the block size is less than 32-bit, to save memory and fit into cache.
I don't have any grand optimization ideas, but cursory usage of perf suggests idiv
in the main calculation of y
is an expensive instruction (discussion in comments). By far the most time is spent in S2_iter
, but it is impossible to get rid of the y
divide because it's used to index into phi_block
. Maybe ceil_div
can be optimized but none run in the hot loop I believe.
Faster programs like Kim Walisch's primecount save more space and iterations in the phi block by skipping more multiples of primes like 3 and 5. But each comes with quickly diminishing returns. By only using odd values as suggested in the paper, I use only half the memory/iterations while not complicating the logic much. I found this sped up by roughly 1.5x. Skipping multiples of 3 improves from 1/2 to 2/6 the size (only using 1, 5 mod 6) but makes the indexing more fiddly. Skipping multiples of 5 uses phi(30)/30 = 8/30 the size, and so on. This is worth it if I really want to optimize, but at that point I would also start to incorporate all the further improvements from Gourdon and DB Staple.
Here I will keep a running list of improvements I've made since posting.
- Clean up Alg 2 to remove continue statements and early return (these should have no effect on the asm generated, but look a little cleaner)
- Optimize Fenwick tree to not call constructor every time in PhiBlock (no performance improvement but better to be explicit about it) AND use linear time instead of O(n log n) construction (maybe 10% speedup actually)
- Fix overflow issues, detected with GCC
-ftrapv
, at X = 1e15 caused bycube
. I can either rewrite to avoid overflow or use GCC's __int128. I went with the latter even though __int128 is exact integer math and probably faster (128-bit result can be computed in a singlemul
instruction). - Fiddle with reducing integer sizes, mainly PhiBlock using 32-bit ints since the block size is less than 32-bit, to save memory and fit into cache.
- Refactor some functions to make more use of input arguments and return values instead of directly modifying struct members (WIP)
I don't have any grand optimization ideas, but cursory usage of perf suggests idiv
in the main calculation of y
is an expensive instruction (discussion in comments). By far the most time is spent in S2_iter
, but it is impossible to get rid of the y
divide because it's used to index into phi_block
. Maybe ceil_div
can be optimized but none run in the hot loop I believe.
Faster programs like Kim Walisch's primecount save more space and iterations in the phi block by skipping more multiples of primes like 3 and 5. But each comes with quickly diminishing returns. By only using odd values as suggested in the paper, I use only half the memory/iterations while not complicating the logic much. I found this sped up by roughly 1.5x. Skipping multiples of 3 improves from 1/2 to 2/6 the size (only using 1, 5 mod 6) but makes the indexing more fiddly. Skipping multiples of 5 uses phi(30)/30 = 8/30 the size, and so on. This is worth it if I really want to optimize, but at that point I would also start to incorporate all the further improvements from Gourdon and DB Staple.
Here I will keep a running list of improvements I've made since posting.
- Clean up Alg 2 to remove continue statements and early return (these should have no effect on the asm generated, but look a little cleaner)
- Optimize Fenwick tree to not call constructor every time in PhiBlock (no performance improvement but better to be explicit about it) AND use linear time instead of O(n log n) construction (maybe 10% speedup actually)
- Fix overflow issues, detected with GCC
-ftrapv
, at X = 1e15 caused by using__int128cube
. This should be good for any feasible input of X, asI can either rewrite to avoid overflow or use GCC's __int128. I went with the largest value used currentlylatter even though __int128 is exact integer math and probably faster (128-bit result can be computed in a singlecube(IACBRTX+1)mul
, which exceeds (signed instruction) INT64_MAX ≈ 9e18. Now - Fiddle with reducing integer sizes, any value would havemainly PhiBlock using 32-bit ints since the block size is less than 32-bit, to exceed INT128_MAX ≈ 1.7e38save memory and fit into cache.
I don't have any grand optimization ideas, but cursory usage of perf suggests idiv
in the main calculation of y
is an expensive instruction (discussion in comments). By far the most time is spent in S2_iter
, but it is impossible to get rid of the y
divide because it's used to index into phi_block
. Maybe ceil_div
can be optimized but none run in the hot loop I believe.
Faster programs like Kim Walisch's primecount save more space and iterations in the phi block by skipping more multiples of primes like 3, and 5, 7. But each comes with quickly diminishing returns. By only using odd values as suggested in the paper, I use only half the memory/iterations while not complicating the logic much. I found this sped up by roughly 1.5x. Skipping multiples of 3 improves from 1/2 to 2/6 the size (only using 1, 5 mod 6) but makes the indexing more fiddly. Skipping multiples of 5 uses phi(30)/30 = 8/30 the size, and so on. This is worth it if I really want to optimize, but at that point I would also start to incorporate all the further improvements from Gourdon and DB Staple.
Here I will keep a running list of improvements I've made since posting.
- Clean up Alg 2 to remove continue statements and early return (these should have no effect on the asm generated, but look a little cleaner)
- Optimize Fenwick tree to not call constructor every time in PhiBlock (no performance improvement but better to be explicit about it) AND use linear time instead of O(n log n) construction (maybe 10% speedup actually)
- Fix overflow issues, detected with GCC
-ftrapv
, at X = 1e15 by using__int128
. This should be good for any feasible input of X, as the largest value used currently iscube(IACBRTX+1)
, which exceeds (signed) INT64_MAX ≈ 9e18. Now, any value would have to exceed INT128_MAX ≈ 1.7e38.
I don't have any grand optimization ideas, but cursory usage of perf suggests idiv
in the main calculation of y
is an expensive instruction (discussion in comments). By far the most time is spent in S2_iter
, but it is impossible to get rid of the y
divide because it's used to index into phi_block
. Maybe ceil_div
can be optimized but none run in the hot loop I believe.
Faster programs like Kim Walisch's primecount save more space and iterations in the phi block by skipping more multiples of primes like 3, 5, 7. But each comes with quickly diminishing returns. By only using odd values as suggested in the paper, I use only half the memory/iterations while not complicating the logic much. I found this sped up by roughly 1.5x. Skipping multiples of 3 improves from 1/2 to 2/6 the size (only using 1, 5 mod 6) but makes the indexing more fiddly. Skipping multiples of 5 uses phi(30)/30 = 8/30 the size, and so on. This is worth it if I really want to optimize, but at that point I would also start to incorporate all the further improvements from Gourdon and DB Staple.
Here I will keep a running list of improvements I've made since posting.
- Clean up Alg 2 to remove continue statements and early return (these should have no effect on the asm generated, but look a little cleaner)
- Optimize Fenwick tree to not call constructor every time in PhiBlock (no performance improvement but better to be explicit about it) AND use linear time instead of O(n log n) construction (maybe 10% speedup actually)
- Fix overflow issues, detected with GCC
-ftrapv
, at X = 1e15 caused bycube
. I can either rewrite to avoid overflow or use GCC's __int128. I went with the latter even though __int128 is exact integer math and probably faster (128-bit result can be computed in a singlemul
instruction). - Fiddle with reducing integer sizes, mainly PhiBlock using 32-bit ints since the block size is less than 32-bit, to save memory and fit into cache.
I don't have any grand optimization ideas, but cursory usage of perf suggests idiv
in the main calculation of y
is an expensive instruction (discussion in comments). By far the most time is spent in S2_iter
, but it is impossible to get rid of the y
divide because it's used to index into phi_block
. Maybe ceil_div
can be optimized but none run in the hot loop I believe.
Faster programs like Kim Walisch's primecount save more space and iterations in the phi block by skipping more multiples of primes like 3 and 5. But each comes with quickly diminishing returns. By only using odd values as suggested in the paper, I use only half the memory/iterations while not complicating the logic much. I found this sped up by roughly 1.5x. Skipping multiples of 3 improves from 1/2 to 2/6 the size (only using 1, 5 mod 6) but makes the indexing more fiddly. Skipping multiples of 5 uses phi(30)/30 = 8/30 the size, and so on. This is worth it if I really want to optimize, but at that point I would also start to incorporate all the further improvements from Gourdon and DB Staple.