Skip to main content
Code Review

Return to Answer

added 165 characters in body
Source Link
qwr
  • 1.2k
  • 1
  • 8
  • 26

Here I will keep a running list of improvements I've made since posting.

  • Clean up Alg 2 to remove continue statements and early return (these should have no effect on the asm generated, but look a little cleaner)
  • Optimize Fenwick tree to not call constructor every time in PhiBlock (no performance improvement but better to be explicit about it) AND use linear time instead of O(n log n) construction (maybe 10% speedup actually)
  • Fix overflow issues, detected with GCC -ftrapv, at X = 1e15 caused by cube. I can either rewrite to avoid overflow or use GCC's __int128. I wentended up with the latterformer even though __int128 is exact integer math and probably faster (128-bit result can be computed in a single mul instruction).
  • Fiddle with reducing integer sizes, mainly PhiBlock using 32-bit ints since the block size is less than 32-bit, to save memory and fit into cache.
  • Refactor some functions to make more use of input arguments and return values instead of directly modifying struct members. Move Phi2 info back into Primecount
  • Add defaults for block size (WIP2^20) and alpha (scales with log^3 x) for my machine
  • Fix constants to work with small inputs

I don't have any grand optimization ideas, but cursory usage of perf suggests idiv in the main calculation of y is an expensive instruction (discussion in comments). By far the most time is spent in S2_iter, but it is impossible to get rid of the y divide because it's used to index into phi_block. Maybe ceil_div can be optimized but none run in the hot loop I believe.

Faster programs like Kim Walisch's primecount save more space and iterations in the phi block by skipping more multiples of primes like 3 and 5. But each comes with quickly diminishing returns. By only using odd values as suggested in the paper, I use only half the memory/iterations while not complicating the logic much. I found this sped up by roughly 1.5x. Skipping multiples of 3 improves from 1/2 to 2/6 the size (only using 1, 5 mod 6) but makes the indexing more fiddly. Skipping multiples of 5 uses phi(30)/30 = 8/30 the size, and so on. This is worth it if I really want to optimize, but at that point I would also start to incorporate all the further improvements from Gourdon and DB Staple.

The paper also mentions dynamically sizing intervals and potentially replacing the Fenwick tree with a linear array for smaller y. Apparently the tree is terrible for cache locality. I don't plan on implementing these.

Here I will keep a running list of improvements I've made since posting.

  • Clean up Alg 2 to remove continue statements and early return (these should have no effect on the asm generated, but look a little cleaner)
  • Optimize Fenwick tree to not call constructor every time in PhiBlock (no performance improvement but better to be explicit about it) AND use linear time instead of O(n log n) construction (maybe 10% speedup actually)
  • Fix overflow issues, detected with GCC -ftrapv, at X = 1e15 caused by cube. I can either rewrite to avoid overflow or use GCC's __int128. I went with the latter even though __int128 is exact integer math and probably faster (128-bit result can be computed in a single mul instruction).
  • Fiddle with reducing integer sizes, mainly PhiBlock using 32-bit ints since the block size is less than 32-bit, to save memory and fit into cache.
  • Refactor some functions to make more use of input arguments and return values instead of directly modifying struct members (WIP)

I don't have any grand optimization ideas, but cursory usage of perf suggests idiv in the main calculation of y is an expensive instruction (discussion in comments). By far the most time is spent in S2_iter, but it is impossible to get rid of the y divide because it's used to index into phi_block. Maybe ceil_div can be optimized but none run in the hot loop I believe.

Faster programs like Kim Walisch's primecount save more space and iterations in the phi block by skipping more multiples of primes like 3 and 5. But each comes with quickly diminishing returns. By only using odd values as suggested in the paper, I use only half the memory/iterations while not complicating the logic much. I found this sped up by roughly 1.5x. Skipping multiples of 3 improves from 1/2 to 2/6 the size (only using 1, 5 mod 6) but makes the indexing more fiddly. Skipping multiples of 5 uses phi(30)/30 = 8/30 the size, and so on. This is worth it if I really want to optimize, but at that point I would also start to incorporate all the further improvements from Gourdon and DB Staple.

Here I will keep a running list of improvements I've made since posting.

  • Clean up Alg 2 to remove continue statements and early return (these should have no effect on the asm generated, but look a little cleaner)
  • Optimize Fenwick tree to not call constructor every time in PhiBlock (no performance improvement but better to be explicit about it) AND use linear time instead of O(n log n) construction (maybe 10% speedup actually)
  • Fix overflow issues, detected with GCC -ftrapv, at X = 1e15 caused by cube. I can either rewrite to avoid overflow or use GCC's __int128. I ended up with the former even though __int128 is exact integer math and probably faster (128-bit result can be computed in a single mul instruction).
  • Fiddle with reducing integer sizes, mainly PhiBlock using 32-bit ints since the block size is less than 32-bit, to save memory and fit into cache.
  • Refactor some functions to make more use of input arguments and return values instead of directly modifying struct members. Move Phi2 info back into Primecount
  • Add defaults for block size (2^20) and alpha (scales with log^3 x) for my machine
  • Fix constants to work with small inputs

I don't have any grand optimization ideas, but cursory usage of perf suggests idiv in the main calculation of y is an expensive instruction (discussion in comments). By far the most time is spent in S2_iter, but it is impossible to get rid of the y divide because it's used to index into phi_block. Maybe ceil_div can be optimized but none run in the hot loop I believe.

Faster programs like Kim Walisch's primecount save more space and iterations in the phi block by skipping more multiples of primes like 3 and 5. But each comes with quickly diminishing returns. By only using odd values as suggested in the paper, I use only half the memory/iterations while not complicating the logic much. I found this sped up by roughly 1.5x. Skipping multiples of 3 improves from 1/2 to 2/6 the size (only using 1, 5 mod 6) but makes the indexing more fiddly. Skipping multiples of 5 uses phi(30)/30 = 8/30 the size, and so on. This is worth it if I really want to optimize, but at that point I would also start to incorporate all the further improvements from Gourdon and DB Staple.

The paper also mentions dynamically sizing intervals and potentially replacing the Fenwick tree with a linear array for smaller y. Apparently the tree is terrible for cache locality. I don't plan on implementing these.

added 132 characters in body
Source Link
qwr
  • 1.2k
  • 1
  • 8
  • 26

Here I will keep a running list of improvements I've made since posting.

  • Clean up Alg 2 to remove continue statements and early return (these should have no effect on the asm generated, but look a little cleaner)
  • Optimize Fenwick tree to not call constructor every time in PhiBlock (no performance improvement but better to be explicit about it) AND use linear time instead of O(n log n) construction (maybe 10% speedup actually)
  • Fix overflow issues, detected with GCC -ftrapv, at X = 1e15 caused by cube. I can either rewrite to avoid overflow or use GCC's __int128. I went with the latter even though __int128 is exact integer math and probably faster (128-bit result can be computed in a single mul instruction).
  • Fiddle with reducing integer sizes, mainly PhiBlock using 32-bit ints since the block size is less than 32-bit, to save memory and fit into cache.
  • Refactor some functions to make more use of input arguments and return values instead of directly modifying struct members (WIP)

I don't have any grand optimization ideas, but cursory usage of perf suggests idiv in the main calculation of y is an expensive instruction (discussion in comments). By far the most time is spent in S2_iter, but it is impossible to get rid of the y divide because it's used to index into phi_block. Maybe ceil_div can be optimized but none run in the hot loop I believe.

Faster programs like Kim Walisch's primecount save more space and iterations in the phi block by skipping more multiples of primes like 3 and 5. But each comes with quickly diminishing returns. By only using odd values as suggested in the paper, I use only half the memory/iterations while not complicating the logic much. I found this sped up by roughly 1.5x. Skipping multiples of 3 improves from 1/2 to 2/6 the size (only using 1, 5 mod 6) but makes the indexing more fiddly. Skipping multiples of 5 uses phi(30)/30 = 8/30 the size, and so on. This is worth it if I really want to optimize, but at that point I would also start to incorporate all the further improvements from Gourdon and DB Staple.

Here I will keep a running list of improvements I've made since posting.

  • Clean up Alg 2 to remove continue statements and early return (these should have no effect on the asm generated, but look a little cleaner)
  • Optimize Fenwick tree to not call constructor every time in PhiBlock (no performance improvement but better to be explicit about it) AND use linear time instead of O(n log n) construction (maybe 10% speedup actually)
  • Fix overflow issues, detected with GCC -ftrapv, at X = 1e15 caused by cube. I can either rewrite to avoid overflow or use GCC's __int128. I went with the latter even though __int128 is exact integer math and probably faster (128-bit result can be computed in a single mul instruction).
  • Fiddle with reducing integer sizes, mainly PhiBlock using 32-bit ints since the block size is less than 32-bit, to save memory and fit into cache.

I don't have any grand optimization ideas, but cursory usage of perf suggests idiv in the main calculation of y is an expensive instruction (discussion in comments). By far the most time is spent in S2_iter, but it is impossible to get rid of the y divide because it's used to index into phi_block. Maybe ceil_div can be optimized but none run in the hot loop I believe.

Faster programs like Kim Walisch's primecount save more space and iterations in the phi block by skipping more multiples of primes like 3 and 5. But each comes with quickly diminishing returns. By only using odd values as suggested in the paper, I use only half the memory/iterations while not complicating the logic much. I found this sped up by roughly 1.5x. Skipping multiples of 3 improves from 1/2 to 2/6 the size (only using 1, 5 mod 6) but makes the indexing more fiddly. Skipping multiples of 5 uses phi(30)/30 = 8/30 the size, and so on. This is worth it if I really want to optimize, but at that point I would also start to incorporate all the further improvements from Gourdon and DB Staple.

Here I will keep a running list of improvements I've made since posting.

  • Clean up Alg 2 to remove continue statements and early return (these should have no effect on the asm generated, but look a little cleaner)
  • Optimize Fenwick tree to not call constructor every time in PhiBlock (no performance improvement but better to be explicit about it) AND use linear time instead of O(n log n) construction (maybe 10% speedup actually)
  • Fix overflow issues, detected with GCC -ftrapv, at X = 1e15 caused by cube. I can either rewrite to avoid overflow or use GCC's __int128. I went with the latter even though __int128 is exact integer math and probably faster (128-bit result can be computed in a single mul instruction).
  • Fiddle with reducing integer sizes, mainly PhiBlock using 32-bit ints since the block size is less than 32-bit, to save memory and fit into cache.
  • Refactor some functions to make more use of input arguments and return values instead of directly modifying struct members (WIP)

I don't have any grand optimization ideas, but cursory usage of perf suggests idiv in the main calculation of y is an expensive instruction (discussion in comments). By far the most time is spent in S2_iter, but it is impossible to get rid of the y divide because it's used to index into phi_block. Maybe ceil_div can be optimized but none run in the hot loop I believe.

Faster programs like Kim Walisch's primecount save more space and iterations in the phi block by skipping more multiples of primes like 3 and 5. But each comes with quickly diminishing returns. By only using odd values as suggested in the paper, I use only half the memory/iterations while not complicating the logic much. I found this sped up by roughly 1.5x. Skipping multiples of 3 improves from 1/2 to 2/6 the size (only using 1, 5 mod 6) but makes the indexing more fiddly. Skipping multiples of 5 uses phi(30)/30 = 8/30 the size, and so on. This is worth it if I really want to optimize, but at that point I would also start to incorporate all the further improvements from Gourdon and DB Staple.

added 154 characters in body
Source Link
qwr
  • 1.2k
  • 1
  • 8
  • 26

Here I will keep a running list of improvements I've made since posting.

  • Clean up Alg 2 to remove continue statements and early return (these should have no effect on the asm generated, but look a little cleaner)
  • Optimize Fenwick tree to not call constructor every time in PhiBlock (no performance improvement but better to be explicit about it) AND use linear time instead of O(n log n) construction (maybe 10% speedup actually)
  • Fix overflow issues, detected with GCC -ftrapv, at X = 1e15 caused by using __int128cube. This should be good for any feasible input of X, asI can either rewrite to avoid overflow or use GCC's __int128. I went with the largest value used currentlylatter even though __int128 is exact integer math and probably faster (128-bit result can be computed in a single cube(IACBRTX+1)mul, which exceeds (signed instruction) INT64_MAX ≈ 9e18. Now
  • Fiddle with reducing integer sizes, any value would havemainly PhiBlock using 32-bit ints since the block size is less than 32-bit, to exceed INT128_MAX ≈ 1.7e38save memory and fit into cache.

I don't have any grand optimization ideas, but cursory usage of perf suggests idiv in the main calculation of y is an expensive instruction (discussion in comments). By far the most time is spent in S2_iter, but it is impossible to get rid of the y divide because it's used to index into phi_block. Maybe ceil_div can be optimized but none run in the hot loop I believe.

Faster programs like Kim Walisch's primecount save more space and iterations in the phi block by skipping more multiples of primes like 3, and 5, 7. But each comes with quickly diminishing returns. By only using odd values as suggested in the paper, I use only half the memory/iterations while not complicating the logic much. I found this sped up by roughly 1.5x. Skipping multiples of 3 improves from 1/2 to 2/6 the size (only using 1, 5 mod 6) but makes the indexing more fiddly. Skipping multiples of 5 uses phi(30)/30 = 8/30 the size, and so on. This is worth it if I really want to optimize, but at that point I would also start to incorporate all the further improvements from Gourdon and DB Staple.

Here I will keep a running list of improvements I've made since posting.

  • Clean up Alg 2 to remove continue statements and early return (these should have no effect on the asm generated, but look a little cleaner)
  • Optimize Fenwick tree to not call constructor every time in PhiBlock (no performance improvement but better to be explicit about it) AND use linear time instead of O(n log n) construction (maybe 10% speedup actually)
  • Fix overflow issues, detected with GCC -ftrapv, at X = 1e15 by using __int128. This should be good for any feasible input of X, as the largest value used currently is cube(IACBRTX+1), which exceeds (signed) INT64_MAX ≈ 9e18. Now, any value would have to exceed INT128_MAX ≈ 1.7e38.

I don't have any grand optimization ideas, but cursory usage of perf suggests idiv in the main calculation of y is an expensive instruction (discussion in comments). By far the most time is spent in S2_iter, but it is impossible to get rid of the y divide because it's used to index into phi_block. Maybe ceil_div can be optimized but none run in the hot loop I believe.

Faster programs like Kim Walisch's primecount save more space and iterations in the phi block by skipping more multiples of primes like 3, 5, 7. But each comes with quickly diminishing returns. By only using odd values as suggested in the paper, I use only half the memory/iterations while not complicating the logic much. I found this sped up by roughly 1.5x. Skipping multiples of 3 improves from 1/2 to 2/6 the size (only using 1, 5 mod 6) but makes the indexing more fiddly. Skipping multiples of 5 uses phi(30)/30 = 8/30 the size, and so on. This is worth it if I really want to optimize, but at that point I would also start to incorporate all the further improvements from Gourdon and DB Staple.

Here I will keep a running list of improvements I've made since posting.

  • Clean up Alg 2 to remove continue statements and early return (these should have no effect on the asm generated, but look a little cleaner)
  • Optimize Fenwick tree to not call constructor every time in PhiBlock (no performance improvement but better to be explicit about it) AND use linear time instead of O(n log n) construction (maybe 10% speedup actually)
  • Fix overflow issues, detected with GCC -ftrapv, at X = 1e15 caused by cube. I can either rewrite to avoid overflow or use GCC's __int128. I went with the latter even though __int128 is exact integer math and probably faster (128-bit result can be computed in a single mul instruction).
  • Fiddle with reducing integer sizes, mainly PhiBlock using 32-bit ints since the block size is less than 32-bit, to save memory and fit into cache.

I don't have any grand optimization ideas, but cursory usage of perf suggests idiv in the main calculation of y is an expensive instruction (discussion in comments). By far the most time is spent in S2_iter, but it is impossible to get rid of the y divide because it's used to index into phi_block. Maybe ceil_div can be optimized but none run in the hot loop I believe.

Faster programs like Kim Walisch's primecount save more space and iterations in the phi block by skipping more multiples of primes like 3 and 5. But each comes with quickly diminishing returns. By only using odd values as suggested in the paper, I use only half the memory/iterations while not complicating the logic much. I found this sped up by roughly 1.5x. Skipping multiples of 3 improves from 1/2 to 2/6 the size (only using 1, 5 mod 6) but makes the indexing more fiddly. Skipping multiples of 5 uses phi(30)/30 = 8/30 the size, and so on. This is worth it if I really want to optimize, but at that point I would also start to incorporate all the further improvements from Gourdon and DB Staple.

added 41 characters in body
Source Link
qwr
  • 1.2k
  • 1
  • 8
  • 26
Loading
added 41 characters in body
Source Link
qwr
  • 1.2k
  • 1
  • 8
  • 26
Loading
added 110 characters in body
Source Link
qwr
  • 1.2k
  • 1
  • 8
  • 26
Loading
added 155 characters in body
Source Link
qwr
  • 1.2k
  • 1
  • 8
  • 26
Loading
added 551 characters in body
Source Link
qwr
  • 1.2k
  • 1
  • 8
  • 26
Loading
Source Link
qwr
  • 1.2k
  • 1
  • 8
  • 26
Loading
lang-cpp

AltStyle によって変換されたページ (->オリジナル) /