Optimizing primes generating program in C++

Question 1

I am writing a program to generate primes in range 1 upto 10^9 but it seems to be timing out.

I have tried optimizing my code using bitsets and pragmas.

Please are there any other ways i can optimize my code. Code below

#include<bitset>
#include<vector>
#include<iostream>
#include<algorithm>
#pragma GCC target ("avx2")
#pragma GCC optimization ("O3")
#pragma GCC optimization ("unroll-loops")
using namespace std;
#define int long long
const int LIM = 1e9;
bitset<100000000> sieve;
void compute(int limit) 
{ 
 if (limit > 2) 
 cout<<"2 ";
 if (limit > 3) 
 cout<<"3 ";
 for (int x = 1; x * x < limit; x++) { 
 for (int y = 1; y * y < limit; y++) { 
 int n = (4 * x * x) + (y * y); 
 if (n <= limit && (n % 12 == 1 || n % 12 == 5)) 
 sieve.flip(n); 
 n = (3 * x * x) + (y * y); 
 if (n <= limit && n % 12 == 7) 
 sieve.flip(n);
 n = (3 * x * x) - (y * y); 
 if (x > y && n <= limit && n % 12 == 11) 
 sieve.flip(n);
 } 
 } 
 for (int r = 5; r * r < limit; r++) { 
 if (sieve.test(r)) { 
 for (int i = r * r; i < limit; i += r * r) 
 sieve.reset(i);
 } 
 } 
 for(int i=sieve._Find_first();i< sieve.size();i = sieve._Find_next(i))
 {
 if(i==0||i==1)continue;
 cout<<i<<" ";
 }
 cout<<endl;
}
signed main()
{
 compute(LIM);
 return 0; 
}

Thanks in advance.

Question 2

C++ code never times out unless you explicitly ask it to timeout. Are you saying that it takes longer than allowed?

Question 3

You may want to comment your algorithm (or link to the description of the algorithm). Without context it would be hard to provide any advice.

Question 4

i don't have time for a full review. One tiny remark that may surprise you: using a bitset is often slow. That's not always true; a well optimised program built around a bitwise data structure is often faster than the similarly optimised one built around a less densely packed one. But if you are just looping over positions in a bitset and reading/writing them one at a time, you're liable to find it slows you down. At core, that's because it takes more work to address single bits, and you get a traffic jam effect when you're repeatedly reading and writing from the same addressable bytes.

Question 5

Although not incorrect, signed main() is much less idiomatic than int main().

Question 6

@MartinYork I am using the sieve of atkin

Question 7

I'm not sure what your optimizations are doing(comments would be nice), but I did notice several inefficiencies:

Never include a calculation in the limit test in a for loop. It get's re-calculated on every iteration. In this case set the limit to sqrt(limit). the calculation is kind of expensive but it's done only once.

The same goes for the iteration step, calculate it once before you enter the loop.

In the tests, in the first set of loops, you use 3*x*x twice and y*y 3 times. putting these in variables would be a lot cheaper.

Question 8

Any decent compiler should be able to hoist the 3*x*x and y*y rather than generating code that calculates them multiple times.

Question 9

@tinstaafl can i just save the sqrt in a variable to reuse it later

Question 10

Yes, sorry if I didn't make that clear. Put into a variable would mean it's calculated once.

Question 11

@TobySpeight I find it an interesting question whether a developer should program taking into account the possible optimizations of a compiler, and to which extend.

Question 12

#define int long long

Not only is redefining int a bad idea, but using long long should be reserved for legacy code. Use <cstdint> and (given the ranges used in this code) std::uint_fast32_t.

 if (limit > 2) 
 cout<<"2 ";
 if (limit > 3) 
 cout<<"3 ";
 ... 21 lines ...
 for(int i=sieve._Find_first();i< sieve.size();i = sieve._Find_next(i))
 {
 if(i==0||i==1)continue;
 cout<<i<<" ";
 }

Why not put the special cases (and the output in general) together at the end? And would it not be more transparent to do something like this?

 // Special cases: numbers smaller than the first prime or used in the wheel.
 sieve.reset(0);
 sieve.reset(1);
 sieve.set(2);
 sieve.set(3);
 sieve.set(5);

 for (int x = 1; x * x < limit; x++) {
 for (int y = 1; y * y < limit; y++) {
 int n = (4 * x * x) + (y * y);
 if (n <= limit && (n % 12 == 1 || n % 12 == 5))
 sieve.flip(n);
 n = (3 * x * x) + (y * y);
 if (n <= limit && n % 12 == 7)
 sieve.flip(n);
 n = (3 * x * x) - (y * y);
 if (x > y && n <= limit && n % 12 == 11)
 sieve.flip(n);
 }
 }

This looks rather inefficient. The first step in speed optimisation is to split out the cases and for each one iterate only over the (x,y) pairs which yield the relevant residues. E.g. for \4ドルx^2 + y^2 \in \{1, 5\} \pmod{12}\$ we have \$y \in \{1,5,7,11\} \pmod{12}\$ with any \$x\$; or \$y \equiv 3 \pmod{6}\$, \$x \not\equiv 0 \pmod{3}\$.

Not only does this save a whole lot of n % 12 == k checks, but it allows you to break out of each loop independently when the quadratic form hits limit.

Moreover, if you structure it carefully you can handle each residue with a separate bitset, which allows parallelisation.

The second speed optimisation is to look at the binary quadratic forms (BQFs) used. If you split the residues modulo 60 instead of modulo 12 then I find that there's a theoretical improvement of about 6% by using the following BQFs:

$$\begin{array}{} n \bmod 60 & \textrm{BQF} \\ 1, 19, 31, 49 & 15x^2 + y^2 \\ 7, 43 & 3x^2 + y^2 \\ 11, 59 & 3x^2 - y^2 \\ 13, 29, 37, 41 & 4x^2 + y^2 \\ 17, 23, 47, 53 & 5x^2 + 3y^2 \\ \end{array}$$

Question 13

Ha! Ha! I'm not sure that someone who write #define int ... is ready for parallelization... :-)

Question 14

@AlexisWilke, CR may be an edge case, but it's still part of the SE network, and answers can be valuable to people other than the OP.

Question 15

#define int ... is not only a bad idea, it is undefined behavior.

Question 16

Here are a few things you could try...

Including those mentioned by @tinstaafl

I think that you can avoid a lot of the computations. I had a brief look at the Wikipedia page, it seems that you are otherwise properly following the algorithm.

#include<bitset>
#include<vector>
#include<iostream>
#include<algorithm>
#pragma GCC target ("avx2")
#pragma GCC optimization ("O3")
// this rarely helps--if anything it makes the code much bigger and you lose
// on the Instruction Cache!
//#pragma GCC optimization ("unroll-loops")
// As a long time programmer, I never ever do that (I delete such statement when
// I see them!) It's never safe it ignore namespaces.
using namespace std;
// Never do that; redefine a basic type using a macro?! Big no no!
// Also with any modern C/C++, you want to use typedef instead.
#define int long long
const int LIM = 1e9;
bitset<100000000> sieve;
void compute(int limit) 
{ 
 // to really be complete, you should verify the max. size
 // since you want to be able to access `limit` you must make sure
 // that this offset is also valid
 if(limit + 1 >= sieve.size())
 throw out_of_range("limit too large");
 // Do you really need these optimizations? Do they buy you anything?
 // It looks like your loops below start at 1 anyway, so you should
 // decide on one or the other anyway!
 if (limit > 2) 
 cout<<"2 ";
 if (limit > 3) 
 cout<<"3 ";
 // as mentioned by @tinstaafl replace (x * x < limit) with (x < sqrt(limit))
 // but "cache" that limit like so:
 // also something to make a habit of is to use pre-increment, here it makes
 // no difference since all your variables are numbers, but when dealing with
 // iterators (for example) it's a different story!
 //for (int x = 1; x * x < limit; x++) { 
 int const sqrt_limit(sqrt(limit));
 for (int x = 1; x < sqrt_limit; ++x) { 
 // see note in inner loop
 x_square = x * x;
 // apply same here
 //for (int y = 1; y * y < limit; y++) { 
 for (int y = 1; y < sqrt_limit; ++y) { 
 // you calculate x * x and y * y multiple times
 // it often helps to have intermediate variables
 // (especially if you have intermediate calls which
 // could prevent the compiler from optimizing)
 // also X can be calculated outside the loop
 y_square = y * y;
 //int n = (4 * x * x) + (y * y); 
 int n = 4 * x_square + y_square; 
 if (n <= limit && (n % 12 == 1 || n % 12 == 5)) 
 sieve.flip(n); 
 // this one seems to be used twice, either it's wrong or
 // at least you can optimize by calculating it once only
 //n = (3 * x * x) + (y * y); 
 n = 3 * x_square + y_square; 
 {
 // if it is wrong and both are not the same, keep as you've done
 if (n <= limit && n % 12 == 7) 
 sieve.flip(n);
 // so? same or needs fixing?
 n = (3 * x * x) - (y * y); // <- fix this...
 if (x > y && n <= limit && n % 12 == 11) 
 sieve.flip(n);
 }
 // OR
 {
 // they are the same, then you can simplify on 'n'
 // this optimization is always available, it's up to you to
 // know which of the 3 equations give you the smallest number
 //
 // This is most certainly what you are looking for...
 if(n > limit)
 break;
 // remember that the % involves a division
 // and again the compile may or may not be able to optimize...
 // here I do it manually to be sure
 int const n_mod12 = n % 12;
 if (n_mod12 == 7)
 sieve.flip(n);
 if (x > y && n_mod12 == 11) 
 sieve.flip(n);
 }
 } 
 } 
 // same sqrt() optimization
 //for (int r = 5; r * r < limit; ++r) { 
 for (int r = 5; r < sqrl_limit; ++r) { 
 if (sieve.test(r)) { 
 // same optimization as above, pre-calculate square
 int r_square(r * r);
 //for (int i = r * r; i < limit; i += r * r) 
 for (int i = r_square; i < limit; i += r_square) 
 sieve.reset(i);
 } 
 } 
 // if you can eliminate an if() from inside your loops, do it
 //for(int i=sieve._Find_first();i< sieve.size();i = sieve._Find_next(i))
 for(int i = sieve._Find_next(1); i < sieve.size(); i = sieve._Find_next(i))
 {
 //if(i==0||i==1)continue; -- not necessary, we won't get 0 and 1
 cout<<i<<" ";
 }
 cout<<endl;
}
// as mentioned by Toby Speight, you should use `int`, but of course...
// since you did that `#define int long long` ...
signed main()
{
 compute(LIM);
 return 0; 
}

The HUGE thing that C/C++ let you do but you should never ever do is:

#define int long long

That's going to get you fired from any sensible software company...

The next one is these two lines I added:

 if(n > limit)
 break;

That will probably be enough to get your algorithm to go fast. Right now you continue testing even when n is always going to be larger than limit.

One final comment, it is really rare in C/C++ that you would use an inclusive limit. There is of course nothing against that, but it makes the entry test (which I added) look really weird for an old C/C++ programmer. I understand that the Wikipedia page used an inclusive limit and thus you decided to use that...

user33306user33306 · Answer 1 · 2019-07-12 03:12:40Z

3

\$\begingroup\$

I'm not sure what your optimizations are doing(comments would be nice), but I did notice several inefficiencies:

Never include a calculation in the limit test in a for loop. It get's re-calculated on every iteration. In this case set the limit to sqrt(limit). the calculation is kind of expensive but it's done only once.

The same goes for the iteration step, calculate it once before you enter the loop.

In the tests, in the first set of loops, you use 3*x*x twice and y*y 3 times. putting these in variables would be a lot cheaper.

Share

answered Jul 12, 2019 at 3:12

user33306user33306

\$\endgroup\$

4

\$\begingroup\$ Any decent compiler should be able to hoist the 3*x*x and y*y rather than generating code that calculates them multiple times. \$\endgroup\$

Toby Speight
– Toby Speight

2019年07月12日 08:31:58 +00:00
Commented Jul 12, 2019 at 8:31
\$\begingroup\$ @tinstaafl can i just save the sqrt in a variable to reuse it later \$\endgroup\$

Mr. Wilson
– Mr. Wilson

2019年07月12日 09:29:53 +00:00
Commented Jul 12, 2019 at 9:29
\$\begingroup\$ Yes, sorry if I didn't make that clear. Put into a variable would mean it's calculated once. \$\endgroup\$

user33306
– user33306

2019年07月12日 09:43:18 +00:00
Commented Jul 12, 2019 at 9:43
\$\begingroup\$ @TobySpeight I find it an interesting question whether a developer should program taking into account the possible optimizations of a compiler, and to which extend. \$\endgroup\$

dfhwze
– dfhwze

2019年07月13日 08:09:45 +00:00
Commented Jul 13, 2019 at 8:09

Add a comment |

Peter Taylor Peter Taylor 24.4k1 gold badge49 silver badges94 bronze badges · Answer 2 · 2019-07-13 08:05:33Z

#define int long long

Not only is redefining int a bad idea, but using long long should be reserved for legacy code. Use <cstdint> and (given the ranges used in this code) std::uint_fast32_t.

 if (limit > 2) 
 cout<<"2 ";
 if (limit > 3) 
 cout<<"3 ";
 ... 21 lines ...
 for(int i=sieve._Find_first();i< sieve.size();i = sieve._Find_next(i))
 {
 if(i==0||i==1)continue;
 cout<<i<<" ";
 }

Why not put the special cases (and the output in general) together at the end? And would it not be more transparent to do something like this?

 // Special cases: numbers smaller than the first prime or used in the wheel.
 sieve.reset(0);
 sieve.reset(1);
 sieve.set(2);
 sieve.set(3);
 sieve.set(5);

 for (int x = 1; x * x < limit; x++) {
 for (int y = 1; y * y < limit; y++) {
 int n = (4 * x * x) + (y * y);
 if (n <= limit && (n % 12 == 1 || n % 12 == 5))
 sieve.flip(n);
 n = (3 * x * x) + (y * y);
 if (n <= limit && n % 12 == 7)
 sieve.flip(n);
 n = (3 * x * x) - (y * y);
 if (x > y && n <= limit && n % 12 == 11)
 sieve.flip(n);
 }
 }

This looks rather inefficient. The first step in speed optimisation is to split out the cases and for each one iterate only over the (x,y) pairs which yield the relevant residues. E.g. for \4ドルx^2 + y^2 \in \{1, 5\} \pmod{12}\$ we have \$y \in \{1,5,7,11\} \pmod{12}\$ with any \$x\$; or \$y \equiv 3 \pmod{6}\$, \$x \not\equiv 0 \pmod{3}\$.

Not only does this save a whole lot of n % 12 == k checks, but it allows you to break out of each loop independently when the quadratic form hits limit.

Moreover, if you structure it carefully you can handle each residue with a separate bitset, which allows parallelisation.

The second speed optimisation is to look at the binary quadratic forms (BQFs) used. If you split the residues modulo 60 instead of modulo 12 then I find that there's a theoretical improvement of about 6% by using the following BQFs:

$$\begin{array}{} n \bmod 60 & \textrm{BQF} \\ 1, 19, 31, 49 & 15x^2 + y^2 \\ 7, 43 & 3x^2 + y^2 \\ 11, 59 & 3x^2 - y^2 \\ 13, 29, 37, 41 & 4x^2 + y^2 \\ 17, 23, 47, 53 & 5x^2 + 3y^2 \\ \end{array}$$

Ha! Ha! I'm not sure that someone who write #define int ... is ready for parallelization... :-)
@AlexisWilke, CR may be an edge case, but it's still part of the SE network, and answers can be valuable to people other than the OP.
#define int ... is not only a bad idea, it is undefined behavior.

Alexis Wilke Alexis Wilke 7844 silver badges9 bronze badges · Answer 3 · 2019-07-13 04:43:24Z

Here are a few things you could try...

Including those mentioned by @tinstaafl

I think that you can avoid a lot of the computations. I had a brief look at the Wikipedia page, it seems that you are otherwise properly following the algorithm.

#include<bitset>
#include<vector>
#include<iostream>
#include<algorithm>
#pragma GCC target ("avx2")
#pragma GCC optimization ("O3")
// this rarely helps--if anything it makes the code much bigger and you lose
// on the Instruction Cache!
//#pragma GCC optimization ("unroll-loops")
// As a long time programmer, I never ever do that (I delete such statement when
// I see them!) It's never safe it ignore namespaces.
using namespace std;
// Never do that; redefine a basic type using a macro?! Big no no!
// Also with any modern C/C++, you want to use typedef instead.
#define int long long
const int LIM = 1e9;
bitset<100000000> sieve;
void compute(int limit) 
{ 
 // to really be complete, you should verify the max. size
 // since you want to be able to access `limit` you must make sure
 // that this offset is also valid
 if(limit + 1 >= sieve.size())
 throw out_of_range("limit too large");
 // Do you really need these optimizations? Do they buy you anything?
 // It looks like your loops below start at 1 anyway, so you should
 // decide on one or the other anyway!
 if (limit > 2) 
 cout<<"2 ";
 if (limit > 3) 
 cout<<"3 ";
 // as mentioned by @tinstaafl replace (x * x < limit) with (x < sqrt(limit))
 // but "cache" that limit like so:
 // also something to make a habit of is to use pre-increment, here it makes
 // no difference since all your variables are numbers, but when dealing with
 // iterators (for example) it's a different story!
 //for (int x = 1; x * x < limit; x++) { 
 int const sqrt_limit(sqrt(limit));
 for (int x = 1; x < sqrt_limit; ++x) { 
 // see note in inner loop
 x_square = x * x;
 // apply same here
 //for (int y = 1; y * y < limit; y++) { 
 for (int y = 1; y < sqrt_limit; ++y) { 
 // you calculate x * x and y * y multiple times
 // it often helps to have intermediate variables
 // (especially if you have intermediate calls which
 // could prevent the compiler from optimizing)
 // also X can be calculated outside the loop
 y_square = y * y;
 //int n = (4 * x * x) + (y * y); 
 int n = 4 * x_square + y_square; 
 if (n <= limit && (n % 12 == 1 || n % 12 == 5)) 
 sieve.flip(n); 
 // this one seems to be used twice, either it's wrong or
 // at least you can optimize by calculating it once only
 //n = (3 * x * x) + (y * y); 
 n = 3 * x_square + y_square; 
 {
 // if it is wrong and both are not the same, keep as you've done
 if (n <= limit && n % 12 == 7) 
 sieve.flip(n);
 // so? same or needs fixing?
 n = (3 * x * x) - (y * y); // <- fix this...
 if (x > y && n <= limit && n % 12 == 11) 
 sieve.flip(n);
 }
 // OR
 {
 // they are the same, then you can simplify on 'n'
 // this optimization is always available, it's up to you to
 // know which of the 3 equations give you the smallest number
 //
 // This is most certainly what you are looking for...
 if(n > limit)
 break;
 // remember that the % involves a division
 // and again the compile may or may not be able to optimize...
 // here I do it manually to be sure
 int const n_mod12 = n % 12;
 if (n_mod12 == 7)
 sieve.flip(n);
 if (x > y && n_mod12 == 11) 
 sieve.flip(n);
 }
 } 
 } 
 // same sqrt() optimization
 //for (int r = 5; r * r < limit; ++r) { 
 for (int r = 5; r < sqrl_limit; ++r) { 
 if (sieve.test(r)) { 
 // same optimization as above, pre-calculate square
 int r_square(r * r);
 //for (int i = r * r; i < limit; i += r * r) 
 for (int i = r_square; i < limit; i += r_square) 
 sieve.reset(i);
 } 
 } 
 // if you can eliminate an if() from inside your loops, do it
 //for(int i=sieve._Find_first();i< sieve.size();i = sieve._Find_next(i))
 for(int i = sieve._Find_next(1); i < sieve.size(); i = sieve._Find_next(i))
 {
 //if(i==0||i==1)continue; -- not necessary, we won't get 0 and 1
 cout<<i<<" ";
 }
 cout<<endl;
}
// as mentioned by Toby Speight, you should use `int`, but of course...
// since you did that `#define int long long` ...
signed main()
{
 compute(LIM);
 return 0; 
}

The HUGE thing that C/C++ let you do but you should never ever do is:

#define int long long

That's going to get you fired from any sensible software company...

The next one is these two lines I added:

 if(n > limit)
 break;

That will probably be enough to get your algorithm to go fast. Right now you continue testing even when n is always going to be larger than limit.

One final comment, it is really rare in C/C++ that you would use an inclusive limit. There is of course nothing against that, but it makes the entry test (which I added) look really weird for an old C/C++ programmer. I understand that the Wikipedia page used an inclusive limit and thus you decided to use that...

Stack Exchange Network

Optimizing primes generating program in C++

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Optimizing primes generating program in C++

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions