I am writing a program to generate primes in range 1 upto 10^9 but it seems to be timing out.
It implements the Sieve of Atkin
I have tried optimizing my code using bitsets and pragmas.
Please are there any other ways i can optimize my code. Code below
#include<bitset>
#include<vector>
#include<iostream>
#include<algorithm>
#pragma GCC target ("avx2")
#pragma GCC optimization ("O3")
#pragma GCC optimization ("unroll-loops")
using namespace std;
#define int long long
const int LIM = 1e9;
bitset<100000000> sieve;
void compute(int limit)
{
if (limit > 2)
cout<<"2 ";
if (limit > 3)
cout<<"3 ";
for (int x = 1; x * x < limit; x++) {
for (int y = 1; y * y < limit; y++) {
int n = (4 * x * x) + (y * y);
if (n <= limit && (n % 12 == 1 || n % 12 == 5))
sieve.flip(n);
n = (3 * x * x) + (y * y);
if (n <= limit && n % 12 == 7)
sieve.flip(n);
n = (3 * x * x) - (y * y);
if (x > y && n <= limit && n % 12 == 11)
sieve.flip(n);
}
}
for (int r = 5; r * r < limit; r++) {
if (sieve.test(r)) {
for (int i = r * r; i < limit; i += r * r)
sieve.reset(i);
}
}
for(int i=sieve._Find_first();i< sieve.size();i = sieve._Find_next(i))
{
if(i==0||i==1)continue;
cout<<i<<" ";
}
cout<<endl;
}
signed main()
{
compute(LIM);
return 0;
}
Thanks in advance.
3 Answers 3
I'm not sure what your optimizations are doing(comments would be nice), but I did notice several inefficiencies:
Never include a calculation in the limit test in a for
loop. It get's re-calculated on every iteration. In this case set the limit to sqrt(limit)
. the calculation is kind of expensive but it's done only once.
The same goes for the iteration step, calculate it once before you enter the loop.
In the tests, in the first set of loops, you use 3*x*x
twice and y*y
3 times. putting these in variables would be a lot cheaper.
-
\$\begingroup\$ Any decent compiler should be able to hoist the
3*x*x
andy*y
rather than generating code that calculates them multiple times. \$\endgroup\$Toby Speight– Toby Speight2019年07月12日 08:31:58 +00:00Commented Jul 12, 2019 at 8:31 -
\$\begingroup\$ @tinstaafl can i just save the sqrt in a variable to reuse it later \$\endgroup\$Mr. Wilson– Mr. Wilson2019年07月12日 09:29:53 +00:00Commented Jul 12, 2019 at 9:29
-
\$\begingroup\$ Yes, sorry if I didn't make that clear. Put into a variable would mean it's calculated once. \$\endgroup\$user33306– user333062019年07月12日 09:43:18 +00:00Commented Jul 12, 2019 at 9:43
-
\$\begingroup\$ @TobySpeight I find it an interesting question whether a developer should program taking into account the possible optimizations of a compiler, and to which extend. \$\endgroup\$dfhwze– dfhwze2019年07月13日 08:09:45 +00:00Commented Jul 13, 2019 at 8:09
#define int long long
Not only is redefining int
a bad idea, but using long long
should be reserved for legacy code. Use <cstdint>
and (given the ranges used in this code) std::uint_fast32_t
.
if (limit > 2) cout<<"2 "; if (limit > 3) cout<<"3 "; ... 21 lines ... for(int i=sieve._Find_first();i< sieve.size();i = sieve._Find_next(i)) { if(i==0||i==1)continue; cout<<i<<" "; }
Why not put the special cases (and the output in general) together at the end? And would it not be more transparent to do something like this?
// Special cases: numbers smaller than the first prime or used in the wheel.
sieve.reset(0);
sieve.reset(1);
sieve.set(2);
sieve.set(3);
sieve.set(5);
for (int x = 1; x * x < limit; x++) { for (int y = 1; y * y < limit; y++) { int n = (4 * x * x) + (y * y); if (n <= limit && (n % 12 == 1 || n % 12 == 5)) sieve.flip(n); n = (3 * x * x) + (y * y); if (n <= limit && n % 12 == 7) sieve.flip(n); n = (3 * x * x) - (y * y); if (x > y && n <= limit && n % 12 == 11) sieve.flip(n); } }
This looks rather inefficient. The first step in speed optimisation is to split out the cases and for each one iterate only over the (x,y)
pairs which yield the relevant residues. E.g. for \4ドルx^2 + y^2 \in \{1, 5\} \pmod{12}\$ we have \$y \in \{1,5,7,11\} \pmod{12}\$ with any \$x\$; or \$y \equiv 3 \pmod{6}\$, \$x \not\equiv 0 \pmod{3}\$.
Not only does this save a whole lot of n % 12 == k
checks, but it allows you to break out of each loop independently when the quadratic form hits limit
.
Moreover, if you structure it carefully you can handle each residue with a separate bitset
, which allows parallelisation.
The second speed optimisation is to look at the binary quadratic forms (BQFs) used. If you split the residues modulo 60 instead of modulo 12 then I find that there's a theoretical improvement of about 6% by using the following BQFs:
$$\begin{array}{} n \bmod 60 & \textrm{BQF} \\ 1, 19, 31, 49 & 15x^2 + y^2 \\ 7, 43 & 3x^2 + y^2 \\ 11, 59 & 3x^2 - y^2 \\ 13, 29, 37, 41 & 4x^2 + y^2 \\ 17, 23, 47, 53 & 5x^2 + 3y^2 \\ \end{array}$$
-
\$\begingroup\$ Ha! Ha! I'm not sure that someone who write
#define int ...
is ready for parallelization... :-) \$\endgroup\$Alexis Wilke– Alexis Wilke2019年07月13日 10:15:19 +00:00Commented Jul 13, 2019 at 10:15 -
1\$\begingroup\$ @AlexisWilke, CR may be an edge case, but it's still part of the SE network, and answers can be valuable to people other than the OP. \$\endgroup\$Peter Taylor– Peter Taylor2019年07月13日 12:31:11 +00:00Commented Jul 13, 2019 at 12:31
-
\$\begingroup\$
#define int ...
is not only a bad idea, it is undefined behavior. \$\endgroup\$L. F.– L. F.2019年07月16日 00:22:28 +00:00Commented Jul 16, 2019 at 0:22
Here are a few things you could try...
Including those mentioned by @tinstaafl
I think that you can avoid a lot of the computations. I had a brief look at the Wikipedia page, it seems that you are otherwise properly following the algorithm.
#include<bitset>
#include<vector>
#include<iostream>
#include<algorithm>
#pragma GCC target ("avx2")
#pragma GCC optimization ("O3")
// this rarely helps--if anything it makes the code much bigger and you lose
// on the Instruction Cache!
//#pragma GCC optimization ("unroll-loops")
// As a long time programmer, I never ever do that (I delete such statement when
// I see them!) It's never safe it ignore namespaces.
using namespace std;
// Never do that; redefine a basic type using a macro?! Big no no!
// Also with any modern C/C++, you want to use typedef instead.
#define int long long
const int LIM = 1e9;
bitset<100000000> sieve;
void compute(int limit)
{
// to really be complete, you should verify the max. size
// since you want to be able to access `limit` you must make sure
// that this offset is also valid
if(limit + 1 >= sieve.size())
throw out_of_range("limit too large");
// Do you really need these optimizations? Do they buy you anything?
// It looks like your loops below start at 1 anyway, so you should
// decide on one or the other anyway!
if (limit > 2)
cout<<"2 ";
if (limit > 3)
cout<<"3 ";
// as mentioned by @tinstaafl replace (x * x < limit) with (x < sqrt(limit))
// but "cache" that limit like so:
// also something to make a habit of is to use pre-increment, here it makes
// no difference since all your variables are numbers, but when dealing with
// iterators (for example) it's a different story!
//for (int x = 1; x * x < limit; x++) {
int const sqrt_limit(sqrt(limit));
for (int x = 1; x < sqrt_limit; ++x) {
// see note in inner loop
x_square = x * x;
// apply same here
//for (int y = 1; y * y < limit; y++) {
for (int y = 1; y < sqrt_limit; ++y) {
// you calculate x * x and y * y multiple times
// it often helps to have intermediate variables
// (especially if you have intermediate calls which
// could prevent the compiler from optimizing)
// also X can be calculated outside the loop
y_square = y * y;
//int n = (4 * x * x) + (y * y);
int n = 4 * x_square + y_square;
if (n <= limit && (n % 12 == 1 || n % 12 == 5))
sieve.flip(n);
// this one seems to be used twice, either it's wrong or
// at least you can optimize by calculating it once only
//n = (3 * x * x) + (y * y);
n = 3 * x_square + y_square;
{
// if it is wrong and both are not the same, keep as you've done
if (n <= limit && n % 12 == 7)
sieve.flip(n);
// so? same or needs fixing?
n = (3 * x * x) - (y * y); // <- fix this...
if (x > y && n <= limit && n % 12 == 11)
sieve.flip(n);
}
// OR
{
// they are the same, then you can simplify on 'n'
// this optimization is always available, it's up to you to
// know which of the 3 equations give you the smallest number
//
// This is most certainly what you are looking for...
if(n > limit)
break;
// remember that the % involves a division
// and again the compile may or may not be able to optimize...
// here I do it manually to be sure
int const n_mod12 = n % 12;
if (n_mod12 == 7)
sieve.flip(n);
if (x > y && n_mod12 == 11)
sieve.flip(n);
}
}
}
// same sqrt() optimization
//for (int r = 5; r * r < limit; ++r) {
for (int r = 5; r < sqrl_limit; ++r) {
if (sieve.test(r)) {
// same optimization as above, pre-calculate square
int r_square(r * r);
//for (int i = r * r; i < limit; i += r * r)
for (int i = r_square; i < limit; i += r_square)
sieve.reset(i);
}
}
// if you can eliminate an if() from inside your loops, do it
//for(int i=sieve._Find_first();i< sieve.size();i = sieve._Find_next(i))
for(int i = sieve._Find_next(1); i < sieve.size(); i = sieve._Find_next(i))
{
//if(i==0||i==1)continue; -- not necessary, we won't get 0 and 1
cout<<i<<" ";
}
cout<<endl;
}
// as mentioned by Toby Speight, you should use `int`, but of course...
// since you did that `#define int long long` ...
signed main()
{
compute(LIM);
return 0;
}
The HUGE thing that C/C++ let you do but you should never ever do is:
#define int long long
That's going to get you fired from any sensible software company...
The next one is these two lines I added:
if(n > limit)
break;
That will probably be enough to get your algorithm to go fast. Right now you continue testing even when n
is always going to be larger than limit
.
One final comment, it is really rare in C/C++ that you would use an inclusive limit. There is of course nothing against that, but it makes the entry test (which I added) look really weird for an old C/C++ programmer. I understand that the Wikipedia page used an inclusive limit and thus you decided to use that...
signed main()
is much less idiomatic thanint main()
. \$\endgroup\$