This is a follow up to How slow is Python really? (Or how fast is your language?).
It turns out it was a bit too easy to get a x100 speedup for my last question. For those who have enjoyed the challenge but want something harder where they can really use their low-level skills, here is part II. The challenge is to get a x100 speedup for the following Python code, as tested on my computer.
To make it more difficult I am using pypy this time. The current timing for me is 1 minute and 7 seconds using pypy 2.2.1.
Rules
- The first person to submit code which I can run, which is correct, and which is x100 faster on my machine will be awarded a bounty of 50 points.
- I will award the win to the fastest code after a week.
import itertools
import operator
import random
n = 8
m = 8
iters = 1000
# creates an array of 0s with length m
# [0, 0, 0, 0, 0, 0, 0, 0]
leadingzerocounts = [0]*m
# itertools.product creates an array of all possible combinations of the
# args passed to it.
#
# Ex:
# itertools.product("ABCD", "xy") --> Ax Ay Bx By Cx Cy Dx Dy
# itertools.product("AB", repeat=5) --> [
# ('A', 'A', 'A', 'A', 'A'),
# ('A', 'A', 'A', 'A', 'B'),
# ('A', 'A', 'A', 'B', 'A'),
# ('A', 'A', 'A', 'B', 'B'),
# etc.
# ]
for S in itertools.product([-1,1], repeat = n+m-1):
for i in xrange(iters):
F = [random.choice([-1,0,0,1]) for j in xrange(n)]
# if the array is made up of only zeros keep recreating it until
# there is at least one nonzero value.
while not any(F):
F = [random.choice([-1,0,0,1]) for j in xrange(n)]
j = 0
while (j < m and sum(map(operator.mul, F, S[j:j+n])) == 0):
leadingzerocounts[j] +=1
j += 1
print leadingzerocounts
The output should be similar to
[6335185, 2526840, 1041967, 439735, 193391, 87083, 40635, 19694]
You must use a random seed in your code and any random number generator that is good enough to give answers close to the above will be accepted.
My Machine The timings will be run on my machine. This is a standard ubuntu install on an AMD FX-8350 Eight-Core Processor. This also means I need to be able to run your code.
Explanation of code
This code iterates over all arrays S of length n+m-1 that are made up of -1s and 1s. For each array S it samples 1000 non-zero random arrays F of length n made up of -1, 0 or 1, with probabilities 1/4, 1/2 and 1/4 of taking each value. It then computes the inner products between F and each window of S of length n until it finds a non-zero inner product. It adds 1 to leadingzerocounts at each position where it found a zero inner product.
Status
- Perl. 2.7 times slowdown by @tobyink. (Compared to pypy not cpython.)
- J. 39 times speedup by @Eelvex.
- C. 59 times speed up by @ace.
- Julia. 197 times faster not including start up time by @one-more-minute. 8.5 times speed up including start up time (it's faster using 4 processors in this case than 8).
- Fortran. 438 times speed up by @semi-extrinsic.
- Rpython. 258 times speed up by @primo.
- C++. 508 times speed up by @ilmale.
(I stopped timing the new improvements because they are too fast and iters was too small.)
It was pointed out that timings below a second are unreliable and also that some languages have a start-up cost. The argument is that if you include that, you should also include the compilation time of C/C++ etc. Here are the timings for the fastest code with the number of iterations increased to 100,000.
- Julia. 42 seconds by @one-more-minute.
- C++. 14 seconds by @GuySirton.
- Fortran. 14s by @semi-extrinsic.
- C++. 12s by @ilmale.
- Rpython. 18s by @primo.
- C++. 5s by @Stefan.
The winner is... Stefan!
Follow-up challenge posted: How high can you go? (A coding+algorithms challenge). This one is harder.
-
3\$\begingroup\$ an explanation of what the code is supposed to achieve would be nice, so we can rewrite it and not simply port it \$\endgroup\$Einacio– Einacio2014年04月28日 16:55:43 +00:00Commented Apr 28, 2014 at 16:55
-
6\$\begingroup\$ "The first person to submit code which I can run, is correct and is x100 times faster on my machine wins immediately and the competition closes." What is the purpose of closing the competition like that? Why not use a date deadline like most others, so we can see it furthered reduced in other languages? \$\endgroup\$grovesNL– grovesNL2014年04月28日 17:44:17 +00:00Commented Apr 28, 2014 at 17:44
-
5\$\begingroup\$ @Einacio That is a nice idea. I changed the rules which I hope no one will mind. \$\endgroup\$user9206– user92062014年04月28日 17:50:41 +00:00Commented Apr 28, 2014 at 17:50
-
1\$\begingroup\$ @Lembik I've improved my Fortran version, making it 2x faster again on my machine. Could you time it again? :) \$\endgroup\$semi-extrinsic– semi-extrinsic2014年05月02日 12:16:48 +00:00Commented May 2, 2014 at 12:16
-
1\$\begingroup\$ @semi-extrinsic Done. \$\endgroup\$user9206– user92062014年05月02日 19:20:15 +00:00Commented May 2, 2014 at 19:20
14 Answers
C++ x530 (previously x150, then x450)
Instead of arrays I used bits (and dark magic).
Thanks to @ace for the faster random function.
How does it work: the first 15 bits of the integer s represent the array S of 15 elements; the zeroes represent -1, the ones represent +1.
The array F is built in a similar way, but with two bits for each symbol:
- 00 represents -1
- 01 and 10 represent 0
- 11 represents 1
Because S and F have different representations, I have to interleave S with itself to make it comparable with F:
- 0 (-1) becomes 00 (-1 in the representation of F)
- 1 (+1) becomes 11 (+1 in the representation of F)
Now we can simply use a Karnaugh table to compute the inner product. Remember that one variable can only assume the value 00 or 11.
0 . 00 = 11 (-1 * -1 = +1)
0 . 01 = 10 (-1 * 0 = 0)
0 . 10 = 01 (-1 * 0 = 0)
0 . 11 = 00 (-1 * +1 = -1)
1 . 00 = 00 (+1 * -1 = -1)
1 . 10 = 10 (+1 * 0 = 0)
1 . 01 = 01 (+1 * 0 = 0)
1 . 11 = 11 (+1 * +1 = +1)
Looks like a NOT XOR (XNOR) to me. :)
Summing the ones is just a game of shifts and masks, nothing really complex.
#include <array>
#include <cstdint>
#include <cstdio>
#include <ctime>
// From standford bithacks
// http://graphics.stanford.edu/~seander/bithacks.html
inline int32_t interleaveBit(int32_t x)
{
static const uint32_t B[] = { 0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF };
x = (x | ( x << 8)) & B[3];
x = (x | ( x << 4)) & B[2];
x = (x | ( x << 2)) & B[1];
x = (x | ( x << 1)) & B[0];
return x | (x << 1);
}
inline int32_t sumOnes(int32_t v)
{
static int b[] = { 1, 0, 0, 1};
int s = 0;
for( int i = 0; i < 8; ++i)
{
const int a = 3&(v>>(i*2));
s += b[a];
}
return s;
}
inline int32_t sumArray(int32_t v)
{
static int b[] = { -1, 0, 0, 1};
int s = 0;
for( int i = 0; i < 8; ++i)
{
const int a = 3&(v>>(i*2));
s += b[a];
}
return s;
}
uint32_t x, y = 24252, z=57768, w=1564; //PRNG seeds
int32_t myRand()
{
uint32_t t;
t = x ^ (x<<1);
x = y;
y = z;
z = w;
w = w ^ ( w >> 19) ^ t ^ (t >> 8);
return w;
}
int main()
{
std::array<int, 8> leadingZero{0};
x = static_cast<int32_t>(time(nullptr)); // seed PRNG
const int maxS = 1 << 15;
for(int s = 0; s < maxS; ++s)
{
const int32_t x = interleaveBit(s);
for(int i = 0; i < 1000; ++i)
{
int32_t random;
do
{
random = 0xFFFF & myRand();
}while(sumOnes(random) == 0);
int j = 7;
while( j >= 0 )
{
const int32_t h = (x >> (j*2));
const int32_t l = 0xFFFF & (~(random ^ h)); // inner product
if(sumArray(l) == 0)
{
leadingZero[j]++;
} else
{
break;
}
j--;
}
}
}
for(int i = 7; i >= 0; --i)
{
printf("%d ", leadingZero[i]);
}
printf("\n");
return 0;
}
Here is a sample output:
6332350 2525218 1041716 438741 192917 87159 41023 19908
real 0m0.372s
user 0m0.371s
sys 0m0.001s
The program has been compiled with:
gcc -std=c++11 -O3 -msse4.2 -Wall -lstdc++ 26371.cpp -o fastPy
on Fedora 20 with gcc 4.8.2. The CPU is an 8-core i7.
Probably I can gain some ms by tweaking compiler parameters.
For comparison, this is the OP's solution time on my machine:
time pypy 26371.py
[6330609, 2523914, 1040885, 439303, 192708, 86987, 40710, 19498]
real 0m53.061s
user 0m53.016s
sys 0m0.022s
Edit:
Just by adding openmp and changing the order of the for loops I got a x3 gain, leading to a x450 performance improvement over the OP's code. :D
In this case the updates to the leadingZero array must be atomic. The globals of the PRNG... they are random, the races will just make them more random.
#pragma omp parallel for
for(int i = 0; i < 1000; ++i)
{
int32_t random;
do
{
random = 0xFFFF & myRand();
}while(sumOnes(random) == 0);
for(int s = 0; s < maxS; ++s)
{
const int32_t x = interleaveBit(s);
int j = 7;
while( j >= 0 )
{
const int32_t h = (x >> (j*2));
const int32_t l = 0xFFFF & (~(random ^ h)); // inner product
if( sumArray(l) == 0 )
{
leadingZero[j]++;
} else
{
break;
}
j--;
}
}
}
You need to add -fopenmp to the compiler flags.
Edit 2: As suggested by user71404, I changed the sumOnes and sumArray functions and now it's uber fast.
real 0m0.101s
user 0m0.101s
sys 0m0.000s
With openmp it is slower, because the atomics add too much overhead.
real 0m0.253s
user 0m1.870s
sys 0m0.001s
Without atomics it's even faster, but I get wrong results.
2137992 1147218 619297 321243 155815 70946 32919 15579
real 0m0.048s
user 0m0.338s
sys 0m0.001s
To understand sumArray, consider that 16 bits represent an array of 8 numbers:
- 00 has no 1s and represents -1
- 01 and 10 have one 1 and represent 0
- 11 has two 1s and represents 1
So the built-in counts the number of bits set to 1 (http://en.wikipedia.org/wiki/Hamming_weight) and from each group we subtract 1. Cool.
sumOnes is just black magic.
Here are the latest compile flags and code.
gcc -std=c++11 -mfpmath=sse -O3 -flto -march=native -funroll-loops -Wall -lstdc++
#include <cstdint>
#include <cstdio>
#include <ctime>
inline int32_t interleaveBit(int32_t x)
{
static const uint32_t B[] = { 0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF };
x = (x | ( x << 8)) & B[3];
x = (x | ( x << 4)) & B[2];
x = (x | ( x << 2)) & B[1];
x = (x | ( x << 1)) & B[0];
return x | (x << 1);
}
inline int32_t sumOnes(int32_t v)
{
/* 0xAAAA == 0b1010 1010 1010 1010 */
return !!(0xAAAA & (v ^ ~(v << 1)));
}
inline int32_t sumArray(int32_t v)
{
return __builtin_popcount(v) - 8;
}
uint32_t x, y = 24252, z = 57768, w = 1564; //PRNG seeds
int32_t myRand()
{
uint32_t t;
t = x ^ (x << 1);
x = y;
y = z;
z = w;
w = w ^ ( w >> 19) ^ t ^ (t >> 8);
return w;
}
int main()
{
int leadingZero[8] = { 0 };
x = static_cast<int32_t>(time(nullptr)); // seed PRNG
const int maxS = 1 << 15;
for( int i = 0; i < 1000; ++i )
{
int32_t random;
do
{
random = 0xFFFF & myRand();
} while(sumOnes(random) == 0 );
for( int s = 0; s < maxS; ++s )
{
const int32_t x = interleaveBit(s);
int j = 7;
while( j >= 0 )
{
const int32_t h = (x >> (j * 2));
const int32_t l = 0xFFFF & (~(random ^ h)); // inner product
if( sumArray(l) == 0 )
{
leadingZero[j]++;
} else
{
break;
}
j--;
}
}
}
printf("[%d, %d, %d, %d, %d, %d, %d, %d]\n",
leadingZero[7], leadingZero[6],
leadingZero[5], leadingZero[4],
leadingZero[3], leadingZero[2],
leadingZero[1], leadingZero[0]);
return 0;
}
-
\$\begingroup\$ Now I can't wait to test this! Sadly this won't be for a few hours. \$\endgroup\$user9206– user92062014年04月30日 09:38:03 +00:00Commented Apr 30, 2014 at 9:38
-
1\$\begingroup\$ The following was in a suggested edit, but I think that it may fit better as a comment. You can replace your sumOnes, sumArray with the following (seems to give me a 2x speed up over the openmp version).
inline int32_t sumOnes(int32_t v) { /* 0xAAAA == 0b1010 1010 1010 1010 */ return !! (0xAAAA & (v ^ ~(v << 1))); } inline int32_t sumArray(int32_t v) { return __builtin_popcount(v) - 8; }
this was suggested by @user71404 \$\endgroup\$user12205– user122052014年04月30日 14:33:41 +00:00Commented Apr 30, 2014 at 14:33 -
\$\begingroup\$ @user71404: the profiler said that the program spent all its time in those two functions, but I was too tired yesterday to think of something better than that. I'll try this evening (UTC). Thanks. \$\endgroup\$ilmale– ilmale2014年04月30日 14:40:57 +00:00Commented Apr 30, 2014 at 14:40
-
\$\begingroup\$ Would you mind changing the second code snippet to be the full copy and pastable code? I must be doing something wrong in my attempts to get your openmp code working so this would help a lot. \$\endgroup\$user9206– user92062014年04月30日 18:14:18 +00:00Commented Apr 30, 2014 at 18:14
-
\$\begingroup\$ Nice. I thought this could be done with bit operations. \$\endgroup\$Guy Sirton– Guy Sirton2014年04月30日 20:09:41 +00:00Commented Apr 30, 2014 at 20:09
C++ bit magic
~16 ms multithreaded, 56 ms singlethreaded. ~x4000 speedup.
(The speedup is based on the multithreaded code on my i7-2820QM and the 1 min 9 seconds mentioned in the question. Since the OP's system has worse single-threaded performance than my CPU but better multi-threaded performance, I expect this number to be accurate.)
The multithreaded part is quite inefficient due to the spawning of threads. I could probably do better by leveraging my custom job library, but that one has bugs under unix systems. For an explanation and almost identical code without threading, refer to https://codegolf.stackexchange.com/a/26485/20965.
Edit
I gave each thread its own RNG and cut the bit length down to 32, which reduced the runtime by a few ms.
#include <iostream>
#include <bitset>
#include <random>
#include <chrono>
#include <stdint.h>
#include <cassert>
#include <array>
#include <tuple>
#include <memory>
#include <thread>
#include <future>
#include <string.h>
#ifdef _MSC_VER
uint32_t popcnt( uint32_t x ){ return _mm_popcnt_u32(x); }
#else
uint32_t popcnt( uint32_t x ){ return __builtin_popcount(x); }
#endif
void convolve()
{
static const unsigned threadCount = 32;
static const unsigned n = 8;
static const unsigned m = 8;
static const unsigned totalIters = 1000;
static_assert( n <= 16, "packing of F fails when n > 16.");
static uint32_t fmask = (1 << n) -1; fmask |= fmask << 16;
std::array< uint32_t, m * threadCount > out;
std::vector< std::future<void> > threads;
for( int threadId = 0; threadId < threadCount; threadId++)
{
threads.emplace_back( std::async( [&, threadId]
{
std::random_device rd;
std::knuth_b gen(rd());
uint32_t nextRandomNumber = gen();
const unsigned iters = totalIters / threadCount;
std::array< uint32_t, m > leadingZeros;
for( auto& x : leadingZeros )
x = 0;
for( unsigned i = 0; i < iters; i++ )
{
// generate random bit mess
uint32_t F;
do {
// this funky looking construction shortens the dependancy chain of F
F = nextRandomNumber & fmask;
nextRandomNumber = gen();
} while ( 0 == ((F % (1 << n)) ^ (F >> 16 )) );
// Assume F is an array with interleaved elements such that F[0] || F[16] is one element
// here MSB(F) & ~LSB(F) returns 1 for all elements that are positive
// and ~MSB(F) & LSB(F) returns 1 for all elements that are negative
// this results in the distribution ( -1, 0, 0, 1 )
// to ease calculations we generate r = LSB(F) and l = MSB(F)
uint32_t r = F % ( 1 << n );
// modulo is required because the behaviour of the leftmost bit is implementation defined
uint32_t l = ( F >> 16 ) % ( 1 << n );
uint32_t posBits = l & ~r;
uint32_t negBits = ~l & r;
assert( (posBits & negBits) == 0 );
uint32_t mask = posBits | negBits;
uint32_t totalBits = popcnt( mask );
// if the amount of -1 and +1's is uneven, sum(S*F) cannot possibly evaluate to 0
if ( totalBits & 1 )
continue;
uint32_t adjF = posBits & ~negBits;
uint32_t desiredBits = totalBits / 2;
uint32_t S = (1 << (n + m -1));
// generate all possible N+1 bit strings
// 1 = +1
// 0 = -1
while ( S-- )
{
for( int shift = 0; shift < m; shift++ )
{
uint32_t s = (S >> shift) % ( 1 << n );
auto firstBits = (s & mask) ^ adjF;
if ( desiredBits == popcnt( firstBits ) )
{
leadingZeros[shift] = leadingZeros[shift] + 1;
}
else
{
break;
}
}
}
}
memcpy( out.data() + (threadId * m), leadingZeros.data(), sizeof( leadingZeros[0] ) * m );
} ));
};
std::array< uint32_t, m > leadingZeros;
for( auto& x : leadingZeros )
x = 0;
for( auto& thread : threads )
{
thread.wait();
}
for( int i = 0; i < (threadCount * m); i++ )
{
leadingZeros[i % m] += out[i];
}
for( auto x : leadingZeros )
std::cout << x << ", ";
std::cout << std::endl;
}
int main()
{
typedef std::chrono::high_resolution_clock clock;
int rounds = 100;
// do some rounds to get the cpu up to speed..
for( int i = 0; i < rounds / 10; i++ )
{
convolve();
}
auto start = clock::now();
for( int i = 0; i < rounds; i++ )
convolve();
auto end = clock::now();
double seconds = std::chrono::duration_cast< std::chrono::microseconds >( end - start ).count() / 1000000.0;
std::cout << seconds/rounds*1000 << " msec/round" << std::endl;
return 0;
}
Sample output:
6317312, 2515072, 1034368, 434048, 190144, 85200, 39804, 19168,
6226944, 2481408, 1031168, 438080, 192896, 86816, 40484, 19490,
6321152, 2524672, 1045376, 442880, 195680, 88464, 41656, 20212,
6330624, 2517504, 1031104, 430208, 187696, 83976, 38976, 18708,
6304768, 2510336, 1030720, 433056, 190880, 86824, 40940, 19840,
6272512, 2494720, 1028160, 432352, 189168, 84752, 39540, 19052,
6233600, 2507520, 1046912, 447008, 198224, 89984, 42092, 20292,
-
\$\begingroup\$ The output isn't right I think, maybe there is a bug? Compare to what is in the question. In particular the last column should be a number close to 19180. \$\endgroup\$user9206– user92062014年05月02日 06:10:31 +00:00Commented May 2, 2014 at 6:10
-
\$\begingroup\$ @Lembik i can see what you mean. I think that the random output is not random enough which creates funky output sometimes. With the C++11 random generator it works fine. I'll fix the code later today. \$\endgroup\$Stefan– Stefan2014年05月02日 12:08:47 +00:00Commented May 2, 2014 at 12:08
-
\$\begingroup\$ I get Stefan.cpp:104:101: error: ‘memcpy’ was not declared in this scope memcpy( out.data() + (threadId * m), leadingZeros.data(), sizeof( leadingZeros[0] ) * m ); \$\endgroup\$user9206– user92062014年05月05日 08:23:37 +00:00Commented May 5, 2014 at 8:23
-
\$\begingroup\$ I think that i need to include string.h. Try it again. \$\endgroup\$Stefan– Stefan2014年05月05日 09:41:08 +00:00Commented May 5, 2014 at 9:41
-
\$\begingroup\$ You compile this with g++ -O3 -std=c++0x -pthread -Wl,--no-as-needed Stefan.cpp -o Stefan \$\endgroup\$user9206– user92062014年05月05日 18:55:51 +00:00Commented May 5, 2014 at 18:55
Julia: 0.7s, 120x faster
As user20768 demonstrated, a straightforward port of the code to Julia is about twice as fast as PyPy. But we can do a lot better than that.
function pleadingzerocounts(; n = 8,
m = 8,
iters = 1000)
@parallel (+) for S = 1:2^(8+8-1)
leading_counts = zeros(Int, m)
F = Array(Int, n)
for i = 1:iters
flag = 0
while flag == 0
for i = 1:n
v = (1-(rand(Int8)&3))%2
@inbounds F[i] = v
flag += v & 1
end
end
for j = 1:m
sum = 0
for i = 1:n
@inbounds sum += S & (1 << (j + i - 2)) > 0 ? F[i] : -F[i]
end
sum == 0 ?
(leading_counts[j] += 1) :
break
end
end
leading_counts
end
end
function main()
# Warm up the JIT
pleadingzerocounts()
# Then go for real
println(@time pleadingzerocounts())
end
You can run this using julia -p 8 -e 'require("golf.jl");main()'
(the 8 is the number of processes, you might want to play around with it). On the latest Julia prerelease this takes 0.7s vs. 1m22s for PyPy.
If you have enough cores on your computer, and perhaps spin up a few AWS instances, you should be able to shave off some more :)
-
\$\begingroup\$ I'm quite sure you're measuring the timing wrong. Python with Pypy is also a JIT-based language, but the timings made by the OP include the JIT compilation time. You are excluding it. I installed the latest Julia git version and tested your code, and on my machine your command with 8 processes takes 6.6 seconds to finish, but it prints "elapsed time 0.588.. seconds". \$\endgroup\$semi-extrinsic– semi-extrinsic2014年04月30日 08:53:55 +00:00Commented Apr 30, 2014 at 8:53
-
\$\begingroup\$ The Python timing does include PyPy startup and JIT warmup, but that takes seconds at most – the difference over a minute of runtime is negligible. I'm happy if the OP changes the way the Python is timed (it won't make any difference), but including Julia's startup time wouldn't be reasonable. \$\endgroup\$one-more-minute– one-more-minute2014年04月30日 09:04:24 +00:00Commented Apr 30, 2014 at 9:04
-
\$\begingroup\$ I asked the OP in the comments to the original question, and he said "The timings should include everything for JIT languages." He also stated he will create a new challenge where the solutions will take much longer than 1 second, leaving Julia in the competition. \$\endgroup\$semi-extrinsic– semi-extrinsic2014年04月30日 12:17:01 +00:00Commented Apr 30, 2014 at 12:17
-
\$\begingroup\$ In that case the optimal solution is to use a serial algorithm – that takes about 2s. I'd publish the code but this competition is now somewhat redundant, since everyone already knows that C++ boots faster than everything else. \$\endgroup\$one-more-minute– one-more-minute2014年04月30日 12:33:56 +00:00Commented Apr 30, 2014 at 12:33
-
\$\begingroup\$ I just posted my Fortran solution, so I don't see why you shouldn't post the fastest Julia one (if you already have the code). \$\endgroup\$semi-extrinsic– semi-extrinsic2014年04月30日 13:04:09 +00:00Commented Apr 30, 2014 at 13:04
C, 1.210s
For comparison, OP's code runs in 1m45.729s on my machine.
Compilation:
gcc -O3 -march=native -fwhole-program -fstrict-aliasing -ftree-vectorize -Wall ./test2.c -o ./test2
Special thanks: @dyp for compilation flags and ideas for optimisations.
#include <stdio.h>
#include <time.h>
#define n (8)
#define m (8)
#define iters (1000)
int leadingzerocounts[m]; // declared as global so initialised to 0
unsigned int x,y=34353,z=57768,w=1564; //PRNG seeds
/* xorshift PRNG
* Taken from https://en.wikipedia.org/wiki/Xorshift#Example_implementation
* Used under CC-By-SA */
int myRand() {
unsigned int t;
t = x ^ (x << 11);
x = y; y = z; z = w;
return w = w ^ (w >> 19) ^ t ^ (t >> 8);
}
int dotproduct(int*F, int*S) {
unsigned int i;
int sum=0;
for(i=0; i<n; i++) {
sum+=F[i]*S[i];
}
return sum;
}
int main() {
unsigned int i, j, tmp;
x=(int)time(NULL); //seed PRNG
int S[n+m-1];
for(i=0; i<(1<<(n+m-1)); i++) {
tmp=i;
for(j=0; j<n+m-1; j++) {
S[j]=(tmp&1)*(-2)+1;
tmp>>=1;
}
for(j=0; j<iters; j++) {
int F[n];
unsigned int k, flag=0;
do {
for(k=0; k<n; k++) {
F[k]=(1-(myRand()&3))%2;
flag+=(F[k]&1);
}
} while(!flag);
for(k=0; k<m&&!dotproduct(F, S+k); k++) {
leadingzerocounts[k]++;
}
}
}
for(i=0; i<m; i++) printf("%d ", leadingzerocounts[i]);
return 0;
}
Sample output:
6334411 2527506 1042239 439328 192914 87005 40847 19728
-
1\$\begingroup\$ Interesting indeed, I can make similar observations when dropping all those optimization flags. Try -march=native -fwhole-program -fstrict-aliasing -ftree-vectorize. Btw. I got down to < 4 s by using some C++11 including a MT19937 plus a uniform_int_distribution. \$\endgroup\$dyp– dyp2014年04月28日 19:12:57 +00:00Commented Apr 28, 2014 at 19:12 -
1\$\begingroup\$ 1.119s making a speedup of roughly 59! \$\endgroup\$user9206– user92062014年04月28日 20:11:10 +00:00Commented Apr 28, 2014 at 20:11
-
1\$\begingroup\$ @ace Yes, I just wanted to point this out. It was simpler for me just to try some of the standard library PRNGs in C++. Note that you can use one 32-bit integer result from a PRNG to produce 8 entries for F. \$\endgroup\$dyp– dyp2014年04月28日 20:17:11 +00:00Commented Apr 28, 2014 at 20:17 -
1\$\begingroup\$ Since n is equal to 8, you can probably use AVX (or 2*SSE) to compute the dot product with a proper S storage. \$\endgroup\$Michael M.– Michael M.2014年04月29日 05:50:47 +00:00Commented Apr 29, 2014 at 5:50 -
2\$\begingroup\$ SSE version, small speed-up: gist.github.com/anonymous/11394210 (don't forget to include smmintrin.h) \$\endgroup\$Michael M.– Michael M.2014年04月29日 08:41:33 +00:00Commented Apr 29, 2014 at 8:41
Perl
This is nowhere near as fast as the C solution, but is pretty fast for a high-level interpreted language I think. It shaves about 40% off the running time of the Python implementation.
#!/usr/bin/env perl
use v5.10;
use strict;
use warnings;
use Algorithm::Combinatorics qw( variations_with_repetition );
use List::Util qw( any sum );
use constant {
N => 8,
M => 8,
ITERS => 1000,
};
my @leadingzerocounts;
my $variations = variations_with_repetition([-1, 1], N + M - 1);
while (my $S = $variations->next)
{
for my $i (1 .. ITERS)
{
my @F;
until (@F and any { $_ } @F)
{
@F = map +((-1,0,0,1)[rand 4]), 1..N;
}
my $j = 0;
while ($j < M)
{
last if sum map $F[$_]*$S->[$j+$_], 0..N-1;
$leadingzerocounts[$j++]++;
}
}
}
say join ", ", @leadingzerocounts;
The Algorithm::Combinatorics module is available in Ubuntu (sudo apt-get install libalgorithm-combinatorics-perl). The other modules used are core Perl modules, so they should already be installed as part of the base Ubuntu installation.
-
1\$\begingroup\$ It won't affect speed, but it's the 0..N-1 range in the last map, right? Did you forget to use warnings? :-) Though the logic in the OP is confusing, the sliding window never gets to the last element of S. \$\endgroup\$user2846289– user28462892014年04月29日 11:21:09 +00:00Commented Apr 29, 2014 at 11:21 -
\$\begingroup\$ Ah. I just figured that the arrays were of mismatched sizes, so I disabled warnings, allowing the missing elements to be treated as zero. N-1 improves this. And it does actually improve the speed very slightly: it's now about 40% faster than the Python implementation. \$\endgroup\$tobyink– tobyink2014年04月29日 13:22:19 +00:00Commented Apr 29, 2014 at 13:22 -
\$\begingroup\$ I think your code requires a very modern version of List::Util. On ubuntu 14.04 I get "any" is not exported by the List::Util module \$\endgroup\$user9206– user92062014年04月30日 18:26:00 +00:00Commented Apr 30, 2014 at 18:26
-
\$\begingroup\$ Oh yes, that's true - you'll probably need to install List::Util off CPAN. any can alternatively be found in List::MoreUtils, which although not a core module is one of the most commonly used CPAN modules. \$\endgroup\$tobyink– tobyink2014年04月30日 18:36:35 +00:00Commented Apr 30, 2014 at 18:36
Julia: 4.66x slower!
I am really beginning to doubt the statistics on their website...
Note that the following Julia code is effectively a direct transcription of the OP's Python code without any optimisations. I use the time() function to exclude Julia's slow startup time...
srand(27182818284590)
t = time()
require("Iterators")
n = 8
m = 8
iters = 1000
bothzero = 0
leadingzerocounts = zeros(m)
for S in Iterators.product(fill([-1,1], n+m-1)...)
for i = 1:iters
F = [[-1 0 0 1][rand(1:4)] for j = 1:n]
while all((x) -> x == 0, F)
F = [[-1 0 0 1][rand(1:4)] for j = 1:n]
end
j = 1
while j <= m && sum(map(*, F, S[j:j+n-1])) == 0
leadingzerocounts[j] += 1
j += 1
end
end
end
println(leadingzerocounts)
t = time() - t
println("$t seconds")
Julia: 5 m 32.912 s
OP's code in PyPy: 1 m 11.506 s
Julia output:
6332170
2525472
1041522
438761
193119
86873
40705
19662
-
\$\begingroup\$ Global variables, imports, and array comprehensions are slow. This isn't how one would typically write a Julia program for performance. \$\endgroup\$Alex A.– Alex A.2015年04月12日 04:54:16 +00:00Commented Apr 12, 2015 at 4:54
Julia: 1 min 21.4s (2.2x faster) (modification of Arman's code)
OP's code in PyPy: 3 min 1.4 s
Both done in the REPL, not including time to load packages.
function foo()
n = 8
m = 8
iters = 1000
bothzero = 0
leadingzerocounts = zeros(Int,m)
P=[-1,0,0,1]
for S in Iterators.product(fill([-1,1], n+m-1)...)
Sm=[S...]
for i = 1:iters
F = P[rand(1:4,n)]
while all(F==0)
F = P[rand(1:4,n)]
end
j = 1
while j <= m && dot(F,Sm[j:j+n-1]) == 0
leadingzerocounts[j] += 1
j += 1
end
end
end
println(leadingzerocounts)
end
There are some problems with Arman's code making it very slow: it uses a lot of anonymous functions and higher-order functions unnecessarily. To test whether all of a vector F is zero, why not just write all(F==0) instead of all(x->x==0,F)? It's shorter, and literally a thousand times faster.
It also uses sum(map(*,x,y)) as the dot product instead of simply dot(x,y). The first version is 650 times slower for a vector of 10k doubles. And the dot product function is implemented as a for loop in pure Julia.
Also, array comprehensions are slow. It is better to write [-1,0,0,1][rand(1:4,n)] instead of [[-1 0 0 1][rand(1:4)] for j = 1:n].
Finally, global variables are bad juju in Julia. Julia is only fast if you write code in such a way that allows the JIT and type inference to work. A big part of this is type stability: the compiler must be able to be sure that a variable's type will not change while inside a loop, for example.
-
\$\begingroup\$ Thanks! I see that I still have quite a bit to learn about the Julia language before I can go making claims about its speed :) Really glad to see that a few trivial fixes to my code improved its execution time severalfold. \$\endgroup\$nimble agar– nimble agar2014年05月01日 06:01:38 +00:00Commented May 1, 2014 at 6:01
RPython 0.187s (258x faster)
Original Source w/ PyPy2.2.1: 1m 6.718s
Now with threading, back-support for standard Python has been dropped. The number of worker threads can be specified as a command line parameter, default is two.
from time import time, sleep
from math import fmod
from rpython.rlib.rthread import start_new_thread, allocate_lock, get_ident
class Random:
__slots__ = ['s']
def __init__(self, s=1):
self.s = s
def init_genrand(self, seed):
self.s = seed
def genrand32(self):
# xorshift PRNG with period 2^32-1
# see http://core.kmi.open.ac.uk/download/pdf/6250138.pdf
self.s ^= (self.s << 13)
self.s ^= (self.s >> 17)
self.s ^= (self.s << 5)
return self.s
class ThreadEnv:
__slots__ = ['n', 'm', 'iters', 'counts', 'running', 'lock']
def __init__(self):
self.n = 8
self.m = 8
self.iters = 1000
self.counts = [0]*8
self.running = 0
self.lock = None
env = ThreadEnv()
truth = [-1,0,0,1]
def main(argv):
argc = len(argv)
if argc < 4 or argc > 5:
print 'Usage: %s N M ITERS [NUM_THREADS=2]'%argv[0]
return 1
if argc == 5:
num_threads = int(argv[4])
else:
num_threads = 2
env.n = int(argv[1])
env.m = int(argv[2])
env.iters = int(argv[3]) // num_threads
env.counts = [0]*env.m
env.lock = allocate_lock()
for i in xrange(num_threads-1):
start_new_thread(run,())
env.running += 1
env.running += 1
# use the main process as a worker
run()
# wait for any laggers
while env.running:
sleep(0.01)
print env.counts
return 0
def run():
n, m, iters = env.n, env.m, env.iters
counts = [0]*m
sbits = [0]*(n+m-1)
random = Random()
seed = int(fmod(time(), 2147483.648)*1000) ^ get_ident()
random.init_genrand(seed)
for S in xrange(1<<n+m-1):
i = 0
sbit = 0
while not sbit:
sbits[i] ^= 3
sbit = sbits[i]
i += 1
for i in xrange(iters):
f = 0
while not f:
F = random.genrand32()
G, x = F, 0
for k in xrange(n):
x += truth[(G&3)^sbits[k]]
f |= x
G >>= 2
if not x:
counts[0] += 1
for j in xrange(1, m):
G, x = F, 0
for k in xrange(j, n+j):
x += truth[(G&3)^sbits[k]]
G >>= 2
if x: break
counts[j] += 1
# passing True stalls until a lock can be obtained
env.lock.acquire(True)
for i in xrange(m):
env.counts[i] += counts[i]
env.running -= 1
env.lock.release()
def target(*args):
return main, None
RPython is a restricted subset of Python, which can be translated to C and then compiled using the RPython Toolchain. Its expressed purpose is to aid in the creation of language interpreters, but it can also be used to compile simple programs like the one above. Most of the 'fancier' features of Python, such as itertools or even map, are not available.
To compile, make a local clone of the current pypy repository, and run the following:
$ pypy %PYPY_REPO%/rpython/bin/rpython --thread convolution.py
The resulting executable will be named convolution-c or similar in the current working directory.
I've parameterized the input variables, so the program should be run as:
convolution-c 8 8 1000
to match the sample code.
Implementation Notes
S in itertools.product([-1,1], repeat = n+m-1) becomes S in xrange(1<<n+m-1), interpreting S as a bit map: [0, 1] → [-1, 1]
Likewise, F is also a bit map, with each two bits representing a single value: [00, 01, 10, 11] → [-1, 0, 0, 1]
A truth table is used to look up the product, rather than performing a multiplication.
Because 32-bit signed integers are used, n may be no larger than 15, and n+m no larger than 31. Arbitrary integer support can be achieved with the rpython.rlib.rbigint module, if necessary.
The first iteration of the dot-product loop is unrolled, and combined with the nullity test of F.
A homebrew PRNG is used, source listed. The author of the paper demonstrates a period of 2^32 - 1, and claims that it passes all Diehard tests save one, although I haven't personally confirmed this.
The random seed changes every millisecond, which is about as good as using a timestamp will allow. Additionally, each worker thread xors its process id with this value, to ensure that they each have a different seed.
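A rough CPython equivalent of that seeding scheme (a hypothetical sketch, not the exact RPython arithmetic shown in the listing):

```python
import time
import threading

def make_seed():
    # Sketch of the scheme described above: a millisecond-resolution
    # timestamp folded into 31 bits, XOR-ed with a per-thread identifier
    # so that workers started in the same millisecond still diverge.
    millis = int(time.time() * 1000) & 0x7FFFFFFF
    return millis ^ threading.get_ident()
```

Since XOR-ing one base value with distinct identifiers always yields distinct results, two workers never share a seed within the same millisecond.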
Sample Timings
2 worker threads:
$ timeit convolution-c 8 8 1000 2
[6331845, 2526161, 1042330, 440018, 193724, 87147, 40943, 19603]
Elapsed Time: 0:00:00.375
Process Time: 0:00:00.687
System Calls: 6927
4 worker threads:
$ timeit convolution-c 8 8 1000 4
[6334565, 2527684, 1043502, 440216, 193225, 87398, 40799, 19338]
Elapsed Time: 0:00:00.218
Process Time: 0:00:00.796
System Calls: 3417
8 worker threads:
$ timeit convolution-c 8 8 1000 8
[6327639, 2522483, 1039869, 437884, 192460, 86771, 40420, 19403]
Elapsed Time: 0:00:00.187
Process Time: 0:00:00.734
System Calls: 3165
OP's original source:
$ timeit pypy convolution-orig.py
[6330610, 2525644, 1041481, 438980, 193001, 86622, 40598, 19449]
Elapsed Time: 0:01:06.718
Process Time: 0:01:06.718
System Calls: 11599808
Timing for 100000 iterations:
$ timeit convolution-c 8 8 100000 8
[633156171, 252540679, 104129386, 43903716, 19307215, 8709157, 4072133, 1959124]
Elapsed Time: 0:00:16.625
Process Time: 0:01:02.406
System Calls: 171341
- I have never seen an rpython program before. This is great. Now is there an equivalent pure python program that pypy can run in 1.03s? (user9206, Apr 30, 2014 at 11:41)
- @Lembik I'd like to see one. I thought 4.7s was pretty good, considering my first attempt at pure python was ~15s. (primo, Apr 30, 2014 at 13:01)
- Yes, sorry for the delay. I haven't got the code running yet but will as soon as possible. (user9206, May 2, 2014 at 11:51)
- You should try adding a JIT. Now THAT would be fast! (kirbyfan64sos, May 2, 2014 at 22:21)
- @Lembik thanks for the mention ;) Out of curiosity, did it run fastest with 4 worker threads, or 8? (primo, May 2, 2014 at 23:28)
Nimrod
import times, locks, strutils, unsigned

const
  N = 8
  M = 8
  iters = 1000
  numThreads = 8

type
  SVec = array[0..N+M-1, int]
  FVec = array[0..N-1, int]
  ComputeThread = TThread[int]

var
  rngSeed = int(epochTime()*1000)
  totalLeadingZeros: array[0..M-1, int]
  lock: TLock

type
  RNGState = object
    x, y, z, w: uint32

proc newRNG(seed: int): RNGState =
  result.x = uint32(seed)

proc random(rng: var RNGState): int =
  let t = rng.x xor (rng.x shl 11)
  rng.x = rng.y; rng.y = rng.z; rng.z = rng.w
  rng.w = rng.w xor (rng.w shr 19) xor t xor (t shr 8)
  result = int(rng.w)

proc initVecRand(v: var FVec, rng: var RNGState) =
  const values = [ -1, 0, 0, 1 ]
  var rnd = rng.random
  var bitAcc = 0
  for i in 0 .. <len(v):
    let val = values[rnd and 3]
    rnd = rnd shr 2
    v[i] = val
    bitAcc = bitAcc or val
  if bitAcc == 0:
    initVecRand(v, rng)

proc convolve(s: SVec, f: FVec, offset: int): int =
  for i in 0 .. <len(f):
    result += s[i+offset]*f[i]

proc iterate(v: var SVec) =
  for i in 0 .. <len(v):
    if v[i] == -1:
      v[i] = 1
      return
    v[i] = -1

proc mainThread(id: int) {.thread.} =
  const numS = 1 shl (N+M-1)
  var
    s: SVec
    f: FVec
    leadingZeros: array[0..M-1, int]
    rng = newRNG(rngSeed + id)
  for k in 0 .. <len(s):
    s[k] = -1
  for i in 1..numS:
    for j in countUp(id, iters, numThreads):
      initVecRand(f, rng)
      if convolve(s, f, 0) == 0:
        leadingZeros[0] += 1
        for k in 1 .. <M:
          if convolve(s, f, k) == 0:
            leadingZeros[k] += 1
          else:
            break
    iterate(s)
  acquire(lock)
  for i in 0 .. <M:
    totalLeadingZeros[i] += leadingZeros[i]
  release(lock)

proc main =
  let startTime = epochTime()
  var threads: array[1..numThreads, ComputeThread]
  initLock(lock)
  for i in 1..numThreads:
    createThread(threads[i], mainThread, i)
  for i in 1..numThreads:
    joinThread(threads[i])
  echo("Leading zeros: ", @totalLeadingZeros)
  let endTime = epochTime()
  echo("Time taken: ", formatFloat(endTime - startTime, ffDecimal, 3),
       " seconds")

main()
Example output:
Leading zeros: @[6333025, 2525808, 1042466, 439138, 192391, 86751, 40671, 19525]
Time taken: 0.145 seconds
Nimrod compiles to C, therefore the choice of C compiler for the backend matters, too.
Using clang, compile with:
nimrod cc --threads:on --cc=clang --passc:-flto -d:release conv.nim
Using gcc, compile with:
nimrod cc --threads:on --cc=gcc --passc:-flto -d:release conv.nim
Omit --passc:-flto if you have an older C compiler that doesn't support LTO. Omit the --cc=... option if you are fine with the default choice for the C compiler. The code requires Nimrod 0.9.4 or 0.9.5.
On my quadcore iMac (2.66 GHz core i5), the code runs in about .15 seconds with gcc 4.9, .16 seconds with clang, compared to 88 seconds for PyPy 2.2.1 (i.e. a 500+ times speedup). Unfortunately, I don't have access to a machine with more than four cores that also has PyPy installed or where I could easily install PyPy, though I get about .1 seconds (with a lot of measurement noise) on a 64-core AMD Opteron 6376 1.4 GHz (according to /proc/cpuinfo) with gcc 4.4.6.
The implementation tries to be faithful to the original rather than optimizing code at the cost of readability, while not forgoing obvious optimizations. Interestingly enough, tail recursion in initVecRand() is a bit faster than a loop with a break instruction with both gcc and clang. Manually unrolling one iteration of the convolve test loop inside the main loop also produced a speedup, presumably due to better branch prediction.
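The xorshift128 generator embedded here (and reused by the Java and Fortran answers below) is easy to sanity-check in Python with explicit 32-bit masking. Seeded with Marsaglia's reference constants (the ones the Fortran answer uses; this Nimrod port seeds only x from the clock), the first output is 3701687786:

```python
MASK = 0xFFFFFFFF  # keep every intermediate in 32 bits

def xorshift128(state):
    # One step of Marsaglia's xorshift128; state is the mutable list [x, y, z, w].
    x, y, z, w = state
    t = (x ^ (x << 11)) & MASK
    new_w = (w ^ (w >> 19) ^ t ^ (t >> 8)) & MASK
    state[:] = [y, z, w, new_w]
    return new_w

state = [123456789, 362436069, 521288629, 88675123]
print(xorshift128(state))  # 3701687786
```

Note the ports differ in word-size details (the Java version, for instance, mixes signed int and long shifts), so this is a sketch of the algorithm rather than a bit-exact model of each answer.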
- How do you get nimrod for ubuntu? (user9206, May 1, 2014 at 13:42)
- @Lembik A quick Google search would give you nimrod-lang.org/download.html (user12205, May 1, 2014 at 13:54)
- @ace I had also included the link in my post (though it's hard to see with blue on black now that I look at it). (Reimer Behrends, May 1, 2014 at 14:02)
- You could speed this up even more by increasing the seed size to 128 bits: xorshift.di.unimi.it (flaviut, May 2, 2014 at 22:36)
Java
I translated the above C++ solution to Java:
import java.util.Random;
import java.util.Arrays;

public class Bench2 {
    public static int[] bits = { 0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF };
    public static int[] oneValues = { 1, 0, 0, 1 };
    public static int[] values = { -1, 0, 0, 1 };
    public static int n = 8;
    public static int m = 8;
    public static int iters = 1000;
    private static int x,y=34353,z=57768,w=1564;

    public static void main( String[] args ) {
        x = (int) (System.currentTimeMillis()/1000l);
        int[] leadingzerocounts = new int[ m ];
        Arrays.fill( leadingzerocounts, 0 );
        int maxS = 1 << 15;
        for( int s = 0; s < maxS; s++ ) {
            int x = interleaveBit( s );
            for( int i=0; i<iters; i++ ) {
                int random;
                do {
                    random = 0xFFFF & fastRandom( );
                } while( sumOnes( random ) == 0 );
                int j = 7;
                while( j >= 0 ) {
                    int h = ( x >> (j*2) );
                    int l = 0xFFFF & (~(random ^ h));
                    if( sumArray( l ) == 0 ) {
                        leadingzerocounts[ j ]++;
                    } else {
                        break;
                    }
                    j--;
                }
            }
        }
        for( int i = 7; i >= 0; --i ) {
            System.out.print( leadingzerocounts[ i ] + " " );
        }
        System.out.println( );
    }

    public static int interleaveBit( int x ) {
        x = (x | ( x << 8)) & bits[3];
        x = (x | ( x << 4)) & bits[2];
        x = (x | ( x << 2)) & bits[1];
        x = (x | ( x << 1)) & bits[0];
        return x | (x << 1);
    }

    public static int sumOnes( int v ) {
        return (0xAAAA & (v ^ ~(v << 1)));
        // int s = 0;
        // for( int i = 0; i < 8; ++i ) {
        //     int a = 3 & ( v >> (i*2) );
        //     s += oneValues[ a ];
        // }
        // return s;
    }

    public static int sumArray( int v ) {
        return Integer.bitCount( v ) - 8;
        // int s = 0;
        // for( int i=0; i<8; ++i ) {
        //     int a = 3 & ( v >> (i*2) );
        //     s += values[ a ];
        // }
        // return s;
    }

    public static int fastRandom( ) {
        long t;
        t = x ^ (x << 11);
        x = y; y = z; z = w;
        return w = (int)( w ^ (w >> 19) ^ t ^ (t >> 8));
    }
}
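The commented-out loops document what the bit tricks compute. The identity behind sumArray (the popcount of ~(F ^ h) over 16 bits, minus 8, equals the dot product when F packs values from [-1, 0, 0, 1] two bits at a time and each +1/-1 of S is packed as 11/00) can be spot-checked in Python. Here h is built directly from a +1/-1 vector rather than via interleaveBit:

```python
import random

VALUES = [-1, 0, 0, 1]  # 2-bit F code -> value

def naive_dot(fcode, svec):
    # Plain dot product of the decoded 8-element vectors.
    return sum(VALUES[(fcode >> (2*i)) & 3] * svec[i] for i in range(8))

def bit_dot(fcode, svec):
    # The Java answer's trick: pack each +1 of S as 0b11 and each -1
    # as 0b00; popcount(~(F ^ h) & 0xFFFF) - 8 then equals the dot product,
    # because each 2-bit group contributes popcount(group) - 1.
    h = 0
    for i, s in enumerate(svec):
        if s == 1:
            h |= 0b11 << (2*i)
    l = ~(fcode ^ h) & 0xFFFF
    return bin(l).count("1") - 8

random.seed(0)
for _ in range(1000):
    fcode = random.getrandbits(16)
    svec = [random.choice((-1, 1)) for _ in range(8)]
    assert naive_dot(fcode, svec) == bit_dot(fcode, svec)
print("bit trick matches naive dot product")
```

Per 2-bit group, XOR with 0b11 complements the F code (negating its value) while XOR with 0b00 leaves it alone, and popcount(code) - 1 decodes the value; summing over 8 groups gives the -8 offset.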
On my machine I get the following output for the Java program:
time java Bench2
6330616 2524569 1040372 439615 193290 87131 40651 19607
java Bench2 0.36s user 0.02s system 102% cpu 0.371 total
The OP's program runs about 53 seconds on my machine:
time pypy start.py
[6330944, 2524897, 1040621, 439317, 192731, 86850, 40830, 19555]
pypy start.py 52.96s user 0.06s system 99% cpu 53.271 total
The C++ program executed in only about 0.15 seconds:
time ./benchcc
[6112256, 2461184, 1025152, 435584, 193376, 87400, 40924, 19700]
./benchcc 0.15s user 0.00s system 99% cpu 0.151 total
That is about 2.5x faster than the corresponding Java solution (I didn't exclude VM startup). This Java solution is about 142x faster than the program executed with PyPy.
Since I was personally interested, I set iters to 100_000 for Java and C++, but the factor of 2.5 didn't decrease in favor of Java; if anything, it got bigger.
EDIT: I ran the programs on a 64bit Arch Linux PC.
EDIT2: I want to add that I started with a rough translation of the python code:
import java.util.Random;
import java.util.Arrays;

public class Bench {
    public static int[] values = { -1, 0, 0, 1 };
    public static int n = 8;
    public static int m = 8;
    public static int iters = 1000;
    private static int x,y=34353,z=57768,w=1564;

    public static void main( String[] args ) {
        x = (int) (System.currentTimeMillis()/1000l);
        int[] leadingzerocounts = new int[ m ];
        Arrays.fill( leadingzerocounts, 0 );
        int[] S = new int[ n+m-1 ];
        Arrays.fill( S, -1 );
        do {
            for( int i=0; i<iters; i++ ) {
                int[] F = new int[ n ];
                do {
                    randomArray( F );
                } while( containsOnlyZeros( F ) );
                for( int j=0; j < m && check( F, S, j ); j++ ) {
                    leadingzerocounts[ j ] += 1;
                }
            }
        } while( next( S ) );
        System.out.println( Arrays.toString( leadingzerocounts ) );
    }

    public static void randomArray( int[] F ) {
        for( int i = 0; i<F.length; i++ ) {
            F[ i ] = (1-(fastRandom()&3))%2;
        }
    }

    public static boolean containsOnlyZeros( int[] F ) {
        for( int x : F ) {
            if( x != 0 ) {
                return false;
            }
        }
        return true;
    }

    public static boolean next( int[] S ) {
        for( int i=0; i<S.length; i++ ) {
            if( ( S[ i ] = -S[ i ] ) == 1 ) {
                return true;
            }
        }
        return false;
    }

    public static boolean check( int[] F, int[] S, int j ) {
        int sum = 0;
        for( int i=0; i<n; i++ ) {
            sum += F[ i ] * S[ j + i ];
        }
        return sum == 0;
    }

    public static int fastRandom( ) {
        long t;
        t = x ^ (x << 11);
        x = y; y = z; z = w;
        return w = (int)( w ^ (w >> 19) ^ t ^ (t >> 8));
    }
}
This program ran in about 3.6 seconds:
time java Bench
[6330034, 2524369, 1040723, 439261, 193673, 87338, 40840, 19567]
java Bench 3.64s user 0.01s system 101% cpu 3.600 total
Which is about 14 times faster than the PyPy solution. (Choosing the standard random function over the fastRandom function leads to an execution time of 5 seconds)
Python 3.5 + numpy 1.10.1, 3.76 seconds
The tests were run on my Macbook Pro. OP's code took ~6 mins on the same machine.
The reason I'm answering this question, in fact, is that I don't have 10 reputation and can't answer Part I :-p
For the past few days, I have been trying to figure out how to perform massive convolutions efficiently with numpy (without relying on a third-party package, even scipy). As I came across this series of challenges during my research, I decided to give it a try. I may have come to this game too late, but here is my attempt using Python 3.5 and numpy 1.10.1.
import itertools
import numpy as np

def test_convolv():
    n = 8
    m = 8
    iters = 1000
    ilow = np.ceil(0+n/2).astype(int)
    ihigh = np.ceil(m+n/2).astype(int)
    leadingzerocounts = np.zeros(m)
    # Pre-compute S & F
    S = np.array(list(itertools.product([-1,1], repeat = n+m-1)))
    choicesF = np.random.choice(np.array([-1, 0, 0, 1], dtype=np.int8), size=n*iters).reshape(iters,n)
    imask = ~np.any(choicesF, axis=1)
    while np.any(imask):
        imasksize = np.count_nonzero(imask)
        choicesF[imask,:] = np.random.choice(np.array([-1, 0, 0, 1], dtype=np.int8), size=n*imasksize).reshape(imasksize, n)
        imask = ~np.any(choicesF, axis=1)
    for i in np.arange(iters):
        F = choicesF[i, :]
        # This is where the magic is: by flattening the S array,
        # I try to take advantage of speed of the np.convolve
        # (really numpy.multiarray.correlate).
        FS = (np.convolve(S.reshape(-1), F, 'same').reshape(S.shape))[:, ilow:ihigh]
        jmask_not = (FS[:, 0] != 0)
        leadingzerocounts[0] = leadingzerocounts[0]+np.count_nonzero(~jmask_not)
        for j in np.arange(n-1)+1:
            jmask = (FS[jmask_not, j] != 0)
            leadingzerocounts[j] = leadingzerocounts[j] + np.count_nonzero(~jmask)
            jmask_not[(jmask_not.nonzero()[0])[jmask]] = False
    print(leadingzerocounts)
I pre-computed the S and F arrays, and flattened the S array while performing the convolution, which (based on my experiments) could take advantage of the speed of np.convolve. In other words, as I didn't find a vectorized convolution routine, I fake-vectorized the code by flattening the whole array and hoped np.convolve would do the vectorization under the hood for me, which seemed to work. Note I used mode='same' and trimmed the leading and trailing elements that were useless.
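A small sanity check of the np.convolve behavior this relies on (toy arrays, not the challenge data): 'same' is the centered slice of 'full', and the sliding dot products the challenge needs are convolution with the kernel reversed:

```python
import numpy as np

a = np.array([1, 2, 3, 4])
f = np.array([1, 0, -1])

full = np.convolve(a, f)              # length len(a) + len(f) - 1 == 6
same = np.convolve(a, f, 'same')      # centered slice of 'full'
assert np.array_equal(same, full[1:5])

# Sliding dot products of f against a == 'valid' convolution with f reversed
corr = np.convolve(a, f[::-1], 'valid')
expected = np.array([a[i:i+3] @ f for i in range(2)])
assert np.array_equal(corr, expected)
print(same, corr)
```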
On my Macbook Pro, the test results give 3.76 seconds. When I ran OP's code (modified to Python 3.5), I got about 6 minutes. The speedup is about 100 times.
One drawback is that because the S and F arrays are to be stored, the memory requirement can be a problem if the sizes are too big.
I used the same method for Part I and I got a ~ 60-100x speedup on my laptop.
As I did everything on my Macbook Pro, if someone could test my code and let me know how it goes on your machine, I would appreciate it very much!
J, ~50x speedup? (previously claimed 130x)
n =: m =: 8
len =: 1000
S =: (] - 0 = ])S0=: #:i.2^<:+/n,m
k =: (n#0) -.~ (_1 0 0 1) {~ (n#4) #: i.4^n
sn =: (]-0=])#:i.2^n
ku =: ~. k
M =: 0=+/"1 sn *"1/ ku
fs =: (ku&i.)"1 k
snum =: n #.\"1 S0
run =: 3 : 0
r =: n#0
for_t. (snum) do.
rn =: fs{~? len # #k
r =: r + +/"1*/\rn{"1 t{M
end.
r
)
echo run 0
exit''
Times on a random debian:
u#>time j slowpy.ijs
6334123 2526955 1041600 440039 193567 87321 40754 19714
real 0m2.453s
user 0m2.368s
sys 0m0.084s
u#>time python slow_pyth.py
[6331017, 2524166, 1041731, 438731, 193599, 87578, 40919, 19705]
real 5m25.541s
user 5m25.548s
sys 0m0.012s
I think there is room for improvement.
- The Python script is supposed to be executed using pypy, not python, which is why your script seems to be giving a 130x speed-up. (user12205, Apr 29, 2014 at 12:11)
- @ace yes I noticed but I can't install pypy :-/ I think the order of magnitude will remain though. (Eelvex, Apr 29, 2014 at 12:25)
- Not necessarily... i.imgur.com/n566hzw.png (user12205, Apr 29, 2014 at 12:45)
- Indeed, not necessarily. (Eelvex, Apr 29, 2014 at 12:48)
- What problem do you have installing pypy? (user9206, Apr 29, 2014 at 13:16)
C++: x200 (4-core i7, should scale to x400 on 8-core)
Trying for a more straightforward C++11 (Tested with VS 2012, gcc and clang) solution with parallelization.
To get this to compile and run under Linux with gcc 4.8.1:
g++ -O3 -msse -msse2 -msse3 -march=native -std=c++11 -pthread -Wl,--no-as-needed golf.cpp
Under Linux we also need std::launch::async to force multiple threads. I was missing that in an earlier version.
In Visual Studio (2012+) this should just work but make a release build for timing...
On my oldish dual-core i3 this runs in ~0.9 seconds. On my quad-core i7 it's 0.319s vs. 66 seconds for pypy.
On an 8-core i7 this should be in the x400 speedup range. Switching to C-style arrays would speed it up, but I was interested in staying with C++ containers. For me it's interesting to see the speedup you can get while staying relatively close to the problem domain and at a relatively high level, something I think C++ is really good at. Also of note is the relative ease of parallelization using C++11 constructs.
@ilmale's bit solution is very cool and works for -1/1/0. One could also throw SSE at this and maybe get a significant speedup.
Beyond the parallelization there's another "trick" in there which is reducing the number of summations. Sample results: 6332947 2525357 1041957 438353 193024 87331 40902 19649
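The summation trick: each trial records only the position of the first non-zero convolution, and the per-position leading-zero counts are recovered at the end with one suffix sum. A toy check in Python:

```python
m = 8

def naive(first_nonzero_positions):
    # Direct translation of the original: increment counts[0..p-1]
    # for a trial whose first non-zero convolution is at position p.
    counts = [0]*m
    for p in first_nonzero_positions:
        for j in range(p):
            counts[j] += 1
    return counts

def suffix_sum(first_nonzero_positions):
    # The C++ answer's approach: histogram the positions 0..m, then
    # counts[j] = number of trials whose first non-zero is > j.
    hist = [0]*(m+1)
    for p in first_nonzero_positions:
        hist[p] += 1
    for i in range(m-1, -1, -1):
        hist[i] += hist[i+1]
    return hist[1:]          # drop hist[0] (== total trial count)

positions = [0, 3, 8, 1, 2, 2, 0, 5]
assert naive(positions) == suffix_sum(positions)
print("suffix sum matches per-position increments")
```

This replaces up to m counter increments per trial with a single histogram bump, mirroring convolve_until_nonzero plus the "Finish adding things up" loop in the listing below.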
#include <vector>
#include <iostream>
#include <thread>
#include <future>
#include <time.h>
#include <ctime>
#include <algorithm>

using namespace std;

// Bring some of these constants out to share
const size_t m = 8;
const int nthreads = 16;
const size_t cn = 15;
const int two_to_cn = 32768;

static unsigned int seed = 35;

int my_random() // not thread safe but let's call that more random!
{
    seed = seed*1664525UL + 1013904223UL; // Numerical Recipes, 32 bit
    return ((seed>>30)&1)-!!((seed>>30)&2); // Credit to Dave!
}

bool allzero(const vector<int>& T)
{
    for(auto x : T)
    {
        if(x!=0)
        {
            return false;
        }
    }
    return true;
}

// Return the position of the first non-zero element
size_t convolve_until_nonzero(size_t max_n, const vector<int>& v1, const vector<int>& v2)
{
    for(size_t i = 0; i<max_n; ++i)
    {
        int result = 0;
        for(size_t j = 0; j<v2.size(); ++j)
        {
            result += v1[i+j]*v2[j];
        }
        if(result!=0)
        {
            return i;
        }
    }
    return max_n;
}

void advance(vector<int>& v)
{
    for(auto &x : v)
    {
        if(x==-1)
        {
            x = 1;
            return;
        }
        x = -1;
    }
}

vector<int> convolve_random_arrays(vector<int> S, int range)
{
    const int iters = 1000;
    int bothzero = 0;
    int firstzero = 0;
    time_t current_time;
    time(&current_time);
    seed = current_time;
    vector<int> F(m);
    vector<int> leadingzerocounts(m+1);
    for(auto &x: leadingzerocounts)
    {
        x = 0;
    }
    for(int i=0; i<range; ++i)
    {
        for(int j=0; j<iters; ++j)
        {
            do
            {
                for(auto &x : F)
                {
                    x = my_random();
                }
            } while(allzero(F));
            leadingzerocounts[convolve_until_nonzero(m, S, F)]++;
        }
        advance(S);
    }
    // Finish adding things up...
    for(int i=m-1; i>0; --i)
    {
        leadingzerocounts[i] += leadingzerocounts[i+1];
    }
    vector<int> withoutfirst(leadingzerocounts.begin()+1, leadingzerocounts.end());
    return withoutfirst;
}

int main(int argc, char* argv[])
{
    vector<int> leadingzerocounts(m);
    for(auto &x: leadingzerocounts)
    {
        x = 0;
    }
    clock_t start = clock();
    vector<int> S(cn);
    for(auto &x : S)
    {
        x = -1;
    }
    vector< future< vector< int > > > fs; // The future results of the threads
    // Go make threads to work on parts of the problem
    for(int i=0; i<nthreads; ++i)
    {
        vector<int> S_reversed = S; // S counts using LSBs but we want the thread start to be in MSBs
        reverse(S_reversed.begin(), S_reversed.end());
        fs.push_back(async(std::launch::async, convolve_random_arrays, S_reversed, two_to_cn/nthreads));
        advance(S);
    }
    // And now collect the data
    for(auto &f : fs)
    {
        vector<int> result = f.get();
        for(int i=0; i<result.size(); ++i)
        {
            leadingzerocounts[i] += result[i];
        }
    }
    for(auto count : leadingzerocounts)
    {
        cout << count << endl;
    }
    return 0;
}
Fortran: 316x
Okay, Fortran: I've got it up to a 316x speedup (earlier versions achieved 106x, 155x, and 160x) when using an Xorshift RNG and OpenMP on a 4-core i7 CPU. Other than that, there are no big tricks. For the iterator to construct S, I just use the binary representation of the 16-bit integer i. You'll note that apart from the inline RNG and the "iterator"/mapping from i to S, the code is just as high-level as the Python code.
Edit: removed the "if" in the Xorshift, now using "r = abs(w/...)" instead of "r=w/...". Goes from 106x to 155x.
Edit2: This generates 15x as many random numbers as the C++ solution. If someone has a zero-overhead solution for converting a random int into an array of 0s and 1s in Fortran, I'm all ears. Then we could beat C++ :)
Edit3: The first edit introduced a bug, as Lembik pointed out. This is fixed now, with a tiny improvement in speedup. I will try to use the suggestion by Eelvex to get more speedup.
Edit4: Profiling indicated that converting to real and back to integer with nint() was slow. I replaced this with one integer division doing both scaling and rounding, going from 160x to 316x speedup.
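The Edit4 division trick can be illustrated in Python; note that int(w / 2**30) truncates toward zero like Fortran integer division (Python's // would floor instead):

```python
# Map a signed 32-bit xorshift output w to {-1, 0, 1} with one
# truncating division: |w| < 2**31, so w/2**30 lies in (-2, 2) and
# truncation leaves -1, 0, or 1. +1 and -1 each occur with
# probability ~1/4 and 0 with probability ~1/2, matching the
# [-1, 0, 0, 1] distribution (ignoring the single input w == -2**31).
def scale(w):
    return int(w / 1073741824)   # 1073741824 == 2**30

assert scale(5) == 0
assert scale(1500000000) == 1
assert scale(-1500000000) == -1
assert scale(2**31 - 1) == 1
assert scale(-(2**31) + 1) == -1
```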
Compile with:
gfortran -O3 -march=native -fopenmp golf.f90
program golf
  implicit none
  integer, parameter :: m=8, n=8
  integer :: F(n), S(m+n-1), leadingzerocounts(m)
  integer :: j,k,bindec,enc,tmp,x=123456789,y=362436069,z=521288629,w=88675123
  integer*2 :: i
  real :: r

  leadingzerocounts=0

  !$OMP parallel do private(i,enc,j,bindec,S,F,k,tmp,x,y,z,w,r) reduction(+:leadingzerocounts) schedule(dynamic)
  do i=0,32766
    enc=i
    ! Short loop to convert i into the array S with -1s and 1s
    do j=16,2,-1
      bindec=2**(j-1)
      if (enc-bindec .ge. 0) then
        S(j-1)=1
        enc=enc-bindec
      else
        S(j-1)=-1
      endif
    end do
    do j=1,1000
      F=0
      do while (.not. any(F /= 0))
        do k=1,n
          ! Start Xorshift RNG
          tmp = ieor(x,ishft(x,11))
          x = y
          y = z
          z = w
          w = ieor(ieor(w,ishft(w,-19)),ieor(tmp,ishft(tmp,-8)))
          ! End Xorshift RNG
          ! Just scale it inside the nint:
          !F(k)=nint(w/2147483648.0)
          ! Scaling by integer division is faster, but then we need the random
          ! number to be in (-2,2) instead of [-1,1]:
          F(k)=w/1073741824
        end do
      end do
      do k=1,m
        if (dot_product(F,S(k:k+n-1)) /= 0) exit
        leadingzerocounts(k)=leadingzerocounts(k)+1
      end do
    end do
  end do
  !$OMP end parallel do

  print *, leadingzerocounts
end
Example output:
$ time ./a.out
6329624 2524831 1039787 438809 193044 6860 40486 19517
./a.out 1.45s user 0.00s system 746% cpu 0.192 total
OP's code:
$ time pypy golf.py
pypy golf.py 60.68s user 0.04s system 99% cpu 1:00.74 total
- What I used in J was a prebuilt list of 4^n numbers in base-4, then converted to triadic and excluding 0. The RNG just chooses from this list. (Eelvex, Apr 30, 2014 at 15:23)
- I am not sure your code is correct. For 100,000 iterations I get 633140285 271390368 118307997 52751245 23725837 10744292 4944464 2388125 but I think it should be closer to 633170604 252560981 104156146 43911426 19316309 8713324 4073378 1959440. This is a consistent difference between runs. (user9206, May 1, 2014 at 10:36)
- Ah, thanks, @Lembik, my edit to speed it up (removing the if-statement) was indeed a bug. I've updated my code so it should be correct now. I will try to post a version using the suggestion by Eelvex later. (semi-extrinsic, May 1, 2014 at 18:29)
- That also sped it up it seems! (user9206, May 1, 2014 at 18:34)
- Yes, slight speedup I guess. I realized I was adding 1.0 and then subtracting 1.0 inside a tight loop. (semi-extrinsic, May 1, 2014 at 20:17)