This is a follow up to How slow is Python really? (Or how fast is your language?).
It turns out it was a bit too easy to get a x100 speedup for my last question. For those who have enjoyed the challenge but want something harder where they can really use their low-level skills, here is part II. The challenge is to get a x100 speedup for the following Python code, as tested on my computer.
To make it more difficult I am using pypy this time. The current timing for me is 1 minute and 7 seconds using pypy 2.2.1.
Rules
- The first person to submit code which I can run, which is correct, and which is x100 faster on my machine will be awarded a bounty of 50 points.
- I will award the win to the fastest code after a week.
import itertools
import operator
import random
n = 8
m = 8
iters = 1000
# creates an array of 0s with length m
# [0, 0, 0, 0, 0, 0, 0, 0]
leadingzerocounts = [0]*m
# itertools.product creates an array of all possible combinations of the
# args passed to it.
#
# Ex:
# itertools.product("ABCD", "xy") --> Ax Ay Bx By Cx Cy Dx Dy
# itertools.product("AB", repeat=5) --> [
# ('A', 'A', 'A', 'A', 'A'),
# ('A', 'A', 'A', 'A', 'B'),
# ('A', 'A', 'A', 'B', 'A'),
# ('A', 'A', 'A', 'B', 'B'),
# etc.
# ]
for S in itertools.product([-1,1], repeat = n+m-1):
for i in xrange(iters):
F = [random.choice([-1,0,0,1]) for j in xrange(n)]
# if the array is made up of only zeros keep recreating it until
# there is at least one nonzero value.
while not any(F):
F = [random.choice([-1,0,0,1]) for j in xrange(n)]
j = 0
while (j < m and sum(map(operator.mul, F, S[j:j+n])) == 0):
leadingzerocounts[j] +=1
j += 1
print leadingzerocounts
The output should be similar to
[6335185, 2526840, 1041967, 439735, 193391, 87083, 40635, 19694]
You must use a random seed in your code and any random number generator that is good enough to give answers close to the above will be accepted.
My Machine The timings will be run on my machine. This is a standard ubuntu install on an AMD FX-8350 Eight-Core Processor. This also means I need to be able to run your code.
Explanation of code
This code iterates over all arrays S of length n+m-1 that are made up of -1s and 1s. For each array S it samples 1000 non-zero random arrays F of length n made up of -1, 0 or 1, with probabilities 1/4, 1/2 and 1/4 of taking each value. It then computes the inner products between F and each window of S of length n until it finds a non-zero inner product. It adds 1 to leadingzerocounts at each position where it found a zero inner product.
Status
- Perl. 2.7 times slowdown by @tobyink. (Compared to pypy not cpython.)
- J. 39 times speedup by @Eelvex.
- C. 59 times speed up by @ace.
- Julia. 197 times faster not including start up time by @one-more-minute. 8.5 times speed up including start up time (it's faster using 4 processors in this case than 8).
- Fortran. 438 times speed up by @semi-extrinsic.
- Rpython. 258 times speed up by @primo.
- C++. 508 times speed up by @ilmale.
(I stopped timing the new improvements because they are too fast and iters was too small.)
It was pointed out that timings below a second are unreliable and also that some languages have a start-up cost. The argument is that if you include that, you should also include the compilation time of C/C++ etc. Here are the timings for the fastest code with the number of iterations increased to 100,000.
- Julia. 42 seconds by @one-more-minute.
- C++. 14 seconds by @GuySirton.
- Fortran. 14s by @semi-extrinsic.
- C++. 12s by @ilmale.
- Rpython. 18s by @primo.
- C++. 5s by @Stefan.
The winner is... Stefan!
Follow-up challenge posted: How high can you go? (A coding+algorithms challenge). This one is harder.
-
3\$\begingroup\$ an explanation of what the code is supposed to achieve would be nice, so we can rewrite it and not simply port it \$\endgroup\$Einacio– Einacio2014年04月28日 16:55:43 +00:00Commented Apr 28, 2014 at 16:55
-
6\$\begingroup\$ "The first person to submit code which I can run, is correct and is x100 times faster on my machine wins immediately and the competition closes." What is the purpose of closing the competition like that? Why not use a date deadline like most others, so we can see it furthered reduced in other languages? \$\endgroup\$grovesNL– grovesNL2014年04月28日 17:44:17 +00:00Commented Apr 28, 2014 at 17:44
-
5\$\begingroup\$ @Einacio That is a nice idea. I changed the rules which I hope no one will mind. \$\endgroup\$user9206– user92062014年04月28日 17:50:41 +00:00Commented Apr 28, 2014 at 17:50
-
1\$\begingroup\$ @Lembik I've improved my Fortran version, making it 2x faster again on my machine. Could you time it again? :) \$\endgroup\$semi-extrinsic– semi-extrinsic2014年05月02日 12:16:48 +00:00Commented May 2, 2014 at 12:16
-
1\$\begingroup\$ @semi-extrinsic Done. \$\endgroup\$user9206– user92062014年05月02日 19:20:15 +00:00Commented May 2, 2014 at 19:20
14 Answers
C++ x530 (previously x150, then x450)
Instead of arrays I used bits (and dark magic).
Thanks to @ace for the faster random function.
How does it work: the first 15 bits of the integer s represent the array S of 15 elements; the zeroes represent -1, the ones represent +1.
The array F is built in a similar way, but with two bits for each symbol:
- 00 represents -1
- 01 and 10 represent 0
- 11 represents 1
Because S and F have different representations, I have to interleave S with itself to make it comparable with F:
- 0 (-1) becomes 00 (-1 in the representation of F)
- 1 (+1) becomes 11 (+1 in the representation of F)
Now we can simply use a Karnaugh table to compute the inner product. Remember that one variable can only assume the value 00 or 11.
0 . 00 = 11 (-1 * -1 = +1)
0 . 01 = 10 (-1 * 0 = 0)
0 . 10 = 01 (-1 * 0 = 0)
0 . 11 = 00 (-1 * +1 = -1)
1 . 00 = 00 (+1 * -1 = -1)
1 . 10 = 10 (+1 * 0 = 0)
1 . 01 = 01 (+1 * 0 = 0)
1 . 11 = 11 (+1 * +1 = +1)
Looks like a NOT XOR (XNOR) to me. :)
Summing the ones is just a game of shifts and masks, nothing really complex.
#include <array>
#include <cstdint>
#include <cstdio>
#include <ctime>
// From standford bithacks
// http://graphics.stanford.edu/~seander/bithacks.html
inline int32_t interleaveBit(int32_t x)
{
static const uint32_t B[] = { 0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF };
x = (x | ( x << 8)) & B[3];
x = (x | ( x << 4)) & B[2];
x = (x | ( x << 2)) & B[1];
x = (x | ( x << 1)) & B[0];
return x | (x << 1);
}
inline int32_t sumOnes(int32_t v)
{
static int b[] = { 1, 0, 0, 1};
int s = 0;
for( int i = 0; i < 8; ++i)
{
const int a = 3&(v>>(i*2));
s += b[a];
}
return s;
}
inline int32_t sumArray(int32_t v)
{
static int b[] = { -1, 0, 0, 1};
int s = 0;
for( int i = 0; i < 8; ++i)
{
const int a = 3&(v>>(i*2));
s += b[a];
}
return s;
}
uint32_t x, y = 24252, z=57768, w=1564; //PRNG seeds
int32_t myRand()
{
uint32_t t;
t = x ^ (x<<1);
x = y;
y = z;
z = w;
w = w ^ ( w >> 19) ^ t ^ (t >> 8);
return w;
}
int main()
{
std::array<int, 8> leadingZero{0};
x = static_cast<int32_t>(time(nullptr)); // seed PRNG
const int maxS = 1 << 15;
for(int s = 0; s < maxS; ++s)
{
const int32_t x = interleaveBit(s);
for(int i = 0; i < 1000; ++i)
{
int32_t random;
do
{
random = 0xFFFF & myRand();
}while(sumOnes(random) == 0);
int j = 7;
while( j >= 0 )
{
const int32_t h = (x >> (j*2));
const int32_t l = 0xFFFF & (~(random ^ h)); // inner product
if(sumArray(l) == 0)
{
leadingZero[j]++;
} else
{
break;
}
j--;
}
}
}
for(int i = 7; i >= 0; --i)
{
printf("%d ", leadingZero[i]);
}
printf("\n");
return 0;
}
Here is a sample output:
6332350 2525218 1041716 438741 192917 87159 41023 19908
real 0m0.372s
user 0m0.371s
sys 0m0.001s
The program has been compiled with:
gcc -std=c++11 -O3 -msse4.2 -Wall -lstdc++ 26371.cpp -o fastPy
on Fedora 20 with gcc 4.8.2. The CPU is an 8-core i7.
Probably I can gain some ms by tweaking compiler parameters.
For comparison, this is the OP's solution time on my machine:
time pypy 26371.py
[6330609, 2523914, 1040885, 439303, 192708, 86987, 40710, 19498]
real 0m53.061s
user 0m53.016s
sys 0m0.022s
Edit:
Just by adding openmp and changing the order of the for loops I got a x3 gain, leading to a x450 performance improvement over the OP's code. :D
In this case the updates to the leadingZero array must be atomic. The globals of the PRNG... they are random, the races will just make them more random.
#pragma omp parallel for
for(int i = 0; i < 1000; ++i)
{
int32_t random;
do
{
random = 0xFFFF & myRand();
}while(sumOnes(random) == 0);
for(int s = 0; s < maxS; ++s)
{
const int32_t x = interleaveBit(s);
int j = 7;
while( j >= 0 )
{
const int32_t h = (x >> (j*2));
const int32_t l = 0xFFFF & (~(random ^ h)); // inner product
if( sumArray(l) == 0 )
{
leadingZero[j]++;
} else
{
break;
}
j--;
}
}
}
You need to add -fopenmp to the compiler flags.
Edit 2: As suggested by user71404, I changed the sumOnes and sumArray functions and now it's uber fast.
real 0m0.101s
user 0m0.101s
sys 0m0.000s
With openmp it is slower, because the atomics add too much overhead.
real 0m0.253s
user 0m1.870s
sys 0m0.001s
Without atomics it's even faster, but I get wrong results.
2137992 1147218 619297 321243 155815 70946 32919 15579
real 0m0.048s
user 0m0.338s
sys 0m0.001s
To understand sumArray, consider that 16 bits represent an array of 8 numbers:
- 00 has no 1s and represents -1
- 01 and 10 have one 1 and represent 0
- 11 has two 1s and represents 1
So the built-in counts the number of bits set to 1 (http://en.wikipedia.org/wiki/Hamming_weight) and from each group we subtract 1. Cool.
sumOnes is just black magic.
Here are the latest compile flags and code.
gcc -std=c++11 -mfpmath=sse -O3 -flto -march=native -funroll-loops -Wall -lstdc++
#include <cstdint>
#include <cstdio>
#include <ctime>
inline int32_t interleaveBit(int32_t x)
{
static const uint32_t B[] = { 0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF };
x = (x | ( x << 8)) & B[3];
x = (x | ( x << 4)) & B[2];
x = (x | ( x << 2)) & B[1];
x = (x | ( x << 1)) & B[0];
return x | (x << 1);
}
inline int32_t sumOnes(int32_t v)
{
/* 0xAAAA == 0b1010 1010 1010 1010 */
return !!(0xAAAA & (v ^ ~(v << 1)));
}
inline int32_t sumArray(int32_t v)
{
return __builtin_popcount(v) - 8;
}
uint32_t x, y = 24252, z = 57768, w = 1564; //PRNG seeds
int32_t myRand()
{
uint32_t t;
t = x ^ (x << 1);
x = y;
y = z;
z = w;
w = w ^ ( w >> 19) ^ t ^ (t >> 8);
return w;
}
int main()
{
int leadingZero[8] = { 0 };
x = static_cast<int32_t>(time(nullptr)); // seed PRNG
const int maxS = 1 << 15;
for( int i = 0; i < 1000; ++i )
{
int32_t random;
do
{
random = 0xFFFF & myRand();
} while(sumOnes(random) == 0 );
for( int s = 0; s < maxS; ++s )
{
const int32_t x = interleaveBit(s);
int j = 7;
while( j >= 0 )
{
const int32_t h = (x >> (j * 2));
const int32_t l = 0xFFFF & (~(random ^ h)); // inner product
if( sumArray(l) == 0 )
{
leadingZero[j]++;
} else
{
break;
}
j--;
}
}
}
printf("[%d, %d, %d, %d, %d, %d, %d, %d]\n",
leadingZero[7], leadingZero[6],
leadingZero[5], leadingZero[4],
leadingZero[3], leadingZero[2],
leadingZero[1], leadingZero[0]);
return 0;
}
-
\$\begingroup\$ Now I can't wait to test this! Sadly this won't be for a few hours. \$\endgroup\$user9206– user92062014年04月30日 09:38:03 +00:00Commented Apr 30, 2014 at 9:38
-
1\$\begingroup\$ The following was in a suggested edit, but I think that it may fit better as a comment. You can replace your sumOnes, sumArray with the following (seems to give me a 2x speed up over the openmp version).
inline int32_t sumOnes(int32_t v) { /* 0xAAAA == 0b1010 1010 1010 1010 */ return !! (0xAAAA & (v ^ ~(v << 1))); } inline int32_t sumArray(int32_t v) { return __builtin_popcount(v) - 8; }
this was suggested by @user71404 \$\endgroup\$user12205– user122052014年04月30日 14:33:41 +00:00Commented Apr 30, 2014 at 14:33 -
\$\begingroup\$ @user71404: the profiler said that the program spent all its time in those two functions, but I was too tired yesterday to think of something better than that. I'll try this evening (UTC). Thanks. \$\endgroup\$ilmale– ilmale2014年04月30日 14:40:57 +00:00Commented Apr 30, 2014 at 14:40
-
\$\begingroup\$ Would you mind changing the second code snippet to be the full copy and pastable code? I must be doing something wrong in my attempts to get your openmp code working so this would help a lot. \$\endgroup\$user9206– user92062014年04月30日 18:14:18 +00:00Commented Apr 30, 2014 at 18:14
-
\$\begingroup\$ Nice. I thought this could be done with bit operations. \$\endgroup\$Guy Sirton– Guy Sirton2014年04月30日 20:09:41 +00:00Commented Apr 30, 2014 at 20:09
C++ bit magic
~16 ms multithreaded, 56 ms singlethreaded. ~x4000 speedup.
(The speedup is based on the multithreaded code on my i7-2820QM and the 1 min 9 seconds mentioned in the question. Since the OP's system has worse single-threaded performance than my CPU but better multi-threaded performance, I expect this number to be accurate.)
The multithreaded part is quite inefficient due to the spawning of threads. I could probably do better by leveraging my custom job library, but that one has bugs under unix systems. For an explanation and almost identical code without threading, refer to https://codegolf.stackexchange.com/a/26485/20965.
Edit
I gave each thread its own RNG and cut the bit length down to 32, which reduced the runtime by a few ms.
#include <iostream>
#include <bitset>
#include <random>
#include <chrono>
#include <stdint.h>
#include <cassert>
#include <array>
#include <tuple>
#include <memory>
#include <thread>
#include <future>
#include <string.h>
#ifdef _MSC_VER
uint32_t popcnt( uint32_t x ){ return _mm_popcnt_u32(x); }
#else
uint32_t popcnt( uint32_t x ){ return __builtin_popcount(x); }
#endif
void convolve()
{
static const unsigned threadCount = 32;
static const unsigned n = 8;
static const unsigned m = 8;
static const unsigned totalIters = 1000;
static_assert( n <= 16, "packing of F fails when n > 16.");
static uint32_t fmask = (1 << n) -1; fmask |= fmask << 16;
std::array< uint32_t, m * threadCount > out;
std::vector< std::future<void> > threads;
for( int threadId = 0; threadId < threadCount; threadId++)
{
threads.emplace_back( std::async( [&, threadId]
{
std::random_device rd;
std::knuth_b gen(rd());
uint32_t nextRandomNumber = gen();
const unsigned iters = totalIters / threadCount;
std::array< uint32_t, m > leadingZeros;
for( auto& x : leadingZeros )
x = 0;
for( unsigned i = 0; i < iters; i++ )
{
// generate random bit mess
uint32_t F;
do {
// this funky looking construction shortens the dependancy chain of F
F = nextRandomNumber & fmask;
nextRandomNumber = gen();
} while ( 0 == ((F % (1 << n)) ^ (F >> 16 )) );
// Assume F is an array with interleaved elements such that F[0] || F[16] is one element
// here MSB(F) & ~LSB(F) returns 1 for all elements that are positive
// and ~MSB(F) & LSB(F) returns 1 for all elements that are negative
// this results in the distribution ( -1, 0, 0, 1 )
// to ease calculations we generate r = LSB(F) and l = MSB(F)
uint32_t r = F % ( 1 << n );
// modulo is required because the behaviour of the leftmost bit is implementation defined
uint32_t l = ( F >> 16 ) % ( 1 << n );
uint32_t posBits = l & ~r;
uint32_t negBits = ~l & r;
assert( (posBits & negBits) == 0 );
uint32_t mask = posBits | negBits;
uint32_t totalBits = popcnt( mask );
// if the amount of -1 and +1's is uneven, sum(S*F) cannot possibly evaluate to 0
if ( totalBits & 1 )
continue;
uint32_t adjF = posBits & ~negBits;
uint32_t desiredBits = totalBits / 2;
uint32_t S = (1 << (n + m -1));
// generate all possible N+1 bit strings
// 1 = +1
// 0 = -1
while ( S-- )
{
for( int shift = 0; shift < m; shift++ )
{
uint32_t s = (S >> shift) % ( 1 << n );
auto firstBits = (s & mask) ^ adjF;
if ( desiredBits == popcnt( firstBits ) )
{
leadingZeros[shift] = leadingZeros[shift] + 1;
}
else
{
break;
}
}
}
}
memcpy( out.data() + (threadId * m), leadingZeros.data(), sizeof( leadingZeros[0] ) * m );
} ));
};
std::array< uint32_t, m > leadingZeros;
for( auto& x : leadingZeros )
x = 0;
for( auto& thread : threads )
{
thread.wait();
}
for( int i = 0; i < (threadCount * m); i++ )
{
leadingZeros[i % m] += out[i];
}
for( auto x : leadingZeros )
std::cout << x << ", ";
std::cout << std::endl;
}
int main()
{
typedef std::chrono::high_resolution_clock clock;
int rounds = 100;
// do some rounds to get the cpu up to speed..
for( int i = 0; i < rounds / 10; i++ )
{
convolve();
}
auto start = clock::now();
for( int i = 0; i < rounds; i++ )
convolve();
auto end = clock::now();
double seconds = std::chrono::duration_cast< std::chrono::microseconds >( end - start ).count() / 1000000.0;
std::cout << seconds/rounds*1000 << " msec/round" << std::endl;
return 0;
}
Sample output:
6317312, 2515072, 1034368, 434048, 190144, 85200, 39804, 19168,
6226944, 2481408, 1031168, 438080, 192896, 86816, 40484, 19490,
6321152, 2524672, 1045376, 442880, 195680, 88464, 41656, 20212,
6330624, 2517504, 1031104, 430208, 187696, 83976, 38976, 18708,
6304768, 2510336, 1030720, 433056, 190880, 86824, 40940, 19840,
6272512, 2494720, 1028160, 432352, 189168, 84752, 39540, 19052,
6233600, 2507520, 1046912, 447008, 198224, 89984, 42092, 20292,
-
\$\begingroup\$ The output isn't right I think, maybe there is a bug? Compare to what is in the question. In particular the last column should be a number close to 19180. \$\endgroup\$user9206– user92062014年05月02日 06:10:31 +00:00Commented May 2, 2014 at 6:10
-
\$\begingroup\$ @Lembik i can see what you mean. I think that the random output is not random enough which creates funky output sometimes. With the C++11 random generator it works fine. I'll fix the code later today. \$\endgroup\$Stefan– Stefan2014年05月02日 12:08:47 +00:00Commented May 2, 2014 at 12:08
-
\$\begingroup\$ I get Stefan.cpp:104:101: error: ‘memcpy’ was not declared in this scope memcpy( out.data() + (threadId * m), leadingZeros.data(), sizeof( leadingZeros[0] ) * m ); \$\endgroup\$user9206– user92062014年05月05日 08:23:37 +00:00Commented May 5, 2014 at 8:23
-
\$\begingroup\$ I think that i need to include string.h. Try it again. \$\endgroup\$Stefan– Stefan2014年05月05日 09:41:08 +00:00Commented May 5, 2014 at 9:41
-
\$\begingroup\$ You compile this with g++ -O3 -std=c++0x -pthread -Wl,--no-as-needed Stefan.cpp -o Stefan \$\endgroup\$user9206– user92062014年05月05日 18:55:51 +00:00Commented May 5, 2014 at 18:55
Julia: 0.7s, 120x faster
As user20768 demonstrated, a straightforward port of the code to Julia is about twice as fast as PyPy. But we can do a lot better than that.
function pleadingzerocounts(; n = 8,
m = 8,
iters = 1000)
@parallel (+) for S = 1:2^(8+8-1)
leading_counts = zeros(Int, m)
F = Array(Int, n)
for i = 1:iters
flag = 0
while flag == 0
for i = 1:n
v = (1-(rand(Int8)&3))%2
@inbounds F[i] = v
flag += v & 1
end
end
for j = 1:m
sum = 0
for i = 1:n
@inbounds sum += S & (1 << (j + i - 2)) > 0 ? F[i] : -F[i]
end
sum == 0 ?
(leading_counts[j] += 1) :
break
end
end
leading_counts
end
end
function main()
# Warm up the JIT
pleadingzerocounts()
# Then go for real
println(@time pleadingzerocounts())
end
You can run this using julia -p 8 -e 'require("golf.jl");main()'
(the 8 is the number of processes, you might want to play around with it). On the latest Julia prerelease this takes 0.7s vs. 1m22s for PyPy.
If you have enough cores on your computer, and perhaps spin up a few AWS instances, you should be able to shave off some more :)
-
\$\begingroup\$ I'm quite sure you're measuring the timing wrong. Python with Pypy is also a JIT-based language, but the timings made by the OP include the JIT compilation time. You are excluding it. I installed the latest Julia git version and tested your code, and on my machine your command with 8 processes takes 6.6 seconds to finish, but it prints "elapsed time 0.588.. seconds". \$\endgroup\$semi-extrinsic– semi-extrinsic2014年04月30日 08:53:55 +00:00Commented Apr 30, 2014 at 8:53
-
\$\begingroup\$ The Python timing does include PyPy startup and JIT warmup, but that takes seconds at most – the difference over a minute of runtime is negligible. I'm happy if the OP changes the way the Python is timed (it won't make any difference), but including Julia's startup time wouldn't be reasonable. \$\endgroup\$one-more-minute– one-more-minute2014年04月30日 09:04:24 +00:00Commented Apr 30, 2014 at 9:04
-
\$\begingroup\$ I asked the OP in the comments to the original question, and he said "The timings should include everything for JIT languages." He also stated he will create a new challenge where the solutions will take much longer than 1 second, leaving Julia in the competition. \$\endgroup\$semi-extrinsic– semi-extrinsic2014年04月30日 12:17:01 +00:00Commented Apr 30, 2014 at 12:17
-
\$\begingroup\$ In that case the optimal solution is to use a serial algorithm – that takes about 2s. I'd publish the code but this competition is now somewhat redundant, since everyone already knows that C++ boots faster than everything else. \$\endgroup\$one-more-minute– one-more-minute2014年04月30日 12:33:56 +00:00Commented Apr 30, 2014 at 12:33
-
\$\begingroup\$ I just posted my Fortran solution, so I don't see why you shouldn't post the fastest Julia one (if you already have the code). \$\endgroup\$semi-extrinsic– semi-extrinsic2014年04月30日 13:04:09 +00:00Commented Apr 30, 2014 at 13:04
C, 1.210s
For comparison, OP's code runs in 1m45.729s on my machine.
Compilation:
gcc -O3 -march=native -fwhole-program -fstrict-aliasing -ftree-vectorize -Wall ./test2.c -o ./test2
Special thanks: @dyp for compilation flags and ideas for optimisations.
#include <stdio.h>
#include <time.h>
#define n (8)
#define m (8)
#define iters (1000)
int leadingzerocounts[m]; // declared as global so initialised to 0
unsigned int x,y=34353,z=57768,w=1564; //PRNG seeds
/* xorshift PRNG
* Taken from https://en.wikipedia.org/wiki/Xorshift#Example_implementation
* Used under CC-By-SA */
int myRand() {
unsigned int t;
t = x ^ (x << 11);
x = y; y = z; z = w;
return w = w ^ (w >> 19) ^ t ^ (t >> 8);
}
int dotproduct(int*F, int*S) {
unsigned int i;
int sum=0;
for(i=0; i<n; i++) {
sum+=F[i]*S[i];
}
return sum;
}
int main() {
unsigned int i, j, tmp;
x=(int)time(NULL); //seed PRNG
int S[n+m-1];
for(i=0; i<(1<<(n+m-1)); i++) {
tmp=i;
for(j=0; j<n+m-1; j++) {
S[j]=(tmp&1)*(-2)+1;
tmp>>=1;
}
for(j=0; j<iters; j++) {
int F[n];
unsigned int k, flag=0;
do {
for(k=0; k<n; k++) {
F[k]=(1-(myRand()&3))%2;
flag+=(F[k]&1);
}
} while(!flag);
for(k=0; k<m&&!dotproduct(F, S+k); k++) {
leadingzerocounts[k]++;
}
}
}
for(i=0; i<m; i++) printf("%d ", leadingzerocounts[i]);
return 0;
}
Sample output:
6334411 2527506 1042239 439328 192914 87005 40847 19728
-
1\$\begingroup\$ Interesting indeed, I can make similar observations when dropping all those optimization flags. Try -march=native -fwhole-program -fstrict-aliasing -ftree-vectorize. Btw. I got down to < 4 s by using some C++11 including a MT19937 plus a uniform_int_distribution. \$\endgroup\$dyp– dyp2014年04月28日 19:12:57 +00:00Commented Apr 28, 2014 at 19:12 -
1\$\begingroup\$ 1.119s making a speedup of roughly 59! \$\endgroup\$user9206– user92062014年04月28日 20:11:10 +00:00Commented Apr 28, 2014 at 20:11
-
1\$\begingroup\$ @ace Yes, I just wanted to point this out. It was simpler for me just to try some of the standard library PRNGs in C++. Note that you can use one 32-bit integer result from a PRNG to produce 8 entries for F. \$\endgroup\$dyp– dyp2014年04月28日 20:17:11 +00:00Commented Apr 28, 2014 at 20:17 -
1\$\begingroup\$ Since n is equal to 8, you can probably use AVX (or 2*SSE) to compute the dot product with a proper S storage. \$\endgroup\$Michael M.– Michael M.2014年04月29日 05:50:47 +00:00Commented Apr 29, 2014 at 5:50 -
2\$\begingroup\$ SSE version, small speed-up: gist.github.com/anonymous/11394210 (don't forget to include smmintrin.h) \$\endgroup\$Michael M.– Michael M.2014年04月29日 08:41:33 +00:00Commented Apr 29, 2014 at 8:41
Perl
This is nowhere near as fast as the C solution, but is pretty fast for a high-level interpreted language I think. It shaves about 40% off the running time of the Python implementation.
#!/usr/bin/env perl
use v5.10;
use strict;
use warnings;
use Algorithm::Combinatorics qw( variations_with_repetition );
use List::Util qw( any sum );
use constant {
N => 8,
M => 8,
ITERS => 1000,
};
my @leadingzerocounts;
my $variations = variations_with_repetition([-1, 1], N + M - 1);
while (my $S = $variations->next)
{
for my $i (1 .. ITERS)
{
my @F;
until (@F and any { $_ } @F)
{
@F = map +((-1,0,0,1)[rand 4]), 1..N;
}
my $j = 0;
while ($j < M)
{
last if sum map $F[$_]*$S->[$j+$_], 0..N-1;
$leadingzerocounts[$j++]++;
}
}
}
say join ", ", @leadingzerocounts;
The Algorithm::Combinatorics module is available in Ubuntu (sudo apt-get install libalgorithm-combinatorics-perl). The other modules used are core Perl modules, so they should already be installed as part of the base Ubuntu installation.
-
1\$\begingroup\$ It won't affect speed, but it's the 0..N-1 range in the last map, right? Did you forget to use warnings? :-) Though the logic in the OP is confusing, the sliding window never gets to the last element of S. \$\endgroup\$user2846289– user28462892014年04月29日 11:21:09 +00:00Commented Apr 29, 2014 at 11:21 -
\$\begingroup\$ Ah. I just figured that the arrays were of mismatched sizes, so I disabled warnings, allowing the missing elements to be treated as zero. N-1 improves this. And it does actually improve the speed very slightly: it's now about 40% faster than the Python implementation. \$\endgroup\$tobyink– tobyink2014年04月29日 13:22:19 +00:00Commented Apr 29, 2014 at 13:22 -
\$\begingroup\$ I think your code requires a very modern version of List::Util. On ubuntu 14.04 I get "any" is not exported by the List::Util module \$\endgroup\$user9206– user92062014年04月30日 18:26:00 +00:00Commented Apr 30, 2014 at 18:26
-
\$\begingroup\$ Oh yes, that's true - you'll probably need to install List::Util off CPAN. any can alternatively be found in List::MoreUtils, which although not a core module is one of the most commonly used CPAN modules. \$\endgroup\$tobyink– tobyink2014年04月30日 18:36:35 +00:00Commented Apr 30, 2014 at 18:36
Julia: 4.66x slower!
I am really beginning to doubt the statistics on their website...
Note that the following Julia code is effectively a direct transcription of the OP's Python code without any optimisations. I use the time() function to exclude Julia's slow startup time...
srand(27182818284590)
t = time()
require("Iterators")
n = 8
m = 8
iters = 1000
bothzero = 0
leadingzerocounts = zeros(m)
for S in Iterators.product(fill([-1,1], n+m-1)...)
for i = 1:iters
F = [[-1 0 0 1][rand(1:4)] for j = 1:n]
while all((x) -> x == 0, F)
F = [[-1 0 0 1][rand(1:4)] for j = 1:n]
end
j = 1
while j <= m && sum(map(*, F, S[j:j+n-1])) == 0
leadingzerocounts[j] += 1
j += 1
end
end
end
println(leadingzerocounts)
t = time() - t
println("$t seconds")
Julia: 5 m 32.912 s
OP's code in PyPy: 1 m 11.506 s
Julia output:
6332170
2525472
1041522
438761
193119
86873
40705
19662
-
\$\begingroup\$ Global variables, imports, and array comprehensions are slow. This isn't how one would typically write a Julia program for performance. \$\endgroup\$Alex A.– Alex A.2015年04月12日 04:54:16 +00:00Commented Apr 12, 2015 at 4:54
Julia: 1 min 21.4s (2.2x faster) (modification of Arman's code)
OP's code in PyPy: 3 min 1.4 s
Both done in the REPL, not including time to load packages.
function foo()
n = 8
m = 8
iters = 1000
bothzero = 0
leadingzerocounts = zeros(Int,m)
P=[-1,0,0,1]
for S in Iterators.product(fill([-1,1], n+m-1)...)
Sm=[S...]
for i = 1:iters
F = P[rand(1:4,n)]
while all(F==0)
F = P[rand(1:4,n)]
end
j = 1
while j <= m && dot(F,Sm[j:j+n-1]) == 0
leadingzerocounts[j] += 1
j += 1
end
end
end
println(leadingzerocounts)
end
There are some problems with Arman's code making it very slow: it uses a lot of anonymous functions and higher-order functions unnecessarily. To test whether all of a vector F is zero, why not just write all(F==0) instead of all(x->x==0,F)? It's shorter, and literally a thousand times faster.
It also uses sum(map(*,x,y)) as the dot product instead of simply dot(x,y). The first version is 650 times slower for a vector of 10k doubles. And the dot product function is implemented as a for loop in pure Julia.
Also, array comprehensions are slow. It is better to write [-1,0,0,1][rand(1:4,n)] instead of [[-1 0 0 1][rand(1:4)] for j = 1:n].
Finally, global variables are bad juju in Julia. Julia is only fast if you write code in such a way that allows the JIT and type inference to work. A big part of this is type stability: the compiler must be able to be sure that a variable's type will not change while inside a loop, for example.
-
\$\begingroup\$ Thanks! I see that I still have quite a bit to learn about the Julia language before I can go making claims about its speed :) Really glad to see that a few trivial fixes to my code improved its execution time severalfold. \$\endgroup\$nimble agar– nimble agar2014年05月01日 06:01:38 +00:00Commented May 1, 2014 at 6:01
RPython 0.187s (258x faster)
Original Source w/ PyPy2.2.1: 1m 6.718s
Now with threading, back-support for standard Python has been dropped. The number of worker threads can be specified as a command line parameter, default is two.
from time import time, sleep
from math import fmod
from rpython.rlib.rthread import start_new_thread, allocate_lock, get_ident
class Random:
__slots__ = ['s']
def __init__(self, s=1):
self.s = s
def init_genrand(self, seed):
self.s = seed
def genrand32(self):
# xorshift PRNG with period 2^32-1
# see http://core.kmi.open.ac.uk/download/pdf/6250138.pdf
self.s ^= (self.s << 13)
self.s ^= (self.s >> 17)
self.s ^= (self.s << 5)
return self.s
class ThreadEnv:
__slots__ = ['n', 'm', 'iters', 'counts', 'running', 'lock']
def __init__(self):
self.n = 8
self.m = 8
self.iters = 1000
self.counts = [0]*8
self.running = 0
self.lock = None
env = ThreadEnv()
truth = [-1,0,0,1]
def main(argv):
argc = len(argv)
if argc < 4 or argc > 5:
print 'Usage: %s N M ITERS [NUM_THREADS=2]'%argv[0]
return 1
if argc == 5:
num_threads = int(argv[4])
else:
num_threads = 2
env.n = int(argv[1])
env.m = int(argv[2])
env.iters = int(argv[3]) // num_threads
env.counts = [0]*env.m
env.lock = allocate_lock()
for i in xrange(num_threads-1):
start_new_thread(run,())
env.running += 1
env.running += 1
# use the main process as a worker
run()
# wait for any laggers
while env.running:
sleep(0.01)
print env.counts
return 0
def run():
n, m, iters = env.n, env.m, env.iters
counts = [0]*m
sbits = [0]*(n+m-1)
random = Random()
seed = int(fmod(time(), 2147483.648)*1000) ^ get_ident()
random.init_genrand(seed)
for S in xrange(1<<n+m-1):
i = 0
sbit = 0
while not sbit:
sbits[i] ^= 3
sbit = sbits[i]
i += 1
for i in xrange(iters):
f = 0
while not f:
F = random.genrand32()
G, x = F, 0
for k in xrange(n):
x += truth[(G&3)^sbits[k]]
f |= x
G >>= 2
if not x:
counts[0] += 1
for j in xrange(1, m):
G, x = F, 0
for k in xrange(j, n+j):
x += truth[(G&3)^sbits[k]]
G >>= 2
if x: break
counts[j] += 1
# passing True stalls until a lock can be obtained
env.lock.acquire(True)
for i in xrange(m):
env.counts[i] += counts[i]
env.running -= 1
env.lock.release()
def target(*args):
return main, None
RPython is a restricted subset of Python, which can be translated to C and then compiled using the RPython Toolchain. Its expressed purpose is to aid in the creation of language interpreters, but it can also be used to compile simple programs like the one above. Most of the 'fancier' features of Python, such as itertools or even map, are not available.
To compile, make a local clone of the current pypy repository, and run the following:
$ pypy %PYPY_REPO%/rpython/bin/rpython --thread convolution.py
The resulting executable will be named convolution-c or similar in the current working directory.
I've parameterized the input variables, so the program should be run as:
convolution-c 8 8 1000
to match the sample code.
Implementation Notes
S in itertools.product([-1,1], repeat = n+m-1) becomes S in xrange(1<<n+m-1), interpreting S as a bit map: [0, 1] → [-1, 1]
Likewise, F is also a bit map, with each two bits representing a single value: [00, 01, 10, 11] → [-1, 0, 0, 1]
A truth table is used to look up the product, rather than performing a multiplication.
Because 32-bit signed integers are used, n may be no larger than 15, and n+m no larger than 31. Arbitrary integer support can be achieved with the rpython.rlib.rbigint module, if necessary.
The first iteration of the dot-product loop is unrolled, and combined with the nullity test of F.
A homebrew PRNG is used, source listed. The author of the paper demonstrates a period of 2^32 - 1, and claims that it passes all Diehard tests save one, although I haven't personally confirmed this.
The random seed changes every millisecond, which is about as good as using a timestamp will allow. Additionally, each worker thread xors its process id with this value, to ensure that they each have a different seed.
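A rough CPython equivalent of that seeding scheme (a hypothetical sketch, not the exact RPython arithmetic shown in the listing):

```python
import time
import threading

def make_seed():
    # Sketch of the scheme described above: a millisecond-resolution
    # timestamp folded into 31 bits, XOR-ed with a per-thread identifier
    # so that workers started in the same millisecond still diverge.
    millis = int(time.time() * 1000) & 0x7FFFFFFF
    return millis ^ threading.get_ident()
```

Since XOR-ing one base value with distinct identifiers always yields distinct results, two workers never share a seed within the same millisecond.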
Sample Timings
2 worker threads:
$ timeit convolution-c 8 8 1000 2
[6331845, 2526161, 1042330, 440018, 193724, 87147, 40943, 19603]
Elapsed Time: 0:00:00.375
Process Time: 0:00:00.687
System Calls: 6927
4 worker threads:
$ timeit convolution-c 8 8 1000 4
[6334565, 2527684, 1043502, 440216, 193225, 87398, 40799, 19338]
Elapsed Time: 0:00:00.218
Process Time: 0:00:00.796
System Calls: 3417
8 worker threads:
$ timeit convolution-c 8 8 1000 8
[6327639, 2522483, 1039869, 437884, 192460, 86771, 40420, 19403]
Elapsed Time: 0:00:00.187
Process Time: 0:00:00.734
System Calls: 3165
OP's original source:
$ timeit pypy convolution-orig.py
[6330610, 2525644, 1041481, 438980, 193001, 86622, 40598, 19449]
Elapsed Time: 0:01:06.718
Process Time: 0:01:06.718
System Calls: 11599808
Timing for 100000 iterations:
$ timeit convolution-c 8 8 100000 8
[633156171, 252540679, 104129386, 43903716, 19307215, 8709157, 4072133, 1959124]
Elapsed Time: 0:00:16.625
Process Time: 0:01:02.406
System Calls: 171341
- I have never seen an rpython program before. This is great. Now is there an equivalent pure python program that pypy can run in 1.03s? (user9206, Apr 30, 2014 at 11:41)
- @Lembik I'd like to see one. I thought 4.7s was pretty good, considering my first attempt at pure python was ~15s. (primo, Apr 30, 2014 at 13:01)
- Yes, sorry for the delay. I haven't got the code running yet but will as soon as possible. (user9206, May 2, 2014 at 11:51)
- You should try adding a JIT. Now THAT would be fast! (kirbyfan64sos, May 2, 2014 at 22:21)
- @Lembik thanks for the mention ;) Out of curiosity, did it run fastest with 4 worker threads, or 8? (primo, May 2, 2014 at 23:28)
Nimrod
import times, locks, strutils, unsigned

const
  N = 8
  M = 8
  iters = 1000
  numThreads = 8

type
  SVec = array[0..N+M-1, int]
  FVec = array[0..N-1, int]
  ComputeThread = TThread[int]

var
  rngSeed = int(epochTime()*1000)
  totalLeadingZeros: array[0..M-1, int]
  lock: TLock

type
  RNGState = object
    x, y, z, w: uint32

proc newRNG(seed: int): RNGState =
  result.x = uint32(seed)

proc random(rng: var RNGState): int =
  let t = rng.x xor (rng.x shl 11)
  rng.x = rng.y; rng.y = rng.z; rng.z = rng.w
  rng.w = rng.w xor (rng.w shr 19) xor t xor (t shr 8)
  result = int(rng.w)

proc initVecRand(v: var FVec, rng: var RNGState) =
  const values = [ -1, 0, 0, 1 ]
  var rnd = rng.random
  var bitAcc = 0
  for i in 0 .. <len(v):
    let val = values[rnd and 3]
    rnd = rnd shr 2
    v[i] = val
    bitAcc = bitAcc or val
  if bitAcc == 0:
    initVecRand(v, rng)

proc convolve(s: SVec, f: FVec, offset: int): int =
  for i in 0 .. <len(f):
    result += s[i+offset]*f[i]

proc iterate(v: var SVec) =
  for i in 0 .. <len(v):
    if v[i] == -1:
      v[i] = 1
      return
    v[i] = -1

proc mainThread(id: int) {.thread.} =
  const numS = 1 shl (N+M-1)
  var
    s: SVec
    f: FVec
    leadingZeros: array[0..M-1, int]
    rng = newRNG(rngSeed + id)
  for k in 0 .. <len(s):
    s[k] = -1
  for i in 1..numS:
    for j in countUp(id, iters, numThreads):
      initVecRand(f, rng)
      if convolve(s, f, 0) == 0:
        leadingZeros[0] += 1
        for k in 1 .. <M:
          if convolve(s, f, k) == 0:
            leadingZeros[k] += 1
          else:
            break
    iterate(s)
  acquire(lock)
  for i in 0 .. <M:
    totalLeadingZeros[i] += leadingZeros[i]
  release(lock)

proc main =
  let startTime = epochTime()
  var threads: array[1..numThreads, ComputeThread]
  initLock(lock)
  for i in 1..numThreads:
    createThread(threads[i], mainThread, i)
  for i in 1..numThreads:
    joinThread(threads[i])
  echo("Leading zeros: ", @totalLeadingZeros)
  let endTime = epochTime()
  echo("Time taken: ", formatFloat(endTime - startTime, ffDecimal, 3),
       " seconds")

main()
Example output:
Leading zeros: @[6333025, 2525808, 1042466, 439138, 192391, 86751, 40671, 19525]
Time taken: 0.145 seconds
Nimrod compiles to C, therefore the choice of C compiler for the backend matters, too.
Using clang, compile with:
nimrod cc --threads:on --cc=clang --passc:-flto -d:release conv.nim
Using gcc, compile with:
nimrod cc --threads:on --cc=gcc --passc:-flto -d:release conv.nim
Omit --passc:-flto if you have an older C compiler that doesn't support LTO. Omit the --cc=... option if you are fine with the default choice for the C compiler. The code requires Nimrod 0.9.4 or 0.9.5.
On my quadcore iMac (2.66 GHz core i5), the code runs in about .15 seconds with gcc 4.9, .16 seconds with clang, compared to 88 seconds for PyPy 2.2.1 (i.e. a 500+ times speedup). Unfortunately, I don't have access to a machine with more than four cores that also has PyPy installed or where I could easily install PyPy, though I get about .1 seconds (with a lot of measurement noise) on a 64-core AMD Opteron 6376 1.4 GHz (according to /proc/cpuinfo) with gcc 4.4.6.
The implementation tries to be faithful to the original rather than optimizing code at the cost of readability, while not forgoing obvious optimizations. Interestingly enough, tail recursion in initVecRand() is a bit faster than a loop with a break instruction with both gcc and clang. Manually unrolling one iteration of the convolve test loop inside the main loop also produced a speedup, presumably due to better branch prediction.
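The xorshift128 generator embedded here (and reused by the Java and Fortran answers below) is easy to sanity-check in Python with explicit 32-bit masking. Seeded with Marsaglia's reference constants (the ones the Fortran answer uses; this Nimrod port seeds only x from the clock), the first output is 3701687786:

```python
MASK = 0xFFFFFFFF  # keep every intermediate in 32 bits

def xorshift128(state):
    # One step of Marsaglia's xorshift128; state is the mutable list [x, y, z, w].
    x, y, z, w = state
    t = (x ^ (x << 11)) & MASK
    new_w = (w ^ (w >> 19) ^ t ^ (t >> 8)) & MASK
    state[:] = [y, z, w, new_w]
    return new_w

state = [123456789, 362436069, 521288629, 88675123]
print(xorshift128(state))  # 3701687786
```

Note the ports differ in word-size details (the Java version, for instance, mixes signed int and long shifts), so this is a sketch of the algorithm rather than a bit-exact model of each answer.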
- How do you get nimrod for ubuntu? (user9206, May 1, 2014 at 13:42)
- @Lembik A quick Google search would give you nimrod-lang.org/download.html (user12205, May 1, 2014 at 13:54)
- @ace I had also included the link in my post (though it's hard to see with blue on black now that I look at it). (Reimer Behrends, May 1, 2014 at 14:02)
- You could speed this up even more by increasing the seed size to 128 bits: xorshift.di.unimi.it (flaviut, May 2, 2014 at 22:36)
Java
I translated the above C++ solution to Java:
import java.util.Random;
import java.util.Arrays;

public class Bench2 {
    public static int[] bits = { 0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF };
    public static int[] oneValues = { 1, 0, 0, 1 };
    public static int[] values = { -1, 0, 0, 1 };
    public static int n = 8;
    public static int m = 8;
    public static int iters = 1000;
    private static int x,y=34353,z=57768,w=1564;

    public static void main( String[] args ) {
        x = (int) (System.currentTimeMillis()/1000l);
        int[] leadingzerocounts = new int[ m ];
        Arrays.fill( leadingzerocounts, 0 );
        int maxS = 1 << 15;
        for( int s = 0; s < maxS; s++ ) {
            int x = interleaveBit( s );
            for( int i=0; i<iters; i++ ) {
                int random;
                do {
                    random = 0xFFFF & fastRandom( );
                } while( sumOnes( random ) == 0 );
                int j = 7;
                while( j >= 0 ) {
                    int h = ( x >> (j*2) );
                    int l = 0xFFFF & (~(random ^ h));
                    if( sumArray( l ) == 0 ) {
                        leadingzerocounts[ j ]++;
                    } else {
                        break;
                    }
                    j--;
                }
            }
        }
        for( int i = 7; i >= 0; --i ) {
            System.out.print( leadingzerocounts[ i ] + " " );
        }
        System.out.println( );
    }

    public static int interleaveBit( int x ) {
        x = (x | ( x << 8)) & bits[3];
        x = (x | ( x << 4)) & bits[2];
        x = (x | ( x << 2)) & bits[1];
        x = (x | ( x << 1)) & bits[0];
        return x | (x << 1);
    }

    public static int sumOnes( int v ) {
        return (0xAAAA & (v ^ ~(v << 1)));
        // int s = 0;
        // for( int i = 0; i < 8; ++i ) {
        //     int a = 3 & ( v >> (i*2) );
        //     s += oneValues[ a ];
        // }
        // return s;
    }

    public static int sumArray( int v ) {
        return Integer.bitCount( v ) - 8;
        // int s = 0;
        // for( int i=0; i<8; ++i ) {
        //     int a = 3 & ( v >> (i*2) );
        //     s += values[ a ];
        // }
        // return s;
    }

    public static int fastRandom( ) {
        long t;
        t = x ^ (x << 11);
        x = y; y = z; z = w;
        return w = (int)( w ^ (w >> 19) ^ t ^ (t >> 8));
    }
}
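The commented-out loops document what the bit tricks compute. The identity behind sumArray (the popcount of ~(F ^ h) over 16 bits, minus 8, equals the dot product when F packs values from [-1, 0, 0, 1] two bits at a time and each +1/-1 of S is packed as 11/00) can be spot-checked in Python. Here h is built directly from a +1/-1 vector rather than via interleaveBit:

```python
import random

VALUES = [-1, 0, 0, 1]  # 2-bit F code -> value

def naive_dot(fcode, svec):
    # Plain dot product of the decoded 8-element vectors.
    return sum(VALUES[(fcode >> (2*i)) & 3] * svec[i] for i in range(8))

def bit_dot(fcode, svec):
    # The Java answer's trick: pack each +1 of S as 0b11 and each -1
    # as 0b00; popcount(~(F ^ h) & 0xFFFF) - 8 then equals the dot product,
    # because each 2-bit group contributes popcount(group) - 1.
    h = 0
    for i, s in enumerate(svec):
        if s == 1:
            h |= 0b11 << (2*i)
    l = ~(fcode ^ h) & 0xFFFF
    return bin(l).count("1") - 8

random.seed(0)
for _ in range(1000):
    fcode = random.getrandbits(16)
    svec = [random.choice((-1, 1)) for _ in range(8)]
    assert naive_dot(fcode, svec) == bit_dot(fcode, svec)
print("bit trick matches naive dot product")
```

Per 2-bit group, XOR with 0b11 complements the F code (negating its value) while XOR with 0b00 leaves it alone, and popcount(code) - 1 decodes the value; summing over 8 groups gives the -8 offset.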
On my machine I get the following output for the Java program:
time java Bench2
6330616 2524569 1040372 439615 193290 87131 40651 19607
java Bench2 0.36s user 0.02s system 102% cpu 0.371 total
The OP's program runs about 53 seconds on my machine:
time pypy start.py
[6330944, 2524897, 1040621, 439317, 192731, 86850, 40830, 19555]
pypy start.py 52.96s user 0.06s system 99% cpu 53.271 total
The C++ program executed in only about 0.15 seconds:
time ./benchcc
[6112256, 2461184, 1025152, 435584, 193376, 87400, 40924, 19700]
./benchcc 0.15s user 0.00s system 99% cpu 0.151 total
That is about 2.5x faster than the corresponding Java solution (I didn't exclude VM startup). This Java solution is about 142x faster than the program executed with PyPy.
Since I was personally interested, I set iters to 100_000 for Java and C++, but the factor of 2.5 didn't decrease in favor of Java; if anything, it got bigger.
EDIT: I ran the programs on a 64bit Arch Linux PC.
EDIT2: I want to add that I started with a rough translation of the python code:
import java.util.Random;
import java.util.Arrays;

public class Bench {
    public static int[] values = { -1, 0, 0, 1 };
    public static int n = 8;
    public static int m = 8;
    public static int iters = 1000;
    private static int x,y=34353,z=57768,w=1564;

    public static void main( String[] args ) {
        x = (int) (System.currentTimeMillis()/1000l);
        int[] leadingzerocounts = new int[ m ];
        Arrays.fill( leadingzerocounts, 0 );
        int[] S = new int[ n+m-1 ];
        Arrays.fill( S, -1 );
        do {
            for( int i=0; i<iters; i++ ) {
                int[] F = new int[ n ];
                do {
                    randomArray( F );
                } while( containsOnlyZeros( F ) );
                for( int j=0; j < m && check( F, S, j ); j++ ) {
                    leadingzerocounts[ j ] += 1;
                }
            }
        } while( next( S ) );
        System.out.println( Arrays.toString( leadingzerocounts ) );
    }

    public static void randomArray( int[] F ) {
        for( int i = 0; i<F.length; i++ ) {
            F[ i ] = (1-(fastRandom()&3))%2;
        }
    }

    public static boolean containsOnlyZeros( int[] F ) {
        for( int x : F ) {
            if( x != 0 ) {
                return false;
            }
        }
        return true;
    }

    public static boolean next( int[] S ) {
        for( int i=0; i<S.length; i++ ) {
            if( ( S[ i ] = -S[ i ] ) == 1 ) {
                return true;
            }
        }
        return false;
    }

    public static boolean check( int[] F, int[] S, int j ) {
        int sum = 0;
        for( int i=0; i<n; i++ ) {
            sum += F[ i ] * S[ j + i ];
        }
        return sum == 0;
    }

    public static int fastRandom( ) {
        long t;
        t = x ^ (x << 11);
        x = y; y = z; z = w;
        return w = (int)( w ^ (w >> 19) ^ t ^ (t >> 8));
    }
}
This program ran in about 3.6 seconds:
time java Bench
[6330034, 2524369, 1040723, 439261, 193673, 87338, 40840, 19567]
java Bench 3.64s user 0.01s system 101% cpu 3.600 total
Which is about 14 times faster than the PyPy solution. (Choosing the standard random function over the fastRandom function leads to an execution time of 5 seconds)
Python 3.5 + numpy 1.10.1, 3.76 seconds
The tests were run on my Macbook Pro. OP's code took ~6 mins on the same machine.
The reason I'm answering this question, in fact, is that I don't have 10 reputation and can't answer Part I :-p
For the past few days, I have been trying to figure out how to perform massive convolutions efficiently with numpy (without relying on a third-party package, even scipy). As I came across this series of challenges during my research, I decided to give it a try. I may have come to this game too late, but here is my attempt using Python 3.5 and numpy 1.10.1.
import itertools
import numpy as np

def test_convolv():
    n = 8
    m = 8
    iters = 1000
    ilow = np.ceil(0+n/2).astype(int)
    ihigh = np.ceil(m+n/2).astype(int)
    leadingzerocounts = np.zeros(m)
    # Pre-compute S & F
    S = np.array(list(itertools.product([-1,1], repeat = n+m-1)))
    choicesF = np.random.choice(np.array([-1, 0, 0, 1], dtype=np.int8), size=n*iters).reshape(iters,n)
    imask = ~np.any(choicesF, axis=1)
    while np.any(imask):
        imasksize = np.count_nonzero(imask)
        choicesF[imask,:] = np.random.choice(np.array([-1, 0, 0, 1], dtype=np.int8), size=n*imasksize).reshape(imasksize, n)
        imask = ~np.any(choicesF, axis=1)
    for i in np.arange(iters):
        F = choicesF[i, :]
        # This is where the magic is: by flattening the S array,
        # I try to take advantage of speed of the np.convolve
        # (really numpy.multiarray.correlate).
        FS = (np.convolve(S.reshape(-1), F, 'same').reshape(S.shape))[:, ilow:ihigh]
        jmask_not = (FS[:, 0] != 0)
        leadingzerocounts[0] = leadingzerocounts[0]+np.count_nonzero(~jmask_not)
        for j in np.arange(n-1)+1:
            jmask = (FS[jmask_not, j] != 0)
            leadingzerocounts[j] = leadingzerocounts[j] + np.count_nonzero(~jmask)
            jmask_not[(jmask_not.nonzero()[0])[jmask]] = False
    print(leadingzerocounts)
I pre-computed the S and F arrays, and flattened the S array while performing the convolution, which (based on my experiments) could take advantage of the speed of np.convolve. In other words, as I didn't find a vectorized convolution routine, I fake-vectorized the code by flattening the whole array and hoped np.convolve would do the vectorization under the hood for me, which seemed to work. Note I used mode='same' and trimmed the leading and trailing elements that were useless.
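A small sanity check of the np.convolve behavior this relies on (toy arrays, not the challenge data): 'same' is the centered slice of 'full', and the sliding dot products the challenge needs are convolution with the kernel reversed:

```python
import numpy as np

a = np.array([1, 2, 3, 4])
f = np.array([1, 0, -1])

full = np.convolve(a, f)              # length len(a) + len(f) - 1 == 6
same = np.convolve(a, f, 'same')      # centered slice of 'full'
assert np.array_equal(same, full[1:5])

# Sliding dot products of f against a == 'valid' convolution with f reversed
corr = np.convolve(a, f[::-1], 'valid')
expected = np.array([a[i:i+3] @ f for i in range(2)])
assert np.array_equal(corr, expected)
print(same, corr)
```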
On my Macbook Pro, the test results give 3.76 seconds. When I ran OP's code (modified to Python 3.5), I got about 6 minutes. The speedup is about 100 times.
One drawback is that because the S and F arrays are to be stored, the memory requirement can be a problem if the sizes are too big.
I used the same method for Part I and I got a ~ 60-100x speedup on my laptop.
As I did everything on my Macbook Pro, if someone could test my code and let me know how it goes on your machine, I would appreciate it very much!
J, ~50x speedup? (previously claimed 130x)
n =: m =: 8
len =: 1000
S =: (] - 0 = ])S0=: #:i.2^<:+/n,m
k =: (n#0) -.~ (_1 0 0 1) {~ (n#4) #: i.4^n
sn =: (]-0=])#:i.2^n
ku =: ~. k
M =: 0=+/"1 sn *"1/ ku
fs =: (ku&i.)"1 k
snum =: n #.\"1 S0
run =: 3 : 0
r =: n#0
for_t. (snum) do.
rn =: fs{~? len # #k
r =: r + +/"1*/\rn{"1 t{M
end.
r
)
echo run 0
exit''
Times on a random debian:
u#>time j slowpy.ijs
6334123 2526955 1041600 440039 193567 87321 40754 19714
real 0m2.453s
user 0m2.368s
sys 0m0.084s
u#>time python slow_pyth.py
[6331017, 2524166, 1041731, 438731, 193599, 87578, 40919, 19705]
real 5m25.541s
user 5m25.548s
sys 0m0.012s
I think there is room for improvement.
- The Python script is supposed to be executed using pypy, not python, which is why your script seems to be giving a 130x speed-up. (user12205, Apr 29, 2014 at 12:11)
- @ace yes I noticed but I can't install pypy :-/ I think the order of magnitude will remain though. (Eelvex, Apr 29, 2014 at 12:25)
- Not necessarily... i.imgur.com/n566hzw.png (user12205, Apr 29, 2014 at 12:45)
- Indeed, not necessarily. (Eelvex, Apr 29, 2014 at 12:48)
- What problem do you have installing pypy? (user9206, Apr 29, 2014 at 13:16)
C++: x200 (4-core i7, should scale to x400 on 8-core)
Trying for a more straightforward C++11 (Tested with VS 2012, gcc and clang) solution with parallelization.
To get this to compile and run under Linux with gcc 4.8.1:
g++ -O3 -msse -msse2 -msse3 -march=native -std=c++11 -pthread -Wl,--no-as-needed golf.cpp
Under Linux we also need std::launch::async to force multiple threads. I was missing that in an earlier version.
In Visual Studio (2012+) this should just work but make a release build for timing...
On my oldish dual-core i3 this runs in ~0.9 seconds. On my quad-core i7 it's 0.319s vs. 66 seconds for pypy.
On an 8-core i7 this should be in the x400 speedup range. Switching to C-style arrays would speed it up, but I was interested in staying with C++ containers. For me it's interesting to see the speedup you can get while staying relatively close to the problem domain and at a relatively high level, something I think C++ is really good at. Also of note is the relative ease of parallelization using C++11 constructs.
@ilmale's bit solution is very cool and works for -1/1/0. One could also throw SSE at this and maybe get a significant speedup.
Beyond the parallelization there's another "trick" in there which is reducing the number of summations. Sample results: 6332947 2525357 1041957 438353 193024 87331 40902 19649
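The summation trick: each trial records only the position of the first non-zero convolution, and the per-position leading-zero counts are recovered at the end with one suffix sum. A toy check in Python:

```python
m = 8

def naive(first_nonzero_positions):
    # Direct translation of the original: increment counts[0..p-1]
    # for a trial whose first non-zero convolution is at position p.
    counts = [0]*m
    for p in first_nonzero_positions:
        for j in range(p):
            counts[j] += 1
    return counts

def suffix_sum(first_nonzero_positions):
    # The C++ answer's approach: histogram the positions 0..m, then
    # counts[j] = number of trials whose first non-zero is > j.
    hist = [0]*(m+1)
    for p in first_nonzero_positions:
        hist[p] += 1
    for i in range(m-1, -1, -1):
        hist[i] += hist[i+1]
    return hist[1:]          # drop hist[0] (== total trial count)

positions = [0, 3, 8, 1, 2, 2, 0, 5]
assert naive(positions) == suffix_sum(positions)
print("suffix sum matches per-position increments")
```

This replaces up to m counter increments per trial with a single histogram bump, mirroring convolve_until_nonzero plus the "Finish adding things up" loop in the listing below.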
#include <vector>
#include <iostream>
#include <thread>
#include <future>
#include <time.h>
#include <ctime>
#include <algorithm>

using namespace std;

// Bring some of these constants out to share
const size_t m = 8;
const int nthreads = 16;
const size_t cn = 15;
const int two_to_cn = 32768;

static unsigned int seed = 35;

int my_random() // not thread safe but let's call that more random!
{
    seed = seed*1664525UL + 1013904223UL; // Numerical Recipes, 32 bit
    return ((seed>>30)&1)-!!((seed>>30)&2); // Credit to Dave!
}

bool allzero(const vector<int>& T)
{
    for(auto x : T)
    {
        if(x!=0)
        {
            return false;
        }
    }
    return true;
}

// Return the position of the first non-zero element
size_t convolve_until_nonzero(size_t max_n, const vector<int>& v1, const vector<int>& v2)
{
    for(size_t i = 0; i<max_n; ++i)
    {
        int result = 0;
        for(size_t j = 0; j<v2.size(); ++j)
        {
            result += v1[i+j]*v2[j];
        }
        if(result!=0)
        {
            return i;
        }
    }
    return max_n;
}

void advance(vector<int>& v)
{
    for(auto &x : v)
    {
        if(x==-1)
        {
            x = 1;
            return;
        }
        x = -1;
    }
}

vector<int> convolve_random_arrays(vector<int> S, int range)
{
    const int iters = 1000;
    int bothzero = 0;
    int firstzero = 0;
    time_t current_time;
    time(&current_time);
    seed = current_time;
    vector<int> F(m);
    vector<int> leadingzerocounts(m+1);
    for(auto &x: leadingzerocounts)
    {
        x = 0;
    }
    for(int i=0; i<range; ++i)
    {
        for(int j=0; j<iters; ++j)
        {
            do
            {
                for(auto &x : F)
                {
                    x = my_random();
                }
            } while(allzero(F));
            leadingzerocounts[convolve_until_nonzero(m, S, F)]++;
        }
        advance(S);
    }
    // Finish adding things up...
    for(int i=m-1; i>0; --i)
    {
        leadingzerocounts[i] += leadingzerocounts[i+1];
    }
    vector<int> withoutfirst(leadingzerocounts.begin()+1, leadingzerocounts.end());
    return withoutfirst;
}

int main(int argc, char* argv[])
{
    vector<int> leadingzerocounts(m);
    for(auto &x: leadingzerocounts)
    {
        x = 0;
    }
    clock_t start = clock();
    vector<int> S(cn);
    for(auto &x : S)
    {
        x = -1;
    }
    vector< future< vector< int > > > fs; // The future results of the threads
    // Go make threads to work on parts of the problem
    for(int i=0; i<nthreads; ++i)
    {
        vector<int> S_reversed = S; // S counts using LSBs but we want the thread start to be in MSBs
        reverse(S_reversed.begin(), S_reversed.end());
        fs.push_back(async(std::launch::async, convolve_random_arrays, S_reversed, two_to_cn/nthreads));
        advance(S);
    }
    // And now collect the data
    for(auto &f : fs)
    {
        vector<int> result = f.get();
        for(int i=0; i<result.size(); ++i)
        {
            leadingzerocounts[i] += result[i];
        }
    }
    for(auto count : leadingzerocounts)
    {
        cout << count << endl;
    }
    return 0;
}
Fortran: 316x
Okay, Fortran: I've got it up to a 316x speedup (earlier versions achieved 106x, 155x, and 160x) when using an Xorshift RNG and OpenMP on a 4-core i7 CPU. Other than that, there are no big tricks. For the iterator to construct S, I just use the binary representation of the 16-bit integer i. You'll note that apart from the inline RNG and the "iterator"/mapping from i to S, the code is just as high-level as the Python code.
Edit: removed the "if" in the Xorshift, now using "r = abs(w/...)" instead of "r=w/...". Goes from 106x to 155x.
Edit2: This generates 15x as many random numbers as the C++ solution. If someone has a zero-overhead solution for converting a random int into an array of 0s and 1s in Fortran, I'm all ears. Then we could beat C++ :)
Edit3: The first edit introduced a bug, as Lembik pointed out. This is fixed now, with a tiny improvement in speedup. I will try to use the suggestion by Eelvex to get more speedup.
Edit4: Profiling indicated that converting to real and back to integer with nint() was slow. I replaced this with one integer division doing both scaling and rounding, going from 160x to 316x speedup.
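The Edit4 division trick can be illustrated in Python; note that int(w / 2**30) truncates toward zero like Fortran integer division (Python's // would floor instead):

```python
# Map a signed 32-bit xorshift output w to {-1, 0, 1} with one
# truncating division: |w| < 2**31, so w/2**30 lies in (-2, 2) and
# truncation leaves -1, 0, or 1. +1 and -1 each occur with
# probability ~1/4 and 0 with probability ~1/2, matching the
# [-1, 0, 0, 1] distribution (ignoring the single input w == -2**31).
def scale(w):
    return int(w / 1073741824)   # 1073741824 == 2**30

assert scale(5) == 0
assert scale(1500000000) == 1
assert scale(-1500000000) == -1
assert scale(2**31 - 1) == 1
assert scale(-(2**31) + 1) == -1
```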
Compile with:
gfortran -O3 -march=native -fopenmp golf.f90
program golf
  implicit none
  integer, parameter :: m=8, n=8
  integer :: F(n), S(m+n-1), leadingzerocounts(m)
  integer :: j,k,bindec,enc,tmp,x=123456789,y=362436069,z=521288629,w=88675123
  integer*2 :: i
  real :: r

  leadingzerocounts=0

  !$OMP parallel do private(i,enc,j,bindec,S,F,k,tmp,x,y,z,w,r) reduction(+:leadingzerocounts) schedule(dynamic)
  do i=0,32766
    enc=i
    ! Short loop to convert i into the array S with -1s and 1s
    do j=16,2,-1
      bindec=2**(j-1)
      if (enc-bindec .ge. 0) then
        S(j-1)=1
        enc=enc-bindec
      else
        S(j-1)=-1
      endif
    end do
    do j=1,1000
      F=0
      do while (.not. any(F /= 0))
        do k=1,n
          ! Start Xorshift RNG
          tmp = ieor(x,ishft(x,11))
          x = y
          y = z
          z = w
          w = ieor(ieor(w,ishft(w,-19)),ieor(tmp,ishft(tmp,-8)))
          ! End Xorshift RNG
          ! Just scale it inside the nint:
          !F(k)=nint(w/2147483648.0)
          ! Scaling by integer division is faster, but then we need the random
          ! number to be in (-2,2) instead of [-1,1]:
          F(k)=w/1073741824
        end do
      end do
      do k=1,m
        if (dot_product(F,S(k:k+n-1)) /= 0) exit
        leadingzerocounts(k)=leadingzerocounts(k)+1
      end do
    end do
  end do
  !$OMP end parallel do

  print *, leadingzerocounts
end
Example output:
$ time ./a.out
6329624 2524831 1039787 438809 193044 6860 40486 19517
./a.out 1.45s user 0.00s system 746% cpu 0.192 total
OP's code:
$ time pypy golf.py
pypy golf.py 60.68s user 0.04s system 99% cpu 1:00.74 total
- What I used in J was a prebuilt list of 4^n numbers in base-4, then converted to triadic and excluding 0. The RNG just chooses from this list. (Eelvex, Apr 30, 2014 at 15:23)
- I am not sure your code is correct. For 100,000 iterations I get 633140285 271390368 118307997 52751245 23725837 10744292 4944464 2388125 but I think it should be closer to 633170604 252560981 104156146 43911426 19316309 8713324 4073378 1959440. This is a consistent difference between runs. (user9206, May 1, 2014 at 10:36)
- Ah, thanks, @Lembik, my edit to speed it up (removing the if-statement) was indeed a bug. I've updated my code so it should be correct now. I will try to post a version using the suggestion by Eelvex later. (semi-extrinsic, May 1, 2014 at 18:29)
- That also sped it up it seems! (user9206, May 1, 2014 at 18:34)
- Yes, slight speedup I guess. I realized I was adding 1.0 and then subtracting 1.0 inside a tight loop. (semi-extrinsic, May 1, 2014 at 20:17)