This code review request follows up on my previous request, Computational verification of Collatz conjecture. Unlike the previous program (which was designed for the CPU), this code should run on modern GPUs. For this purpose, I chose OpenCL.
Prerequisites
Since the verification program needs to deal with 128-bit arithmetic, the first thing I had to solve was the availability of a 128-bit integer type. So far, I rely on the __int128 compiler extension:

    typedef unsigned __int128 uint128_t;
At this point, my first question arises: how efficient is this solution in terms of performance? I also guess it may not be very portable, right?
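To at least make the dependency explicit, the typedef could be guarded at build time. This is only a sketch: __SIZEOF_INT128__ is predefined by GCC and Clang when the extension is available, but whether a given OpenCL compiler defines it is implementation-specific.

    /* build-time guard: fail loudly when the __int128 extension is missing */
    #ifdef __SIZEOF_INT128__
    typedef unsigned __int128 uint128_t;
    #else
    #error "no 128-bit integer extension; an emulated hi/lo ulong pair would be needed"
    #endif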
For completeness,

    #define UINT128_MAX (~(uint128_t)0)

is the maximum value for an object of type uint128_t.
I also needed to calculate the number of trailing zeros (ctz), which I solved as follows:
    /* count trailing zeros of a 64-bit value; for n == 0 the result
     * wraps around to ~0UL, because OpenCL's clz(0) returns 64 */
    size_t ctzl(unsigned long n)
    {
        return 63 - clz(n & -n);
    }

    /* count trailing zeros of a 128-bit value */
    size_t ctzu128(uint128_t n)
    {
        size_t a = ctzl(n); /* implicitly truncates to the low 64 bits */
        if (a == ~0UL) { /* the low half is all zeros */
            return 64 + ctzl(n >> 64);
        }
        return a;
    }
My concern here is that this could be implemented more simply.
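If the target device supports OpenCL C 2.0, the built-in ctz should allow something simpler along these lines (a sketch, not benchmarked; for a zero argument, ctz returns the bit width of its type, i.e. 64 for ulong):

    /* sketch assuming OpenCL C 2.0: the built-in ctz() replaces the
     * clz()-based workaround above */
    size_t ctzu128_v2(uint128_t n)
    {
        size_t a = ctz((ulong)n); /* low 64 bits */
        return (a == 64) ? 64 + ctz((ulong)(n >> 64)) : a;
    }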
The last building block I need is the n-th power of three:
    /* n-th power of three, computed by binary exponentiation */
    uint128_t pow3(size_t n)
    {
        uint128_t r = 1;
        uint128_t b = 3;
        while (n) {
            if (n & 1) {
                r *= b;
            }
            b *= b; /* square the base, halve the exponent */
            n >>= 1;
        }
        return r;
    }
All \$ 3^n \$ for \$ n < 81 \$ fit into the uint128_t type: \$ 3^{80} \approx 1.48 \cdot 10^{38} \$ is just below \$ 2^{128} \approx 3.40 \cdot 10^{38} \$, while \$ 3^{81} \$ already overflows. Therefore, I have the following macro defined:

    #define LUT_SIZE128 81
Code
My code verifies the convergence of the Collatz problem using this algorithm. The range of 2^task_size numbers starting at task_id * 2^task_size is divided evenly among the threads. The threads do not communicate with each other. When the calculation is complete, each thread stores its results (a partial checksum and the number of uint128_t overflows) in global memory.
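For example, assuming 2^task_units work items are launched, work item id covers the sub-range of 2^(task_size - task_units) integers starting at task_id * 2^task_size + id * 2^(task_size - task_units); this matches the n_ and n_sup_ bounds computed in the kernel below. The + 3 offset and the stride of 4 mean that only numbers congruent to 3 modulo 4 are actually tested, since every other residue class falls below its starting value after at most a few steps.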
My current solution is as follows:
    __kernel void worker(
        __global unsigned long *overflow_counter,
        __global unsigned long *checksum_alpha,
        unsigned long task_id,
        unsigned long task_size,
        unsigned long task_units)
    {
        unsigned long private_overflow_counter = 0;
        unsigned long private_checksum_alpha = 0;
        size_t id = get_global_id(0);
        uint128_t lut[LUT_SIZE128];
        unsigned long i;

        /* every thread recalculates the table of powers of three */
        for (i = 0; i < LUT_SIZE128; ++i) {
            lut[i] = pow3(i);
        }

        /* bounds of this thread's sub-range */
        uint128_t n_ = ((uint128_t)task_id << task_size) + ((uint128_t)(id + 0) << (task_size - task_units)) + 3;
        uint128_t n_sup_ = ((uint128_t)task_id << task_size) + ((uint128_t)(id + 1) << (task_size - task_units)) + 3;

        for (; n_ < n_sup_; n_ += 4) {
            uint128_t n = n_, n0 = n_;
            do {
                n++;
                size_t alpha = ctzu128(n);
                private_checksum_alpha += alpha;
                n >>= alpha;
                /* 3^alpha < 2^(2*alpha), so this catches any possible
                 * overflow of the multiplication below */
                if (n > UINT128_MAX >> 2*alpha || alpha >= LUT_SIZE128) {
                    private_overflow_counter++;
                    break;
                }
                n *= lut[alpha];
                n--;
                n >>= ctzu128(n);
                if (n < n0) { /* the trajectory dropped below its start */
                    break;
                }
            } while (1);
        }

        overflow_counter[id] = private_overflow_counter;
        checksum_alpha[id] = private_checksum_alpha;
    }
I see a speedup of more than two orders of magnitude compared to the CPU implementation. My main concern regarding performance is that adjacent threads do not follow the same code path (the control flow is not coherent across the threads of a processor). However, I am not sure how big this problem actually is.
1 Answer
- The lookup table lut should be hard-coded in the source code and defined outside the kernel function as a __constant-space global variable. As it is now, every thread has to recalculate the entire table, and it also takes too much space in the thread's private memory. Alternatively, maybe pre-calculate it on the host and pass it to the kernel as a __constant-space argument (see the host-side sketch after this list).
- The outer for loop should be reformulated so that every thread has the same values for the counter variable and the number of iterations is a compile-time constant. Then applying #pragma unroll may accelerate it (a sketch of this follows below as well).
- When code diverges (on NVIDIA), it can become as slow as if the diverging section (i.e., the inner do loop) were executed sequentially by the threads within each warp (= 32 adjacent threads). Maybe somehow change it so that there is no divergence inside warps.
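A minimal host-side sketch of the pre-calculation variant from the first point, in plain C with the OpenCL API. Here context and kernel are assumed to exist already, the argument index 5 is hypothetical, error handling is omitted, and the host's unsigned __int128 layout is assumed to match the device's uint128_t:

    /* fill the table of 3^i once on the host */
    unsigned __int128 lut[81];
    lut[0] = 1;
    for (int i = 1; i < 81; ++i) {
        lut[i] = 3 * lut[i - 1];
    }

    /* read-only buffer, initialized from host memory at creation */
    cl_mem lut_buf = clCreateBuffer(context,
            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
            sizeof lut, lut, NULL);
    clSetKernelArg(kernel, 5, sizeof (cl_mem), &lut_buf);

The kernel would then declare the matching parameter as __constant uint128_t *lut and drop the per-thread pow3 loop.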
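And a sketch of the loop reformulation from the second point. ITERS_PER_ITEM is a made-up macro assumed to be supplied at build time (e.g. via the clBuildProgram options string "-D ITERS_PER_ITEM=..."), so that the trip count is a compile-time constant and identical for every thread:

    #pragma unroll 4 /* unroll factor is a tuning knob; OpenCL C 2.0 also
                        offers __attribute__((opencl_unroll_hint(4))) */
    for (unsigned long k = 0; k < ITERS_PER_ITEM; ++k) {
        uint128_t n0 = n_ + ((uint128_t)k << 2); /* same stride of 4 as before */
        /* ... inner do-while loop, unchanged, starting from n = n0 ... */
    }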
Edit:
- Maybe hard-code the LUT and put it into __global memory instead. The GPU would put it into global memory anyway, because it is too large for the per-thread private memory, and that way the values don't have to be calculated inside the kernel.
- If that does not accelerate it, maybe make a copy of it in __local memory (per work group): copy it from global to local inside the kernel, then use the shared lut. Do this copy in a coalesced manner, using multiple work items; for example, on NVIDIA, each n-th work item of 32 must access the n-th item of 32 from the global-memory table.
- Take the first (always-executed) iteration of the inner do loop outside the loop, and execute the next iterations (in a loop) only if at least one work item in the subgroup needs them, using the subgroup functions (see the sketch after this list).
- Use an OpenCL profiler (one exists for AMD) to see where the performance losses are, or port it to CUDA and use the NVIDIA Visual Profiler (from the CUDA SDK).
- Pre-calculate whether 128-bit integers are needed for each work item, or whether 64-bit (or 32-bit) ones are sufficient. Then use 128-bit integers only when some work item in the subgroup needs them.
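A rough sketch of the subgroup idea, assuming the cl_khr_subgroups extension is available (sub_group_any returns non-zero if its predicate is non-zero for any work item in the subgroup); the variables are the ones from the inner loop of the kernel above:

    /* the whole subgroup keeps iterating in lockstep while any of its
     * work items still has an unfinished trajectory, so the loop branch
     * itself never diverges */
    int active = 1;
    do {
        if (active) {
            n++;
            size_t alpha = ctzu128(n);
            private_checksum_alpha += alpha;
            n >>= alpha;
            if (n > UINT128_MAX >> 2*alpha || alpha >= LUT_SIZE128) {
                private_overflow_counter++;
                active = 0;
            } else {
                n *= lut[alpha];
                n--;
                n >>= ctzu128(n);
                if (n < n0) {
                    active = 0; /* done, but stay in step with the subgroup */
                }
            }
        }
    } while (sub_group_any(active));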
- Thank you for the answer. This is exactly what I needed! Unfortunately, neither of these points led to acceleration (all resulted in either a small or a significant slowdown). It is probably important to note that the innermost do-while loop is executed only once in most cases, in fewer cases twice, in even fewer cases three times, etc. So the divergence after the first iteration is probably (?) resolved quite quickly. Also, rewriting the code so that the for loop is controlled by exactly the same counter values does not help; a particular value probably doesn't matter. (DaBler, Oct 16, 2019 at 7:31)
- The __constant space leads to the most significant slowdown. This is probably because all threads access the cache hierarchy at the same time. Leaving the LUT in on-chip memory is much faster. Maybe there's something else I could try? (DaBler, Oct 16, 2019 at 7:34)
- @DaBler updated the answer (tmlen, Oct 16, 2019 at 8:21)
- Placing the LUT into __local memory really does help a bit: the acceleration is about 3% on a Tesla K20Xm and about 1% on a GeForce GTX 1050 Ti. It is not much, but after a long time, finally something that helped. However, initializing the LUT from global __constant memory slows the program down significantly. I guess that calculating pow3 is much faster than accessing global memory. (DaBler, Oct 18, 2019 at 13:40)
- I also guess that the kernel always accesses the global memory in a coalesced manner. The access pattern is like array[get_global_id(0)]. Is that right? (DaBler, Oct 18, 2019 at 13:49)