This code review request follows up on my previous request, Computational verification of Collatz conjecture. Unlike the previous program (which was designed for the CPU), this code should run on modern GPUs. For this purpose, I chose OpenCL.
Prerequisites
Since the verification program needs to deal with 128-bit arithmetic, the first thing I had to solve was the availability of a 128-bit integer type. So far, I rely on the __int128 compiler extension:

    typedef unsigned __int128 uint128_t;
At this point, my first question arises: how efficient is this solution in terms of performance? I also guess it may not be very portable, right?
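To at least make the dependency explicit, the typedef could be guarded at build time. This is only a sketch: __SIZEOF_INT128__ is predefined by GCC and Clang when the extension is available, but whether a given OpenCL compiler defines it is implementation-specific.

    /* build-time guard: fail loudly when the __int128 extension is missing */
    #ifdef __SIZEOF_INT128__
    typedef unsigned __int128 uint128_t;
    #else
    #error "no 128-bit integer extension; an emulated hi/lo ulong pair would be needed"
    #endif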
For completeness,

    #define UINT128_MAX (~(uint128_t)0)

is the maximum value for an object of type uint128_t.
I also needed to calculate the number of trailing zeros (ctz), which I solved as follows:
    /* count trailing zeros of a 64-bit value; for n == 0 the result
     * wraps around to ~0UL, because OpenCL's clz(0) returns 64 */
    size_t ctzl(unsigned long n)
    {
        return 63 - clz(n & -n);
    }

    /* count trailing zeros of a 128-bit value */
    size_t ctzu128(uint128_t n)
    {
        size_t a = ctzl(n); /* implicitly truncates to the low 64 bits */
        if (a == ~0UL) { /* the low half is all zeros */
            return 64 + ctzl(n >> 64);
        }
        return a;
    }
My concern here is that this could be implemented more simply.
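If the target device supports OpenCL C 2.0, the built-in ctz should allow something simpler along these lines (a sketch, not benchmarked; for a zero argument, ctz returns the bit width of its type, i.e. 64 for ulong):

    /* sketch assuming OpenCL C 2.0: the built-in ctz() replaces the
     * clz()-based workaround above */
    size_t ctzu128_v2(uint128_t n)
    {
        size_t a = ctz((ulong)n); /* low 64 bits */
        return (a == 64) ? 64 + ctz((ulong)(n >> 64)) : a;
    }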
The last building block I need is the n-th power of three:
    /* n-th power of three, computed by binary exponentiation */
    uint128_t pow3(size_t n)
    {
        uint128_t r = 1;
        uint128_t b = 3;
        while (n) {
            if (n & 1) {
                r *= b;
            }
            b *= b; /* square the base, halve the exponent */
            n >>= 1;
        }
        return r;
    }
All \$ 3^n \$ for \$ n < 81 \$ fit into the uint128_t type: \$ 3^{80} \approx 1.48 \cdot 10^{38} \$ is just below \$ 2^{128} \approx 3.40 \cdot 10^{38} \$, while \$ 3^{81} \$ already overflows. Therefore, I have the following macro defined:

    #define LUT_SIZE128 81
Code
My code verifies the convergence of the Collatz problem using this algorithm. The range of 2^task_size numbers starting at task_id * 2^task_size is divided evenly among the threads. The threads do not communicate with each other. When the calculation is complete, each thread stores its results (a partial checksum and the number of uint128_t overflows) in global memory.
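For example, assuming 2^task_units work items are launched, work item id covers the sub-range of 2^(task_size - task_units) integers starting at task_id * 2^task_size + id * 2^(task_size - task_units); this matches the n_ and n_sup_ bounds computed in the kernel below. The + 3 offset and the stride of 4 mean that only numbers congruent to 3 modulo 4 are actually tested, since every other residue class falls below its starting value after at most a few steps.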
My current solution is as follows:
    __kernel void worker(
        __global unsigned long *overflow_counter,
        __global unsigned long *checksum_alpha,
        unsigned long task_id,
        unsigned long task_size,
        unsigned long task_units)
    {
        unsigned long private_overflow_counter = 0;
        unsigned long private_checksum_alpha = 0;
        size_t id = get_global_id(0);
        uint128_t lut[LUT_SIZE128];
        unsigned long i;

        /* every thread recalculates the table of powers of three */
        for (i = 0; i < LUT_SIZE128; ++i) {
            lut[i] = pow3(i);
        }

        /* bounds of this thread's sub-range */
        uint128_t n_ = ((uint128_t)task_id << task_size) + ((uint128_t)(id + 0) << (task_size - task_units)) + 3;
        uint128_t n_sup_ = ((uint128_t)task_id << task_size) + ((uint128_t)(id + 1) << (task_size - task_units)) + 3;

        for (; n_ < n_sup_; n_ += 4) {
            uint128_t n = n_, n0 = n_;
            do {
                n++;
                size_t alpha = ctzu128(n);
                private_checksum_alpha += alpha;
                n >>= alpha;
                /* 3^alpha < 2^(2*alpha), so this catches any possible
                 * overflow of the multiplication below */
                if (n > UINT128_MAX >> 2*alpha || alpha >= LUT_SIZE128) {
                    private_overflow_counter++;
                    break;
                }
                n *= lut[alpha];
                n--;
                n >>= ctzu128(n);
                if (n < n0) { /* the trajectory dropped below its start */
                    break;
                }
            } while (1);
        }

        overflow_counter[id] = private_overflow_counter;
        checksum_alpha[id] = private_checksum_alpha;
    }
I see a speedup of more than two orders of magnitude compared to the CPU implementation. My main concern regarding performance is that adjacent threads do not follow the same code path (the control flow is not coherent across the threads of a processor). However, I am not sure how big this problem actually is.
1 Answer
- The lookup table lut should be hard-coded in the source code and defined outside the kernel function as a __constant-space global variable. As it is now, every thread has to recalculate the entire table, and it also takes too much space in the thread's private memory. Alternatively, maybe pre-calculate it on the host and pass it to the kernel as a __constant-space argument (see the host-side sketch after this list).
- The outer for loop should be reformulated so that every thread has the same values for the counter variable and the number of iterations is a compile-time constant. Then applying #pragma unroll may accelerate it (a sketch of this follows below as well).
- When code diverges (on NVIDIA), it can become as slow as if the diverging section (i.e., the inner do loop) were executed sequentially by the threads within each warp (= 32 adjacent threads). Maybe somehow change it so that there is no divergence inside warps.
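A minimal host-side sketch of the pre-calculation variant from the first point, in plain C with the OpenCL API. Here context and kernel are assumed to exist already, the argument index 5 is hypothetical, error handling is omitted, and the host's unsigned __int128 layout is assumed to match the device's uint128_t:

    /* fill the table of 3^i once on the host */
    unsigned __int128 lut[81];
    lut[0] = 1;
    for (int i = 1; i < 81; ++i) {
        lut[i] = 3 * lut[i - 1];
    }

    /* read-only buffer, initialized from host memory at creation */
    cl_mem lut_buf = clCreateBuffer(context,
            CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
            sizeof lut, lut, NULL);
    clSetKernelArg(kernel, 5, sizeof (cl_mem), &lut_buf);

The kernel would then declare the matching parameter as __constant uint128_t *lut and drop the per-thread pow3 loop.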
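And a sketch of the loop reformulation from the second point. ITERS_PER_ITEM is a made-up macro assumed to be supplied at build time (e.g. via the clBuildProgram options string "-D ITERS_PER_ITEM=..."), so that the trip count is a compile-time constant and identical for every thread:

    #pragma unroll 4 /* unroll factor is a tuning knob; OpenCL C 2.0 also
                        offers __attribute__((opencl_unroll_hint(4))) */
    for (unsigned long k = 0; k < ITERS_PER_ITEM; ++k) {
        uint128_t n0 = n_ + ((uint128_t)k << 2); /* same stride of 4 as before */
        /* ... inner do-while loop, unchanged, starting from n = n0 ... */
    }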
Edit:
- Maybe hard-code the LUT and put it into __global memory instead. The GPU would put it into global memory anyway, because it is too large for the per-thread private memory, and that way the values don't have to be calculated inside the kernel.
- If that does not accelerate it, maybe make a copy of it in __local memory (per work group): copy it from global to local inside the kernel, then use the shared lut. Do this copy in a coalesced manner, using multiple work items; for example, on NVIDIA, each n-th work item of 32 must access the n-th item of 32 from the global-memory table.
- Take the first (always-executed) iteration of the inner do loop outside the loop, and execute the next iterations (in a loop) only if at least one work item in the subgroup needs them, using the subgroup functions (see the sketch after this list).
- Use an OpenCL profiler (one exists for AMD) to see where the performance losses are, or port it to CUDA and use the NVIDIA Visual Profiler (from the CUDA SDK).
- Pre-calculate whether 128-bit integers are needed for each work item, or whether 64-bit (or 32-bit) ones are sufficient. Then use 128-bit integers only when some work item in the subgroup needs them.
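A rough sketch of the subgroup idea, assuming the cl_khr_subgroups extension is available (sub_group_any returns non-zero if its predicate is non-zero for any work item in the subgroup); the variables are the ones from the inner loop of the kernel above:

    /* the whole subgroup keeps iterating in lockstep while any of its
     * work items still has an unfinished trajectory, so the loop branch
     * itself never diverges */
    int active = 1;
    do {
        if (active) {
            n++;
            size_t alpha = ctzu128(n);
            private_checksum_alpha += alpha;
            n >>= alpha;
            if (n > UINT128_MAX >> 2*alpha || alpha >= LUT_SIZE128) {
                private_overflow_counter++;
                active = 0;
            } else {
                n *= lut[alpha];
                n--;
                n >>= ctzu128(n);
                if (n < n0) {
                    active = 0; /* done, but stay in step with the subgroup */
                }
            }
        }
    } while (sub_group_any(active));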
- Thank you for the answer. This is exactly what I needed! Unfortunately, neither of these points led to acceleration (all resulted in either a small or a significant slowdown). It is probably important to note that the innermost do-while loop is executed only once in most cases, in fewer cases twice, in even fewer cases three times, etc. So the divergence after the first iteration is probably (?) resolved quite quickly. Also, rewriting the code so that the for loop is controlled by exactly the same counter values does not help; a particular value probably doesn't matter. (DaBler, Oct 16, 2019 at 7:31)
- The __constant space leads to the most significant slowdown. This is probably because all threads access the cache hierarchy at the same time. Leaving the LUT in on-chip memory is much faster. Maybe there's something else I could try? (DaBler, Oct 16, 2019 at 7:34)
- @DaBler updated the answer (tmlen, Oct 16, 2019 at 8:21)
- Placing the LUT into __local memory really does help a bit: the acceleration is about 3% on a Tesla K20Xm and about 1% on a GeForce GTX 1050 Ti. It is not much, but after a long time, finally something that helped. However, initializing the LUT from global __constant memory slows the program down significantly. I guess that calculating pow3 is much faster than accessing global memory. (DaBler, Oct 18, 2019 at 13:40)
- I also guess that the kernel always accesses the global memory in a coalesced manner. The access pattern is like array[get_global_id(0)]. Is that right? (DaBler, Oct 18, 2019 at 13:49)