Parallel accumulated sum in OpenCL

Question 1

This is my first relevant code in OpenCL. Please, let me know if I'm doing something wrong in a logical level (i.e., I guess it can crash in situations?) or accounting for performance (bottlenecks, bad use of resources, misunderstanding some concepts, etc).

void accumulatedSum(
 volatile __global u32* completeBlocks, 
 volatile __global u32* input, 
 volatile __global u32* output, 
 __local u32* loc) {
 // Receives an input such as: [1, 3, 0, 0, 4, 7, 0, 0, 3, 4, 1, 1]
 // Computes the accumulated sum: [1, 4, 4, 4, 8, 15, 15, 15, 18, 22, 23, 24]
 // Always call it with global_work_size = num_of_elements / 2
 // Works like this: http://imgur.com/3lm3CAq
 unsigned int k, a;
 int lid = get_local_id(0);
 int gid = get_global_id(0);
 int groupId = get_group_id(0);
 int localPasses = get_local_size(0);
 int globalPasses = get_num_groups(0)*2;
 loc[lid*2] = input[gid*2];
 loc[lid*2+1] = input[gid*2+1];
 barrier(CLK_LOCAL_MEM_FENCE);
 for (a=0; localPasses>>=1; ++a){
 k = (lid >> a << (a+1)) + (1<<a);
 loc[k + (lid&((1<<a)-1))] = loc[k - 1] + loc[k + (lid&((1<<a)-1))];
 barrier(CLK_LOCAL_MEM_FENCE);
 };
 output[gid*2] = loc[lid*2];
 output[gid*2+1] = loc[lid*2+1];
 for (; globalPasses>>=1; ++a){
 barrier(CLK_GLOBAL_MEM_FENCE);
 while (*completeBlocks != 0);
 if (lid == 0) atomic_inc(completeBlocks);
 while (*completeBlocks < get_num_groups(0));
 if (lid == 0 && gid == 0) atomic_xchg(completeBlocks,0);
 k = (gid >> a << (a+1)) + (1<<a);
 output[k + (lid&((1<<a)-1))] = output[k - 1] + output[k + (gid&((1<<a)-1))];
 };
};

If you have clever ideas in how to make it more readable please let me know, although that is not the focus.

Question 2

Could you perhaps add some surrounding code to make it executable? OpenCL is a hard beast to review, I'm assuming this is the code of a Compute Shader?

Question 3

Shouldn't the method definition start with __kernel if it is a kernel?

Question 4

What are k and a? It looks like a is a loop counter variable, in which case it can be renamed to i and moved to the front of the first for loop. As for k, rename it such that it conveys its purpose. With the exception of loop counter variables such as i, variables shouldn't be a single character.
Many of the lines in the for loops are difficult to read, especially with all the bit-shifting. Depending on the type of shift, the optimizing compiler may be able to replace some arithmetic operation with an appropriate bitshift. But for others, it would help to leave comments to explain what's being done for each complex operation.

This is quite crammed together:

for (; globalPasses>>=1; ++a){
 barrier(CLK_GLOBAL_MEM_FENCE);
 while (*completeBlocks != 0);
 if (lid == 0) atomic_inc(completeBlocks);
 while (*completeBlocks < get_num_groups(0));
 if (lid == 0 && gid == 0) atomic_xchg(completeBlocks,0);
 k = (gid >> a << (a+1)) + (1<<a);
 output[k + (lid&((1<<a)-1))] = output[k - 1] + output[k + (gid&((1<<a)-1))];
};

You should isolate the while loops so that it's easy to see that they're busy-waits. You could also consider using curly braces for the if statements. Finally, consider adding whitespace between the operators for more readability.

for (; globalPasses >>= 1; ++a) {
 barrier(CLK_GLOBAL_MEM_FENCE);
 while (*completeBlocks != 0);
 if (lid == 0) {
 atomic_inc(completeBlocks);
 }
 while (*completeBlocks < get_num_groups(0));
 if (lid == 0 && gid == 0) {
 atomic_xchg(completeBlocks, 0);
 }
 k = (gid >> a << (a + 1)) + (1 << a);
 output[k + (lid&((1 << a) - 1))] = 
 output[k - 1] + output[k + (gid&((1 < <a) - 1))];
};

Jamal Jamal 35.2k13 gold badges134 silver badges238 bronze badges · Answer 1 · 2015-12-02 21:14:29Z

What are k and a? It looks like a is a loop counter variable, in which case it can be renamed to i and moved to the front of the first for loop. As for k, rename it such that it conveys its purpose. With the exception of loop counter variables such as i, variables shouldn't be a single character.
Many of the lines in the for loops are difficult to read, especially with all the bit-shifting. Depending on the type of shift, the optimizing compiler may be able to replace some arithmetic operation with an appropriate bitshift. But for others, it would help to leave comments to explain what's being done for each complex operation.

This is quite crammed together:

for (; globalPasses>>=1; ++a){
 barrier(CLK_GLOBAL_MEM_FENCE);
 while (*completeBlocks != 0);
 if (lid == 0) atomic_inc(completeBlocks);
 while (*completeBlocks < get_num_groups(0));
 if (lid == 0 && gid == 0) atomic_xchg(completeBlocks,0);
 k = (gid >> a << (a+1)) + (1<<a);
 output[k + (lid&((1<<a)-1))] = output[k - 1] + output[k + (gid&((1<<a)-1))];
};

You should isolate the while loops so that it's easy to see that they're busy-waits. You could also consider using curly braces for the if statements. Finally, consider adding whitespace between the operators for more readability.

for (; globalPasses >>= 1; ++a) {
 barrier(CLK_GLOBAL_MEM_FENCE);
 while (*completeBlocks != 0);
 if (lid == 0) {
 atomic_inc(completeBlocks);
 }
 while (*completeBlocks < get_num_groups(0));
 if (lid == 0 && gid == 0) {
 atomic_xchg(completeBlocks, 0);
 }
 k = (gid >> a << (a + 1)) + (1 << a);
 output[k + (lid&((1 << a) - 1))] = 
 output[k - 1] + output[k + (gid&((1 < <a) - 1))];
};

Stack Exchange Network

Parallel accumulated sum in OpenCL

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Parallel accumulated sum in OpenCL

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions