This is my first relevant code in OpenCL. Please, let me know if I'm doing something wrong in a logical level (i.e., I guess it can crash in situations?) or accounting for performance (bottlenecks, bad use of resources, misunderstanding some concepts, etc).
void accumulatedSum(
volatile __global u32* completeBlocks,
volatile __global u32* input,
volatile __global u32* output,
__local u32* loc) {
// Receives an input such as: [1, 3, 0, 0, 4, 7, 0, 0, 3, 4, 1, 1]
// Computes the accumulated sum: [1, 4, 4, 4, 8, 15, 15, 15, 18, 22, 23, 24]
// Always call it with global_work_size = num_of_elements / 2
// Works like this: http://imgur.com/3lm3CAq
unsigned int k, a;
int lid = get_local_id(0);
int gid = get_global_id(0);
int groupId = get_group_id(0);
int localPasses = get_local_size(0);
int globalPasses = get_num_groups(0)*2;
loc[lid*2] = input[gid*2];
loc[lid*2+1] = input[gid*2+1];
barrier(CLK_LOCAL_MEM_FENCE);
for (a=0; localPasses>>=1; ++a){
k = (lid >> a << (a+1)) + (1<<a);
loc[k + (lid&((1<<a)-1))] = loc[k - 1] + loc[k + (lid&((1<<a)-1))];
barrier(CLK_LOCAL_MEM_FENCE);
};
output[gid*2] = loc[lid*2];
output[gid*2+1] = loc[lid*2+1];
for (; globalPasses>>=1; ++a){
barrier(CLK_GLOBAL_MEM_FENCE);
while (*completeBlocks != 0);
if (lid == 0) atomic_inc(completeBlocks);
while (*completeBlocks < get_num_groups(0));
if (lid == 0 && gid == 0) atomic_xchg(completeBlocks,0);
k = (gid >> a << (a+1)) + (1<<a);
output[k + (lid&((1<<a)-1))] = output[k - 1] + output[k + (gid&((1<<a)-1))];
};
};
If you have clever ideas in how to make it more readable please let me know, although that is not the focus.
1 Answer 1
What are
k
anda
? It looks likea
is a loop counter variable, in which case it can be renamed toi
and moved to the front of the firstfor
loop. As fork
, rename it such that it conveys its purpose. With the exception of loop counter variables such asi
, variables shouldn't be a single character.Many of the lines in the
for
loops are difficult to read, especially with all the bit-shifting. Depending on the type of shift, the optimizing compiler may be able to replace some arithmetic operation with an appropriate bitshift. But for others, it would help to leave comments to explain what's being done for each complex operation.This is quite crammed together:
for (; globalPasses>>=1; ++a){ barrier(CLK_GLOBAL_MEM_FENCE); while (*completeBlocks != 0); if (lid == 0) atomic_inc(completeBlocks); while (*completeBlocks < get_num_groups(0)); if (lid == 0 && gid == 0) atomic_xchg(completeBlocks,0); k = (gid >> a << (a+1)) + (1<<a); output[k + (lid&((1<<a)-1))] = output[k - 1] + output[k + (gid&((1<<a)-1))]; };
You should isolate the
while
loops so that it's easy to see that they're busy-waits. You could also consider using curly braces for theif
statements. Finally, consider adding whitespace between the operators for more readability.for (; globalPasses >>= 1; ++a) { barrier(CLK_GLOBAL_MEM_FENCE); while (*completeBlocks != 0); if (lid == 0) { atomic_inc(completeBlocks); } while (*completeBlocks < get_num_groups(0)); if (lid == 0 && gid == 0) { atomic_xchg(completeBlocks, 0); } k = (gid >> a << (a + 1)) + (1 << a); output[k + (lid&((1 << a) - 1))] = output[k - 1] + output[k + (gid&((1 < <a) - 1))]; };
__kernel
if it is a kernel? \$\endgroup\$