I have a C function that generates a range from the given start, step_size and end values. I use this function in a .NET app via P/Invoke, so I want to optimize it as much as possible. I am a novice SIMD programmer, but intermediate in C and x86. I have written a SIMD version of it, but I am not sure it is well implemented and wonder if it can be optimized further or written more efficiently. In essence, I need a code review for the SIMD part.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <immintrin.h>

typedef struct NativeFixedDArray_s
{
    int length;
    double* memory_block;
} NativeFixedDArray_t;

// Round num up to the next multiple of alignment (alignment must be a power of two).
uint64_t align(uint64_t num, uint64_t alignment) {
    return (num + alignment - 1) & ~(alignment - 1);
}
NativeFixedDArray_t generate_range(double start, double step_size, double end)
{
    int elementsCount = (int)round((end - start) / step_size) + 1;
    double* data = malloc(elementsCount * sizeof(double));
    if (!data)
        return (NativeFixedDArray_t) { 0, 0 };
    for (int i = 0; i < elementsCount - 1; i++)
    {
        data[i] = start;
        start += step_size;
    }
    data[elementsCount - 1] = end;
    return (NativeFixedDArray_t) { elementsCount, data };
}
NativeFixedDArray_t generate_range_simd(double start, double step_size, double end)
{
    // Calculate the number of elements that will be included in the array
    // using the start, step_size and end values.
    int elementsCount = (int)round((end - start) / step_size) + 1;
    // Round the number of elements up to a multiple of 4 for SIMD.
    int alignedElementsCount = (int)align(elementsCount, 4);
    double* data = malloc(alignedElementsCount * sizeof(double));
    if (!data)
        return (NativeFixedDArray_t) { 0, 0 };
    // I need a review for this section.
    __m256d start_v = _mm256_set1_pd(start);
    for (int i = 0; i < alignedElementsCount; i += 4)
    {
        __m256d temp_v = _mm256_add_pd(start_v, _mm256_set_pd(step_size * (i + 3),
                                                              step_size * (i + 2),
                                                              step_size * (i + 1),
                                                              step_size * (i + 0)));
        _mm256_store_pd(&data[i], temp_v);
    }
    data[elementsCount - 1] = end;
    return (NativeFixedDArray_t) { elementsCount, data };
}
// Here is a driver program to test the functions.
int main()
{
    NativeFixedDArray_t range1 = generate_range(0, 0.0002, 4.5);
    NativeFixedDArray_t range2 = generate_range_simd(0, 0.0002, 4.5);
    for (int i = 0; i < range1.length; ++i)
    {
        double value1 = range1.memory_block[i];
        double value2 = range2.memory_block[i];
        if (value1 != value2)
        {
            printf("Not equal!\n");
            break;
        }
        printf("range1[%d]: %lf, range2[%d]: %lf\n", i, value1, i, value2);
    }
    free(range1.memory_block);
    free(range2.memory_block);
    return 0;
}
1 Answer
The set functions such as _mm256_set_pd are not efficient when the inputs are variable. That can be OK to do once, but here it's in a loop, and GCC 13.1 (with -O2 -mavx2) generated this code:
lea eax, [rsi+2]
lea ecx, [rsi+1]
vmovd xmm5, esi
lea edi, [rsi+3]
vmovd xmm3, eax
vpinsrd xmm1, xmm5, ecx, 1
vpinsrd xmm3, xmm3, edi, 1
vpunpcklqdq xmm1, xmm1, xmm3
vcvtdq2pd ymm1, xmm1
vmulpd ymm1, ymm1, ymm4
GCC was a bit clever here, vectorizing the multiplication, but not clever enough: this is still expensive. Other compilers may do different things, but they don't do a particularly good job with this either.
As alternatives, you could consider:
- Initializing start_v to 4 successive values (instead of 4 equal values) and adding step_size * 4 to it in every iteration. This is essentially equivalent to how the scalar implementation works.
- Using the same integer-index-to-float conversion, multiplied by the step size, but doing the whole thing with vector operations (see the sketch after this list). This is more expensive than the first option, but still better than the current code, and this way there is no build-up of floating point inaccuracy as there is from adding floats in a loop.
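A minimal sketch of what the second option could look like, assuming FMA is available in addition to AVX (with -mavx2 alone, the _mm256_fmadd_pd would need to be split into a _mm256_mul_pd and a _mm256_add_pd); the variable names here are only for illustration:

    __m128i idx_v         = _mm_setr_epi32(0, 1, 2, 3);  // current indices i .. i+3
    const __m128i four_v  = _mm_set1_epi32(4);
    const __m256d start_v = _mm256_set1_pd(start);
    const __m256d step_v  = _mm256_set1_pd(step_size);
    for (int i = 0; i < alignedElementsCount; i += 4)
    {
        __m256d i_pd = _mm256_cvtepi32_pd(idx_v);               // (double)i .. (double)(i + 3)
        __m256d x_v  = _mm256_fmadd_pd(i_pd, step_v, start_v);  // start + step_size * i
        _mm256_storeu_pd(&data[i], x_v);
        idx_v = _mm_add_epi32(idx_v, four_v);                   // advance indices to the next group of 4
    }

Because the running state stays in the integer domain, each element is computed fresh as start + step_size * i, so no rounding error accumulates across iterations.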
When using the first option, you may notice that it's not as fast as it looks; that's because the floating point addition is on the critical path. That can be worked around by unrolling the loop by a factor of 2 to 4 (you can go higher, but there shouldn't be a benefit), using as many separate accumulators as the unroll factor. As an example, after unrolling by 2:
__m256d x0_v = _mm256_set_pd(
    start + step_size * 3,
    start + step_size * 2,
    start + step_size * 1,
    start + step_size * 0);
__m256d x1_v = _mm256_set_pd(
    start + step_size * 7,
    start + step_size * 6,
    start + step_size * 5,
    start + step_size * 4);
__m256d step_size_v = _mm256_set1_pd(step_size * 8);
// Note: this loop advances by 8 per iteration, so the element count (and the
// allocation) must now be rounded up to a multiple of 8, e.g. align(elementsCount, 8).
for (int i = 0; i < alignedElementsCount; i += 8)
{
    _mm256_storeu_pd(&data[i], x0_v);
    _mm256_storeu_pd(&data[i + 4], x1_v);
    x0_v = _mm256_add_pd(x0_v, step_size_v);
    x1_v = _mm256_add_pd(x1_v, step_size_v);
}
By the way, I changed the stores to unaligned, because malloc'ed memory may not be sufficiently aligned for _mm256_store_pd. It would be better to use an aligned allocation such as _mm_malloc or aligned_alloc.
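For example, a sketch using C11 aligned_alloc (aligned_alloc requires the size to be a multiple of the alignment, which the rounded-up element count already guarantees here):

    // 32-byte alignment matches the width of a __m256d vector; since
    // alignedElementsCount is a multiple of 4, the size is a multiple of 32.
    double* data = aligned_alloc(32, alignedElementsCount * sizeof(double));
    if (!data)
        return (NativeFixedDArray_t) { 0, 0 };

Memory from aligned_alloc is released with plain free(), whereas _mm_malloc must be paired with _mm_free (and on MSVC, which lacks aligned_alloc, _aligned_malloc/_aligned_free would be the alternative).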