I have a C function that generates a range from the given start, step_size and end values. I use this function in a .NET app via P/Invoke, so I want to optimize it as much as possible. I am a novice SIMD programmer, but intermediate in C and x86. I have written a SIMD version of it, but I am not sure it is well implemented and wonder if it can be optimized further or written more efficiently. In essence, I need a code review for the SIMD part.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <immintrin.h>

typedef struct NativeFixedDArray_s
{
    int length;
    double* memory_block;
} NativeFixedDArray_t;

// Round num up to the next multiple of alignment (alignment must be a power of two).
uint64_t align(uint64_t num, uint64_t alignment) {
    return (num + alignment - 1) & ~(alignment - 1);
}
NativeFixedDArray_t generate_range(double start, double step_size, double end)
{
    int elementsCount = (int)round((end - start) / step_size) + 1;
    double* data = malloc(elementsCount * sizeof(double));
    if (!data)
        return (NativeFixedDArray_t) { 0, 0 };
    for (int i = 0; i < elementsCount - 1; i++)
    {
        data[i] = start;
        start += step_size;
    }
    data[elementsCount - 1] = end;
    return (NativeFixedDArray_t) { elementsCount, data };
}
NativeFixedDArray_t generate_range_simd(double start, double step_size, double end)
{
    // Calculate the number of elements that will be included in the array
    // using the start, step_size and end values.
    int elementsCount = (int)round((end - start) / step_size) + 1;
    // Round the number of elements up to a multiple of 4 for SIMD.
    int alignedElementsCount = (int)align(elementsCount, 4);
    double* data = malloc(alignedElementsCount * sizeof(double));
    if (!data)
        return (NativeFixedDArray_t) { 0, 0 };
    // I need a review for this section.
    __m256d start_v = _mm256_set1_pd(start);
    for (int i = 0; i < alignedElementsCount; i += 4)
    {
        __m256d temp_v = _mm256_add_pd(start_v, _mm256_set_pd(step_size * (i + 3),
                                                              step_size * (i + 2),
                                                              step_size * (i + 1),
                                                              step_size * (i + 0)));
        _mm256_store_pd(&data[i], temp_v);
    }
    data[elementsCount - 1] = end;
    return (NativeFixedDArray_t) { elementsCount, data };
}
// Here is a driver program to test the functions.
int main()
{
    NativeFixedDArray_t range1 = generate_range(0, 0.0002, 4.5);
    NativeFixedDArray_t range2 = generate_range_simd(0, 0.0002, 4.5);
    for (int i = 0; i < range1.length; ++i)
    {
        double value1 = range1.memory_block[i];
        double value2 = range2.memory_block[i];
        if (value1 != value2)
        {
            printf("Not equal!\n");
            break;
        }
        printf("range1[%d]: %lf, range2[%d]: %lf\n", i, value1, i, value2);
    }
    free(range1.memory_block);
    free(range2.memory_block);
    return 0;
}
1 Answer
The set functions such as _mm256_set_pd are not efficient when the inputs are variable. That can be OK to do once, but here it's in a loop, and GCC 13.1 (with -O2 -mavx2) generated this code:
lea eax, [rsi+2]
lea ecx, [rsi+1]
vmovd xmm5, esi
lea edi, [rsi+3]
vmovd xmm3, eax
vpinsrd xmm1, xmm5, ecx, 1
vpinsrd xmm3, xmm3, edi, 1
vpunpcklqdq xmm1, xmm1, xmm3
vcvtdq2pd ymm1, xmm1
vmulpd ymm1, ymm1, ymm4
GCC was a bit clever here, vectorizing the multiplication, but not clever enough: this is still expensive. Other compilers may do different things, but they don't do a particularly good job with this either.
As alternatives, you could consider:
- Initializing start_v to 4 successive values (instead of 4 equal values) and adding step_size * 4 to it in every iteration. This is essentially equivalent to how the scalar implementation works.
- Using the same integer-index-to-float conversion, multiplied by the step size, but doing the whole thing with vector operations (see the sketch after this list). This is more expensive than the first option, but still better than the current code, and this way there is no build-up of floating point inaccuracy as there is from adding floats in a loop.
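A minimal sketch of what the second option could look like, assuming FMA is available in addition to AVX (with -mavx2 alone, the _mm256_fmadd_pd would need to be split into a _mm256_mul_pd and a _mm256_add_pd); the variable names here are only for illustration:

    __m128i idx_v         = _mm_setr_epi32(0, 1, 2, 3);  // current indices i .. i+3
    const __m128i four_v  = _mm_set1_epi32(4);
    const __m256d start_v = _mm256_set1_pd(start);
    const __m256d step_v  = _mm256_set1_pd(step_size);
    for (int i = 0; i < alignedElementsCount; i += 4)
    {
        __m256d i_pd = _mm256_cvtepi32_pd(idx_v);               // (double)i .. (double)(i + 3)
        __m256d x_v  = _mm256_fmadd_pd(i_pd, step_v, start_v);  // start + step_size * i
        _mm256_storeu_pd(&data[i], x_v);
        idx_v = _mm_add_epi32(idx_v, four_v);                   // advance indices to the next group of 4
    }

Because the running state stays in the integer domain, each element is computed fresh as start + step_size * i, so no rounding error accumulates across iterations.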
When using the first option, you may notice that it's not as fast as it looks; that's because the floating point addition is on the critical path. That can be worked around by unrolling the loop by a factor of 2 to 4 (you can go higher, but there shouldn't be a benefit), using as many separate accumulators as the unroll factor. As an example, after unrolling by 2:
__m256d x0_v = _mm256_set_pd(
    start + step_size * 3,
    start + step_size * 2,
    start + step_size * 1,
    start + step_size * 0);
__m256d x1_v = _mm256_set_pd(
    start + step_size * 7,
    start + step_size * 6,
    start + step_size * 5,
    start + step_size * 4);
__m256d step_size_v = _mm256_set1_pd(step_size * 8);
// Note: this loop advances by 8 per iteration, so the element count (and the
// allocation) must now be rounded up to a multiple of 8, e.g. align(elementsCount, 8).
for (int i = 0; i < alignedElementsCount; i += 8)
{
    _mm256_storeu_pd(&data[i], x0_v);
    _mm256_storeu_pd(&data[i + 4], x1_v);
    x0_v = _mm256_add_pd(x0_v, step_size_v);
    x1_v = _mm256_add_pd(x1_v, step_size_v);
}
By the way, I changed the stores to unaligned, because malloc'ed memory may not be sufficiently aligned for _mm256_store_pd. It would be better to use an aligned allocation such as _mm_malloc or aligned_alloc.
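For example, a sketch using C11 aligned_alloc (aligned_alloc requires the size to be a multiple of the alignment, which the rounded-up element count already guarantees here):

    // 32-byte alignment matches the width of a __m256d vector; since
    // alignedElementsCount is a multiple of 4, the size is a multiple of 32.
    double* data = aligned_alloc(32, alignedElementsCount * sizeof(double));
    if (!data)
        return (NativeFixedDArray_t) { 0, 0 };

Memory from aligned_alloc is released with plain free(), whereas _mm_malloc must be paired with _mm_free (and on MSVC, which lacks aligned_alloc, _aligned_malloc/_aligned_free would be the alternative).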