I have code in C that multiplies each element of an array by a number (0-9), resulting in a series of base 10 digits.
I compile with
gcc -xc -Ofast -msse2 -flax-vector-conversions -ffast-math -funroll-all-loops --param max-unroll-times=50 -ftree-vectorize -fopt-info-vec-missed
My processor is a Core i7 950. My problem is that this function takes longer to run than I expected (8 seconds in my fastest version). I need it to be faster.
I know that my problem is that GCC can't vectorize the principal loop because of this error:
not vectorized: not suitable for gather load x_60 = table[_59]
How can this code be modified to solve this problem and make the code faster? It's fine for the solution to use intrinsics or other specialized techniques.
Compilable code with different tests here.
My fastest version so far is this:
uint8_t ConstMul(uint8_t *V, size_t N, uint8_t digit){
#define TABLE_SIZE ((9*256 + 9)*9 + 9 + 1)
    static uint32_t table[TABLE_SIZE];
    if(!table[1]){
        #pragma simd
        for(uint32_t i = 0; i < TABLE_SIZE; ++i){
            uint32_t u = i % 256 % 10;
            uint32_t d = (i / 256 + i % 256 / 10) % 10;
            uint32_t c = (i / 256 + i % 256 / 10) / 10;
            table[i] = c | (u << 8) | (d << 16);
        }
    }
    if(N == 0 || digit <= 1){
        if(digit == 0) memset(V, 0, N);
        return 0;
    }else{
        size_t CARRY = 0;
        if((uintptr_t)V & 1){
            int R = V[0] * digit + (uint8_t)CARRY;
            CARRY = (uint8_t)(R / 10);
            V[0] = (uint8_t)(R - CARRY * 10);
            V++;
            N--;
        }
        {
            uint16_t *V2 = (uint16_t *)(void *)V;
            size_t N2 = N / 2;
            for(size_t i = 0; i < N2; ++i){
                uint32_t x = table[V2[i] * digit + CARRY];
                V2[i] = (uint16_t)(x >> 8);
                CARRY = (uint8_t)x;
            }
        }
        if(N & 1){
            int R = V[N-1] * digit + (uint8_t)CARRY;
            CARRY = (uint8_t)(R / 10);
            V[N-1] = (uint8_t)(R - CARRY * 10);
        }
        return (uint8_t)CARRY;
    }
#undef TABLE_SIZE
}
But I also tried these approaches which were slower:
void ConstMult(uint8_t *V, size_t N, uint8_t digit)
{
    uint8_t CARRY = 0;
    for(size_t i = 0; i < N; ++i)
    {
        V[i] = V[i] * digit + CARRY;
        CARRY = ((uint32_t)V[i] * (uint32_t)0xCCCD) >> 19;
        V[i] -= (CARRY << 3) + (CARRY << 1);
    }
}
uint8_t ConstMult(uint8_t *V, size_t N, uint8_t digit)
{
    uint8_t CARRY = 0;
    for(size_t i = 0; i < N; i++)
    {
        char R = V[i] * digit + CARRY;
        CARRY = R / 10;
        R = R - CARRY * 10;
        V[i] = R;
    }
    return CARRY; // may be from 0 to 9
}
uint8_t ConstMult(uint8_t *V, size_t N, uint8_t digit)
{
    uint8_t CARRY = 0;
    uint8_t ja = 0;
    for(size_t i = 0; i < N; ++i){
        uint8_t aux = V[i] * digit;
        uint8_t R = aux + CARRY;
        CARRY = ((uint32_t)R * (uint32_t)0xCCCD) >> 19;
        ja = (CARRY << 3) + 2*CARRY;
        R -= ja;
        V[i] = R;
    }
    return CARRY;
}
1 Answer
Review of your Godbolt-fu: https://godbolt.org/z/doD3Ld
You can change the language from "C++" to "C" via the dropdown in the upper right corner of the source-code-editor pane.
You wanted to be using "gcc (trunk)", not "gcc (modules)", anyway.
The biggest contributor to running time must be that uint8_t digit is being provided as a runtime parameter instead of a compile-time parameter. But your benchmark only ever calls LongNumConstMult with 9, 8, 7, and 3. You should benchmark what happens if you write four different versions of the code: one with static const int digit = 9; at the top, one with static const int digit = 8;, and so on. Maybe that won't meet your design requirements, but it will give you a nice bound on what kind of improvement might be possible.
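One cheap way to run that experiment is to stamp out one copy of the loop per digit with a macro, so the multiplier and the divisor are both compile-time constants. This is only a sketch of the idea; the names (DEFINE_CONST_MULT, ConstMult9) are my own, not from the original post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generate one function per digit. With D a literal, GCC can
   strength-reduce both the multiply and the division by 10, and has a
   much better shot at vectorizing the body. */
#define DEFINE_CONST_MULT(D)                                   \
    uint8_t ConstMult##D(uint8_t *V, size_t N)                 \
    {                                                          \
        uint8_t carry = 0;                                     \
        for (size_t i = 0; i < N; ++i) {                       \
            unsigned r = V[i] * D + carry;                     \
            carry = (uint8_t)(r / 10); /* constant divisor */  \
            V[i]  = (uint8_t)(r % 10);                         \
        }                                                      \
        return carry;                                          \
    }

DEFINE_CONST_MULT(3)
DEFINE_CONST_MULT(7)
DEFINE_CONST_MULT(8)
DEFINE_CONST_MULT(9)
```

For example, with V holding {9, 9, 9} least-significant digit first, ConstMult9 leaves {1, 9, 9} in V and returns a carry of 8 (999 * 9 = 8991).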
I infer that maybe you only need to handle 10 different digits. In that case, you could implement the runtime-parameterized LongNumConstMult as
void LongNumConstMult(uint8_t *V, size_t N, uint8_t digit)
{
    switch (digit) {
        case 0: return LongNumSetTo0(V, N);
        case 1: return; // no-op
        case 2: return LongNumConstMult2(V, N);
        case 3: return LongNumConstMult3(V, N);
        [...]
        case 8: return LongNumConstMult8(V, N);
        case 9: return LongNumConstMult9(V, N);
    }
}
I predict that "one branch at the beginning, followed by many constant multiplications in a loop" might well beat "many non-constant multiplications in a loop."
"It's fine for the solution to use intrinsics or other specialized techniques"
What about making V an array of uint16_t, uint32_t, uint64_t, or even __uint128_t? Even if V remains uint8_t, could you type-pun it to load 8 or 16 bytes at a time and do the multiplication at that width? (What is the native width of your machine?)
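In the same spirit of doing more work per iteration, here is a sketch (my own, not from the original post) of a two-pass scheme: pass 1 has no loop-carried dependency, so GCC can auto-vectorize it; only the cheap carry ripple in pass 2 stays scalar. BUF is an arbitrary scratch size chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUF 4096

uint8_t ConstMultTwoPass(uint8_t *V, size_t N, uint8_t digit)
{
    static uint16_t prod[BUF];   /* per-digit products, at most 81 */
    uint8_t carry = 0;
    for (size_t base = 0; base < N; base += BUF) {
        size_t n = (N - base < BUF) ? N - base : BUF;

        /* Pass 1: independent multiplies -- vectorizable, no gather. */
        for (size_t i = 0; i < n; ++i)
            prod[i] = (uint16_t)(V[base + i] * digit);

        /* Pass 2: serial carry propagation (carry stays <= 8). */
        for (size_t i = 0; i < n; ++i) {
            unsigned r = prod[i] + carry;
            carry = (uint8_t)(r / 10);
            V[base + i] = (uint8_t)(r % 10);
        }
    }
    return carry;
}
```

For example, with V holding {5, 5} least-significant digit first and digit 7, this leaves {5, 8} in V and returns a carry of 3 (55 * 7 = 385).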
Here's some code that's in C++, so not directly applicable to your case, but you might find it useful: https://quuxplusone.github.io/blog/2020/02/13/wide-integer-proof-of-concept/ The code itself uses some x86 compiler intrinsics that may be relevant to your interests.
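For the intrinsics route on a Core i7 950 (SSE2 through SSE4.2, no AVX), here is a hypothetical sketch, not a drop-in replacement: multiply 8 digits at a time in 16-bit lanes, split each product into quotient and remainder with the multiply-high trick ((x * 6554) >> 16 == x / 10 for x <= 16383, and products are at most 81), and finish the unavoidable cross-lane carry ripple with a cheap scalar pass.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

uint8_t ConstMultSSE2(uint8_t *V, size_t N, uint8_t digit)
{
    uint8_t carry = 0;
    size_t i = 0;
    const __m128i vdigit = _mm_set1_epi16(digit);
    const __m128i vmagic = _mm_set1_epi16(6554);
    const __m128i vten   = _mm_set1_epi16(10);
    for (; i + 8 <= N; i += 8) {
        /* Zero-extend 8 digit bytes into 16-bit lanes. */
        __m128i x = _mm_unpacklo_epi8(
            _mm_loadl_epi64((const __m128i *)(V + i)), _mm_setzero_si128());
        __m128i p = _mm_mullo_epi16(x, vdigit);               /* p <= 81 */
        __m128i q = _mm_mulhi_epu16(p, vmagic);               /* p / 10 */
        __m128i r = _mm_sub_epi16(p, _mm_mullo_epi16(q, vten)); /* p % 10 */
        uint16_t quot[8], rem[8];
        _mm_storeu_si128((__m128i *)quot, q);
        _mm_storeu_si128((__m128i *)rem, r);
        for (int k = 0; k < 8; ++k) {   /* serial carry ripple */
            unsigned d = rem[k] + carry;
            carry = (uint8_t)(quot[k] + (d >= 10));
            V[i + k] = (uint8_t)(d >= 10 ? d - 10 : d);
        }
    }
    for (; i < N; ++i) {                /* scalar tail */
        unsigned t = V[i] * digit + carry;
        carry = (uint8_t)(t / 10);
        V[i] = (uint8_t)(t % 10);
    }
    return carry;
}
```

The scalar ripple works because each lane's remainder is at most 9 and the incoming carry is at most 9, so a single conditional subtract of 10 suffices; whether this beats the table version on real hardware would have to be measured.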
- It looks like you reviewed code that was behind a link. It also looks like the question was edited after the answer, potentially invalidating (part of) your answer. I'm not really sure what to make of this, so I'll leave it be for now. Please comment or rollback yourself if there is a problem with the edit(s) on the question. Thanks! (Commented Apr 20, 2020 at 18:48)
- Yes, the question's "fastest version so far" didn't look nearly so complicated when I answered. But whatever; I'm certainly not losing sleep over it. @Marc: please don't edit the code in a question after answers have been posted. (Quuxplusone, Commented Apr 20, 2020 at 19:14)
- -O3?