I have code in C that multiplies each element of an array by a number (0-9), resulting in a series of base 10 digits.
I compile with
gcc -xc -Ofast -msse2 -flax-vector-conversions -ffast-math -funroll-all-loops --param max-unroll-times=50 -ftree-vectorize -fopt-info-vec-missed
My processor is a Core i7 950. My problem is that this function takes longer to run than I expected (8 seconds in my fastest version). I need it to be faster.
I know that my problem is that GCC can't vectorize the principal loop because of this error:
not vectorized: not suitable for gather load x_60 = table[_59]
How can this code be modified to solve this problem and make the code faster? It's fine for the solution to use intrinsics or other specialized techniques.
Compilable code with different tests here.
My fastest version so far is this:
uint8_t ConstMul(uint8_t *V, size_t N, uint8_t digit){
#define TABLE_SIZE ((9*256 + 9)*9 + 9 + 1)
    static uint32_t table[TABLE_SIZE];
    if(!table[1]){
        #pragma simd
        for(uint32_t i = 0; i < TABLE_SIZE; ++i){
            uint32_t u = i % 256 % 10;
            uint32_t d = (i / 256 + i % 256 / 10) % 10;
            uint32_t c = (i / 256 + i % 256 / 10) / 10;
            table[i] = c | (u << 8) | (d << 16);
        }
    }
    if(N == 0 || digit <= 1){
        if(digit == 0) memset(V, 0, N);
        return 0;
    }else{
        size_t CARRY = 0;
        if((uintptr_t)V & 1){
            int R = V[0] * digit + (uint8_t)CARRY;
            CARRY = (uint8_t)(R / 10);
            V[0] = (uint8_t)(R - CARRY * 10);
            V++;
            N--;
        }
        {
            uint16_t *V2 = (uint16_t *)(void *)V;
            size_t N2 = N / 2;
            for(size_t i = 0; i < N2; ++i){
                uint32_t x = table[V2[i] * digit + CARRY];
                V2[i] = (uint16_t)(x >> 8);
                CARRY = (uint8_t)x;
            }
        }
        if(N & 1){
            int R = V[N-1] * digit + (uint8_t)CARRY;
            CARRY = (uint8_t)(R / 10);
            V[N-1] = (uint8_t)(R - CARRY * 10);
        }
        return (uint8_t)CARRY;
    }
#undef TABLE_SIZE
}
But I also tried these approaches which were slower:
void ConstMult(uint8_t *V, size_t N, uint8_t digit)
{
    uint8_t CARRY = 0;
    for(size_t i = 0; i < N; ++i)
    {
        V[i] = V[i] * digit + CARRY;
        CARRY = ((uint32_t)V[i] * (uint32_t)0xCCCD) >> 19;
        V[i] -= (CARRY << 3) + (CARRY << 1);
    }
}
uint8_t ConstMult(uint8_t *V, size_t N, uint8_t digit)
{
    uint8_t CARRY = 0;
    for(size_t i = 0; i < N; i++)
    {
        char R = V[i] * digit + CARRY;
        CARRY = R / 10;
        R = R - CARRY * 10;
        V[i] = R;
    }
    return CARRY; // may be from 0 to 9
}
uint8_t ConstMult(uint8_t *V, size_t N, uint8_t digit)
{
    uint8_t CARRY = 0;
    uint8_t ja = 0;
    for(size_t i = 0; i < N; ++i){
        uint8_t aux = V[i] * digit;
        uint8_t R = aux + CARRY;
        CARRY = ((uint32_t)R * (uint32_t)0xCCCD) >> 19;
        ja = (CARRY << 3) + 2*CARRY;
        R -= ja;
        V[i] = R;
    }
    return CARRY;
}
1 Answer
Review of your Godbolt-fu: https://godbolt.org/z/doD3Ld
You can change the language from "C++" to "C" via the dropdown in the upper right corner of the source-code-editor pane.
You wanted to be using "gcc (trunk)", not "gcc (modules)", anyway.
The biggest contributor to running time must be that uint8_t digit is being provided as a runtime parameter instead of a compile-time parameter. But your benchmark only ever calls LongNumConstMult with 9, 8, 7, and 3. You should benchmark what happens if you write four different versions of the code: one with static const int digit = 9; at the top, one with static const int digit = 8;, and so on. Maybe that won't meet your design requirements, but it will give you a nice bound on what kind of improvement might be possible.
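One cheap way to run that experiment is to stamp out one copy of the loop per digit with a macro, so the multiplier and the divisor are both compile-time constants. This is only a sketch of the idea; the names (DEFINE_CONST_MULT, ConstMult9) are my own, not from the original post.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Generate one function per digit. With D a literal, GCC can
   strength-reduce both the multiply and the division by 10, and has a
   much better shot at vectorizing the body. */
#define DEFINE_CONST_MULT(D)                                   \
    uint8_t ConstMult##D(uint8_t *V, size_t N)                 \
    {                                                          \
        uint8_t carry = 0;                                     \
        for (size_t i = 0; i < N; ++i) {                       \
            unsigned r = V[i] * D + carry;                     \
            carry = (uint8_t)(r / 10); /* constant divisor */  \
            V[i]  = (uint8_t)(r % 10);                         \
        }                                                      \
        return carry;                                          \
    }

DEFINE_CONST_MULT(3)
DEFINE_CONST_MULT(7)
DEFINE_CONST_MULT(8)
DEFINE_CONST_MULT(9)
```

For example, with V holding {9, 9, 9} least-significant digit first, ConstMult9 leaves {1, 9, 9} in V and returns a carry of 8 (999 * 9 = 8991).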
I infer that maybe you only need to handle 10 different digits. In that case, you could implement the runtime-parameterized LongNumConstMult as
void LongNumConstMult(uint8_t *V, size_t N, uint8_t digit)
{
    switch (digit) {
        case 0: return LongNumSetTo0(V, N);
        case 1: return; // no-op
        case 2: return LongNumConstMult2(V, N);
        case 3: return LongNumConstMult3(V, N);
        [...]
        case 8: return LongNumConstMult8(V, N);
        case 9: return LongNumConstMult9(V, N);
    }
}
I predict that "one branch at the beginning, followed by many constant multiplications in a loop" might well beat "many non-constant multiplications in a loop."
"It's fine for the solution to use intrinsics or other specialized techniques"
What about making V an array of uint16_t, uint32_t, uint64_t, or even __uint128_t? Even if V remains uint8_t, could you type-pun it to load 8 or 16 bytes at a time and do the multiplication at that width? (What is the native width of your machine?)
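In the same spirit of doing more work per iteration, here is a sketch (my own, not from the original post) of a two-pass scheme: pass 1 has no loop-carried dependency, so GCC can auto-vectorize it; only the cheap carry ripple in pass 2 stays scalar. BUF is an arbitrary scratch size chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUF 4096

uint8_t ConstMultTwoPass(uint8_t *V, size_t N, uint8_t digit)
{
    static uint16_t prod[BUF];   /* per-digit products, at most 81 */
    uint8_t carry = 0;
    for (size_t base = 0; base < N; base += BUF) {
        size_t n = (N - base < BUF) ? N - base : BUF;

        /* Pass 1: independent multiplies -- vectorizable, no gather. */
        for (size_t i = 0; i < n; ++i)
            prod[i] = (uint16_t)(V[base + i] * digit);

        /* Pass 2: serial carry propagation (carry stays <= 8). */
        for (size_t i = 0; i < n; ++i) {
            unsigned r = prod[i] + carry;
            carry = (uint8_t)(r / 10);
            V[base + i] = (uint8_t)(r % 10);
        }
    }
    return carry;
}
```

For example, with V holding {5, 5} least-significant digit first and digit 7, this leaves {5, 8} in V and returns a carry of 3 (55 * 7 = 385).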
Here's some code that's in C++, so not directly applicable to your case, but you might find it useful: https://quuxplusone.github.io/blog/2020/02/13/wide-integer-proof-of-concept/ The code itself uses some x86 compiler intrinsics that may be relevant to your interests.
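For the intrinsics route on a Core i7 950 (SSE2 through SSE4.2, no AVX), here is a hypothetical sketch, not a drop-in replacement: multiply 8 digits at a time in 16-bit lanes, split each product into quotient and remainder with the multiply-high trick ((x * 6554) >> 16 == x / 10 for x <= 16383, and products are at most 81), and finish the unavoidable cross-lane carry ripple with a cheap scalar pass.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

uint8_t ConstMultSSE2(uint8_t *V, size_t N, uint8_t digit)
{
    uint8_t carry = 0;
    size_t i = 0;
    const __m128i vdigit = _mm_set1_epi16(digit);
    const __m128i vmagic = _mm_set1_epi16(6554);
    const __m128i vten   = _mm_set1_epi16(10);
    for (; i + 8 <= N; i += 8) {
        /* Zero-extend 8 digit bytes into 16-bit lanes. */
        __m128i x = _mm_unpacklo_epi8(
            _mm_loadl_epi64((const __m128i *)(V + i)), _mm_setzero_si128());
        __m128i p = _mm_mullo_epi16(x, vdigit);               /* p <= 81 */
        __m128i q = _mm_mulhi_epu16(p, vmagic);               /* p / 10 */
        __m128i r = _mm_sub_epi16(p, _mm_mullo_epi16(q, vten)); /* p % 10 */
        uint16_t quot[8], rem[8];
        _mm_storeu_si128((__m128i *)quot, q);
        _mm_storeu_si128((__m128i *)rem, r);
        for (int k = 0; k < 8; ++k) {   /* serial carry ripple */
            unsigned d = rem[k] + carry;
            carry = (uint8_t)(quot[k] + (d >= 10));
            V[i + k] = (uint8_t)(d >= 10 ? d - 10 : d);
        }
    }
    for (; i < N; ++i) {                /* scalar tail */
        unsigned t = V[i] * digit + carry;
        carry = (uint8_t)(t / 10);
        V[i] = (uint8_t)(t % 10);
    }
    return carry;
}
```

The scalar ripple works because each lane's remainder is at most 9 and the incoming carry is at most 9, so a single conditional subtract of 10 suffices; whether this beats the table version on real hardware would have to be measured.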
- It looks like you reviewed code that was behind a link. It also looks like the question was edited after the answer, potentially invalidating (part of) your answer. I'm not really sure what to make of this, so I'll leave it be for now. Please comment or rollback yourself if there is a problem with the edit(s) on the question. Thanks! (Commented Apr 20, 2020 at 18:48)
- Yes, the question's "fastest version so far" didn't look nearly so complicated when I answered. But whatever; I'm certainly not losing sleep over it. @Marc: please don't edit the code in a question after answers have been posted. (Quuxplusone, Commented Apr 20, 2020 at 19:14)
- -O3?