Reusing something unrelated as a hardware adder for code optimization

Question 1

I am running into some speed issues in the ISR. Due to the nature of my code, I need to add 20 values (16bit) in an ISR. The code looks like this

 #define Element_size 4
 #define Chunk_size 5
 void add(void)
 { 
 int i;
 int j;
 for (i=0;i<Element_Size;i++)
 { 
 sum[i]=0;
 for (j=0;i<Chunk_Size;j++)
 {
 sum[i]+=data[j*4+i];
 }
 }
 }

This is not the exact code but the algo is identical. (data is laid out as chunks with size of 4, in this case I have 5 chunks of data. I add the first element of every chunk, later the second element etc. etc.) Assume sum and data are globals. This takes longer time than I have. I am looking for a solution to speed things up. So far, I can think of:

Improve the algo: I can make this one turn (or without a loop) instead of two but if the chunk sizes change (now a compile time option), it will break. I will do this as last resort once the code is final and no other changes expected.
Move to assembly: I will do this but it will take time, a viable option
Make the function inline: This will save some time but not a whole lot as the majority of the time is spent during summation, I will do this anyway.
Use some hardware block for summation: I am using STM32F2 family (Arm Cortex M3), it has several good stuff in there but no ALU. I looked at the crc, hash and cyrpto libs but couldn't find anything that jumps out that can be used as a hardware adder.

Especially, if one of the masters among you suggest a hardware assisted way to do this calculation, I would be in debt. I am looking an out of the box way of doing this, not just code optimization. This is really the key question I have.

Question 2

Without knowing what you are running on and what you are compiling with, my guess is that the "j*4" may be the slowest part of the loop. If it is doing the multiplication in software it will be slow. If the compiler isn't optimizing it, you could try using "j<<2" instead, which shifts left twice, thus doing the same thing as multiplying by 4. On the dsPIC chips I almost exclusively use the DSPs math routines, and my god is it faster than using * and /!

Question 3

Some more details might help: Is the optimiser on? Can you post the assembler code the compiler produces? What clock rate? How long have you got to do this in - are you missing the deadline by loads, or just a bit? What type of memory are the samples and the sums stored in?

Question 4

As a first step I would look at the generated assembly so see what takes the most time. Next determine how much time you have available, and make a rough estimate whether the most optimized solution (assembly, both loops unrolled) can be fast enough. If not,look voor a very different solution (faster chip, different algorithm, make the code that changes the arrays do precompute work like davidcarry suggested, etc).

When unrolling you can at least include an assertion that the #define's are still the value you assumed when you unrolled. But using reasonable macro language unrolling can probably be done by a macro that takes the #defines's into account. BUT after each chance of the #define's you would have to re-test whether you fulfill your timeconstraints, so maybe you don't want automatic adaptation at all.

I don't understand why you are looking for a "can be used as a hardware adder." Your CPU can add two 32 bit values as fast as you can supply them! Which hints at a possible optimization: pack two 16-bit values in a 32-bit word. If you can guarantee that the 16-bit sum never overflows this might speed you up by a factor of 2. Maybe more, this is a 32-bit processor that is probably more comfortable with accessing 32-bit data instead of 16-bit data. Even without the packing trick it might be a good idea to try using 32-bit integers to speed things up.

As for the unrolling: unrolling the inner loop will be more effective than unrolling the outer loop. You might swap the loops, then unroll for the element size, and keep the for() for the chunk size. This way j*4 can be computed once.

You use the x[ i ] style of acessing array elements. I don't know how clever your compiler is, if it takes your index expressions literally it will generate much more efficient code if you use the *p++ style. But again: first check what the compiler does, otherwise you might be doing work for nothing.

Assembler versus C: Never underestimate a compiler. Before you try assembly, first check what the compiler makes, and think why. Better give the compiler every opportunity to optimize than do it yourself in assembly.

Question 5

Pretty reasonable answer. One more thing that could be done, skip the multiplication completely by incrementing j by 4 instead of 1. for (j=0;j<Chunk_Size*4;j+=4)

Question 6

Even that short loop already looks like far too much for a first-level interrupt handler. Perhaps one of these two approaches will help:

partial update

Where do these numbers you are summing come from? Is there some way you could somehow update the sum to the correct value without actually running through all 20 values every time?

I.e., something like

#define Element_size 4
#define Chunk_size 5
void update( int x, int y, int new_value ){
 int old_value = data[x*Element_size+y];
 sum[y] -= old_value;
 sum[y] += new_value;
 data[x*Element_size+y] = new_value;
}

second-level interrupt handler

Is there some way you could do as little as possible in the first-level interrupt handler, and then later do the sum in the second-level interrupt handler (perhaps in the background "main loop"?) The Wikipedia: "interrupt handler" and Wikibooks: Embedded Control Systems Design/Design Patterns mentions moving as much as possible from the first-level interrupt handler to the second-level interrupt handler.

Question 7

these are good suggestions but wouldn't work out since in the same ISR, I make a life and death decision and that needs to be hard realtime. When this ISR is running, I stop everything, nothing can preempt this and this should deliver all the time.

score 4 · Accepted Answer · 2011-08-24 17:35:22Z

As a first step I would look at the generated assembly so see what takes the most time. Next determine how much time you have available, and make a rough estimate whether the most optimized solution (assembly, both loops unrolled) can be fast enough. If not,look voor a very different solution (faster chip, different algorithm, make the code that changes the arrays do precompute work like davidcarry suggested, etc).

When unrolling you can at least include an assertion that the #define's are still the value you assumed when you unrolled. But using reasonable macro language unrolling can probably be done by a macro that takes the #defines's into account. BUT after each chance of the #define's you would have to re-test whether you fulfill your timeconstraints, so maybe you don't want automatic adaptation at all.

I don't understand why you are looking for a "can be used as a hardware adder." Your CPU can add two 32 bit values as fast as you can supply them! Which hints at a possible optimization: pack two 16-bit values in a 32-bit word. If you can guarantee that the 16-bit sum never overflows this might speed you up by a factor of 2. Maybe more, this is a 32-bit processor that is probably more comfortable with accessing 32-bit data instead of 16-bit data. Even without the packing trick it might be a good idea to try using 32-bit integers to speed things up.

As for the unrolling: unrolling the inner loop will be more effective than unrolling the outer loop. You might swap the loops, then unroll for the element size, and keep the for() for the chunk size. This way j*4 can be computed once.

You use the x[ i ] style of acessing array elements. I don't know how clever your compiler is, if it takes your index expressions literally it will generate much more efficient code if you use the *p++ style. But again: first check what the compiler does, otherwise you might be doing work for nothing.

Assembler versus C: Never underestimate a compiler. Before you try assembly, first check what the compiler makes, and think why. Better give the compiler every opportunity to optimize than do it yourself in assembly.

Pretty reasonable answer. One more thing that could be done, skip the multiplication completely by incrementing j by 4 instead of 1. for (j=0;j<Chunk_Size*4;j+=4)

Stack Exchange Network

Reusing something unrelated as a hardware adder for code optimization

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Reusing something unrelated as a hardware adder for code optimization

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions