Considering the design limitations of the hardware (clock speed, memory, etc.), is there a better way to write IF statements more efficiently in C (e.g. running on an Arduino), more specifically for the following piece of embedded software?
if (A >= 476 && A <= 524) A = 500;
else if (A > 524 && A <= 860) A = (A * 1.4) - 760;
else if (A > 136 && A < 476) A = (A * 1.4) - 685;
else if (A > 860) A = 1000;
else if (A <= 136) A = 100;
A is passed by value from a sub-routine, and that value is then transformed into a different value according to which of a set of ranges it falls in. It has to run as fast as possible!
UPDATE!
A is a variable of integer type, i.e. int.
SUMMARY:
I summarized here the options I learned from you:
a) improve structure
- put the comparison for the most probable inputs first;
- eliminate && operators.
b) eliminate floating point calculation
c) create look up tables
d) assign more adequate variable type
e) write code in abbreviated form (*= , += , -= )
a) Indeed, I already put the comparison if (A >= 476 && A <= 524) first, because I expect most of the values passed by the sub-routine to fall within that range, so most of the time the program doesn't evaluate the remaining if statements.
b) I reviewed the floating point calculation and consider it a bad choice for this piece of code. It can definitely be avoided by using integer division (a sketch follows below).
c) The lookup table seems a good idea, but not for this application, because the program has at least three more of such clustered comparisons; memory might become a problem.
d) The variable was declared as int although it never really becomes negative. Unsigned int is what I assume is better, but I would say it doesn't make the code faster.
e) Writing the code in abbreviated form (*=, +=, -=) also seems like an improvement idea; I just don't know for sure how different compilers interpret it or whether it really saves much time.
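For b), a minimal sketch of what the integer version might look like (the function name transform is only a placeholder; it assumes A is a 16-bit int and that the small rounding difference of 7/5 versus 1.4 is acceptable):
int transform(int A)
{
    if (A >= 476 && A <= 524) return 500;
    else if (A > 524 && A <= 860) return (A * 7) / 5 - 760;  /* roughly A * 1.4 - 760 */
    else if (A > 136 && A < 476) return (A * 7) / 5 - 685;   /* roughly A * 1.4 - 685 */
    else if (A > 860) return 1000;
    else return 100;                                          /* A <= 136 */
}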
Thank you.
- You can very well re-structure your code to look more logical. If you've checked for A not being smaller than 476, you can drop half of your conditions instantly. (Marcus Müller, Aug 23, 2016)
- If you are worried about speed on an Arduino, don't use floating point (especially double precision). All of the rest of the optimizations, if any, pale by comparison. (old_timer, Aug 23, 2016)
- You can help perhaps by re-ordering, but the compilers are pretty good at figuring things out as well. Look at the compiler output and adjust, look and adjust; every so often time it, because fewer instructions doesn't necessarily mean faster. (old_timer, Aug 23, 2016)
- If you were truly after speed, a lookup table would apply here; your numbers are all even and the largest is 860, so a 430-deep lookup table, and you could then do a jump table with its output. Wait, you have a greater-than-860 case, so I don't know how high A can go and therefore how deep the table needs to be. (old_timer, Aug 23, 2016)
- "It has to run as fast as possible!" Why? (Bruce Abbott, Aug 23, 2016)
4 Answers
The floating point is the killer here. It appears to double the amount of code generated, not that a larger number of instructions is necessarily slower.
It has to make a number of calls to what I assume are float-to-int and int-to-float conversion routines. Interestingly, changing from double (1.4) to single float (1.4F) didn't change the overall result, although again the code is making a lot of calls (I assume to float conversions or operations).
So even the times-7 has to be synthesized:
unsigned int fun ( unsigned int A )
{
A = A*7;
return(A);
}
00000000 <fun>:
0: 28 2f mov r18, r24
2: 39 2f mov r19, r25
4: 22 0f add r18, r18
6: 33 1f adc r19, r19
8: 22 0f add r18, r18
a: 33 1f adc r19, r19
c: 22 0f add r18, r18
e: 33 1f adc r19, r19
10: 42 2f mov r20, r18
12: 53 2f mov r21, r19
14: 48 1b sub r20, r24
16: 59 0b sbc r21, r25
18: 84 2f mov r24, r20
1a: 95 2f mov r25, r21
1c: 08 95 ret
So one might try this as an alternative.
unsigned int fun ( unsigned int A )
{
A = (A<<2)+(A<<1)+A;
return(A);
}
00000000 <fun>:
0: 28 2f mov r18, r24
2: 39 2f mov r19, r25
4: 22 0f add r18, r18
6: 33 1f adc r19, r19
8: 22 0f add r18, r18
a: 33 1f adc r19, r19
c: 48 2f mov r20, r24
e: 59 2f mov r21, r25
10: 44 0f add r20, r20
12: 55 1f adc r21, r21
14: 24 0f add r18, r20
16: 35 1f adc r19, r21
18: 82 0f add r24, r18
1a: 93 1f adc r25, r19
1c: 08 95 ret
Same number of instructions. To see this more easily, with 8-bit values:
unsigned char fun ( unsigned char A )
{
A = A * 7;
return(A);
}
00000000 <fun>:
0: 98 2f mov r25, r24
2: 99 0f add r25, r25
4: 99 0f add r25, r25
6: 99 0f add r25, r25
8: 98 1b sub r25, r24
a: 89 2f mov r24, r25
c: 08 95 ret
And using shifting and adding:
00000000 <fun>:
0: 98 2f mov r25, r24
2: 99 0f add r25, r25
4: 99 0f add r25, r25
6: 28 2f mov r18, r24
8: 22 0f add r18, r18
a: 92 0f add r25, r18
c: 89 0f add r24, r25
e: 08 95 ret
Duh, right: A*7 = (A<<3) - A;
00000000 <fun>:
0: 98 2f mov r25, r24 r25 = A
2: 99 0f add r25, r25 r25 = A + A = 2A
4: 99 0f add r25, r25 r25 = 2A + 2A = 4A
6: 99 0f add r25, r25 r25 = 4A + 4A = 8A
8: 98 1b sub r25, r24 r25 = 8A - A = 7A
a: 89 2f mov r24, r25 return 7A
c: 08 95 ret
The (A<<2)+(A<<1)+A source gives this:
00000000 <fun>:
0: 98 2f mov r25, r24 r25 = A
2: 99 0f add r25, r25 r25 = A + A = 2A
4: 99 0f add r25, r25 r25 = 2A + 2A = 4A
6: 28 2f mov r18, r24 r18 = A
8: 22 0f add r18, r18 r18 = A + A = 2A
a: 92 0f add r25, r18 r25 = 4A + 2A = 6A
c: 89 0f add r24, r25 return (6A + A)
e: 08 95 ret
But the compiler missed a possible optimization:
mov r25, r24 r25 = A
add r25, r25 r25 = A + A = 2A
mov r18, r25 r18 = 2A
add r25, r25 r25 = 2A + 2A = 4A
add r25, r18 r25 = 4A + 2A = 6A
add r24, r25 return (6A + A)
ret
Same number of instructions at least; ideally they execute in the same time. Looking at these trip points:
if ( A <= 136 ) A= 100;
else if ( A < 476 ) A= (A * 1.4) - 685;
else if ( A <= 524 ) A= 500;
else if ( A <= 860 ) A= (A * 1.4) - 760;
else A= 1000;
A is obviously not an 8-bit number; it is at least 16 bits. If it is a double or a float, I give up, game over. Assuming 16 bits...
Here are the constants in hex.
136   0x088
476   0x1DC
524   0x20C
860   0x35C
Looking at this
unsigned int fun ( unsigned int A )
{
if ( A <= 136 ) A= 1;
else if ( A < 476 ) A=2;
else if ( A <= 524 ) A=3;
else A= 4;
return(A);
}
00000000 <fun>:
0: 89 38 cpi r24, 0x89 ; 137
2: 91 05 cpc r25, r1
4: 00 f0 brcs .+0 ; 0x6 <fun+0x6>
6: 8c 3d cpi r24, 0xDC ; 220
8: 21 e0 ldi r18, 0x01 ; 1
a: 92 07 cpc r25, r18
c: 00 f4 brcc .+0 ; 0xe <fun+0xe>
e: 82 e0 ldi r24, 0x02 ; 2
10: 90 e0 ldi r25, 0x00 ; 0
12: 08 95 ret
14: 8d 30 cpi r24, 0x0D ; 13
16: 92 40 sbci r25, 0x02 ; 2
18: 00 f4 brcc .+0 ; 0x1a <fun+0x1a>
1a: 83 e0 ldi r24, 0x03 ; 3
1c: 90 e0 ldi r25, 0x00 ; 0
1e: 08 95 ret
20: 81 e0 ldi r24, 0x01 ; 1
22: 90 e0 ldi r25, 0x00 ; 0
24: 08 95 ret
26: 84 e0 ldi r24, 0x04 ; 4
28: 90 e0 ldi r25, 0x00 ; 0
2a: 08 95 ret
Does this help?
unsigned int fun ( unsigned int A )
{
unsigned char B = A>>2;
if ( B <= (136>>2) ) A= 1;
else if ( B < (476>>2) ) A=2;
else if ( B <= (524>>2) ) A=3;
else A= 4;
return(A);
}
00000000 <fun>:
0: 96 95 lsr r25
2: 87 95 ror r24
4: 96 95 lsr r25
6: 87 95 ror r24
8: 83 32 cpi r24, 0x23 ; 35
a: 00 f0 brcs .+0 ; 0xc <fun+0xc>
c: 87 37 cpi r24, 0x77 ; 119
e: 00 f4 brcc .+0 ; 0x10 <fun+0x10>
10: 82 e0 ldi r24, 0x02 ; 2
12: 90 e0 ldi r25, 0x00 ; 0
14: 08 95 ret
16: 84 38 cpi r24, 0x84 ; 132
18: 00 f4 brcc .+0 ; 0x1a <fun+0x1a>
1a: 83 e0 ldi r24, 0x03 ; 3
1c: 90 e0 ldi r25, 0x00 ; 0
1e: 08 95 ret
20: 81 e0 ldi r24, 0x01 ; 1
22: 90 e0 ldi r25, 0x00 ; 0
24: 08 95 ret
26: 84 e0 ldi r24, 0x04 ; 4
28: 90 e0 ldi r25, 0x00 ; 0
2a: 08 95 ret
ehh, man this instruction set is painful.
So, looking at this:
unsigned int fun ( unsigned int A )
{
if ( A <= 136 ) A= 100;
else if ( A < 476 ) A= (A * 7);
else if ( A <= 524 ) A= 500;
else if ( A <= 860 ) A= (A * 7);
else A= 1000;
return(A);
}
The compiler is smart enough to lump the A*7's together.
With this, though:
unsigned int fun ( unsigned int A )
{
if ( A <= 136 ) A= 100;
else if ( A < 476 ) A= (A * 7) - 685;
else if ( A <= 524 ) A= 500;
else if ( A <= 860 ) A= (A * 7) - 760;
else A= 1000;
return(A);
}
It does not combine the A*7s. Note the 16-bit A*7 looks like this:
30: 28 2f mov r18, r24
32: 39 2f mov r19, r25
34: 22 0f add r18, r18
36: 33 1f adc r19, r19
38: 22 0f add r18, r18
3a: 33 1f adc r19, r19
3c: 22 0f add r18, r18
3e: 33 1f adc r19, r19
40: 42 2f mov r20, r18
42: 53 2f mov r21, r19
44: 48 1b sub r20, r24
46: 59 0b sbc r21, r25
48: 84 2f mov r24, r20
4a: 95 2f mov r25, r21
4c: 08 95 ret
For ease of reading and coding, here are the comparisons in order:
if ( A > 860 ) A= 1000;
else if ( A > 524 ) A= (A * 1.4) - 760;
else if ( A >= 476 ) A= 500;
else if ( A > 136 ) A= (A * 1.4) - 685;
else A= 100;
if ( A <= 136 ) A= 100;
else if ( A < 476 ) A= (A * 1.4) - 685;
else if ( A <= 524 ) A= 500;
else if ( A <= 860 ) A= (A * 1.4) - 760;
else A= 1000;
That might shave some instructions for the worst-case path. But if your data happens to fall in a bell curve of some sort, with most of the samples say between 500 and 600, then you want those ranges compared first and let the others fall through. It may or may not be worth arranging the two ranges that end with A*1.4 as the last two, so that before those compares you do a single B = A * 1.4, do the compare for one of the ranges, and then subtract either A = B - 685 or A = B - 760. That saves some code space, but not necessarily any performance: with an if-then-else tree like this, re-ordering is not going to save you much. Some paths get shorter, some get longer; you may add or save a few instructions here and there, but it is still a pretty long path for this code on this architecture.
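A sketch of that shared-B idea, using the integer 7/5 form instead of 1.4 (transform is a placeholder name; it assumes a 16-bit int A and puts the most probable range first):
int transform(int A)
{
    if (A >= 476 && A <= 524) return 500;    /* most probable range first */
    if (A > 860) return 1000;
    if (A <= 136) return 100;

    int B = (A * 7) / 5;                     /* computed once for both remaining ranges */
    return (A > 524) ? (B - 760) : (B - 685);
}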
I didn't do a full link against the library, but looking at readelf:
11: 00000000 0 NOTYPE GLOBAL DEFAULT UND __floatunsisf
12: 00000000 0 NOTYPE GLOBAL DEFAULT UND __mulsf3
13: 00000000 0 NOTYPE GLOBAL DEFAULT UND __subsf3
14: 00000000 0 NOTYPE GLOBAL DEFAULT UND __fixunssfsi
Those floating point library routines are going to kill you. And that answers something: even though the C standard says that A * 1.4 is a double operation, the compiler is clearly doing single-precision float, ignoring the standard (which is why 1.4 versus 1.4F made no difference).
Changing the A * 1.4 to A * 7 / 5 produces the long string of adds for the times-7 and does the divide with a gcc library routine:
12: 00000000 0 NOTYPE GLOBAL DEFAULT UND __udivmodhi4
That is going to save on overall code space and should save on execution time compared with having to go through the generic floating point libraries.
Man, I didn't realize you were using signed numbers.
else if ( A > 524 ) A= (A * 7 / 5) - 760;
else if ( A > 136 ) A= (A * 7 / 5) - 685;
Both can produce negative results; I assume the value coming in is positive and the value going out is signed?
Okay, as I expected, this actually gets us somewhere:
unsigned int fun ( unsigned int A )
{
unsigned char B;
B = (A>>8)&0xFF;
if(B==0) A = 7;
if(B==1) A = 18;
if(B==2) A = 307;
if(B==3) A = 211;
return(A);
}
0: 99 23 and r25, r25
2: 01 f0 breq .+0 ; 0x4 <fun+0x4>
4: 91 30 cpi r25, 0x01 ; 1
6: 01 f4 brne .+0 ; 0x8 <fun+0x8>
8: 82 e1 ldi r24, 0x12 ; 18
a: 90 e0 ldi r25, 0x00 ; 0
c: 08 95 ret
We know that it is using one register for the upper half and one for the lower half, so the shift by 8, with or without the AND (included for completeness), simply tells the compiler to use the upper 8-bit register.
Now I am assuming an unsigned value coming in and a signed value going out, since you don't really have anything in the negative range coming in except for the everything-less-than-or-equal-to-136 case.
If you make a table of all the answers, it comes out to 679 unique answers; if only you could limit it to that.
Ugh, nothing is nice about these numbers.
If your input is unsigned, which I suspect it is not, you could look at the upper byte using the B = A >> 8 trick and, if it is greater than 4, return 1000. That only really saves you one 16-bit compare becoming an 8-bit compare, given this instruction set and the nature of your numbers.
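A sketch of that early-out, reading the test as "high byte at least 4" (i.e. A >= 1024, which is certainly above 860); the integer 7/5 math and the signed return are assumptions carried over from above:
int fun(unsigned int A)
{
    unsigned char B = (unsigned char)(A >> 8);   /* upper byte only: one 8-bit compare */
    if (B >= 4) return 1000;                     /* A >= 1024, so certainly above 860 */
    /* values 861..1023 still fall through to the normal 16-bit compares */
    if (A > 860) return 1000;
    else if (A > 524) return (int)((A * 7) / 5) - 760;
    else if (A >= 476) return 500;
    else if (A > 136) return (int)((A * 7) / 5) - 685;
    else return 100;
}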
The next optimization is to get rid of the floating point and replace it with (A * 7) / 5. That is the number one thing you can do to make this code faster.
If there are ranges of values you expect more than others, put those ranges first, so the code hops through fewer compares for the more common values. If it is an even distribution, then either of jonk's if-then-else trees makes it easier to read and might save an instruction here and there. The compiler may still move the tree around to suit its needs or an optimization it thinks it sees.
The killers are the two ranges that have the math. If you are willing to burn 960 bytes of program space, you can pre-build lookup tables for them:
extern unsigned int LUTONE[];
extern unsigned int LUTTWO[];
unsigned int fun ( unsigned int A )
{
if ( A <= 136 ) A= 100;
else if ( A < 476 ) A= LUTONE[A-137];
else if ( A <= 524 ) A= 500;
else if ( A <= 860 ) A= LUTTWO[A-525];
else A= 1000;
return(A);
}
The lookups look something like this
30: 88 0f add r24, r24
32: 99 1f adc r25, r25
34: e8 2f mov r30, r24
36: f9 2f mov r31, r25
38: e0 50 subi r30, 0x00 ; 0
3a: f0 40 sbci r31, 0x00 ; 0
3c: 80 81 ld r24, Z
3e: 91 81 ldd r25, Z+1 ; 0x01
40: 08 95 ret
42: 88 0f add r24, r24
44: 99 1f adc r25, r25
46: e8 2f mov r30, r24
48: f9 2f mov r31, r25
4a: e0 50 subi r30, 0x00 ; 0
4c: f0 40 sbci r31, 0x00 ; 0
4e: 80 81 ld r24, Z
50: 91 81 ldd r25, Z+1 ; 0x01
which is not much more math than the A*7. The loads on a platform like this depend on the speed of the flash relative to the clock; it is possibly a single cycle per load, if not then probably two. But the instruction fetches, plus the code to do the divide by 5 with a generic 16-bit division library, should be way, way more costly speed-wise. Resource-wise, the float libraries are probably smaller than the 900-some bytes of tables; maybe, you can try it and find out. If you are doing float elsewhere then you are burning that flash anyway, but that doesn't excuse using float on an AVR, especially in code you want to go fast, so still do the integer math if you don't use a lookup table.
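For completeness, the tables do not have to be typed in by hand; a small host-side program (not Arduino code, and purely a sketch based on the ranges and the 7/5 scaling used above, with signed entries because the low end of each range goes negative) could print the initialisers:
#include <stdio.h>

int main(void)
{
    int a;
    printf("const int LUTONE[] = { /* A = 137..475 */\n");
    for (a = 137; a <= 475; a++)
        printf("%d,%c", (a * 7) / 5 - 685, ((a - 137) % 10 == 9) ? '\n' : ' ');
    printf("\n};\n\nconst int LUTTWO[] = { /* A = 525..860 */\n");
    for (a = 525; a <= 860; a++)
        printf("%d,%c", (a * 7) / 5 - 760, ((a - 525) % 10 == 9) ? '\n' : ' ');
    printf("\n};\n");
    return 0;
}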
Short answer: get rid of the A * 1.4 and replace it with (A * 7) / 5. If you know the data is going to be heavy in certain ranges more than others, put those ranges first for performance; otherwise make it more readable with one of jonk's if-then-else trees, smallest range to largest or largest to smallest. You might save an instruction or two on the deeper cases.
- @tozheneznayu (A * 7)/5 is more expensive than a table look-up from flash. It is common in Arduino projects to have quite a lot of flash available, so the tables may fit. (gbulmer, Aug 24, 2016)
- @tozheneznayu You can and should do some experiments: build with float and build with integer math; does the binary (not the ELF file, but the actual flash image) change size much? Create a test using a timer that calls your function with a value in each of the ranges many (thousands of) times, time the duration, divide by the number of calls, and get a rough idea of how fast or slow it is. Rearrange the if/then/else tree, or switch between fixed point and floating point math, and see how that changes. (old_timer, Aug 24, 2016)
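A minimal sketch of the kind of timing test described in that comment (Arduino code; timeIt, the call count, and the function pointer are placeholders, and the loop overhead is included, so compare results only against each other):
void timeIt(int (*f)(unsigned int), unsigned int testValue)
{
    const unsigned long N = 10000UL;
    volatile int sink = 0;               /* stops the optimiser removing the calls */
    unsigned long start = micros();
    for (unsigned long i = 0; i < N; i++)
        sink = f(testValue);
    unsigned long elapsed = micros() - start;
    (void)sink;
    Serial.print(testValue);
    Serial.print(" -> ");
    Serial.print((float)elapsed / N);
    Serial.println(" us per call");
}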
- Each time one looks at assembler like this, one should think: "Why on earth am I not using a 32-bit MCU?" (Lundin, Aug 30, 2016)
- Instead of (A * 7)/5 one may want to try (A * 14)/10. Last time I checked, gcc did an amazing job on an integer divide-by-10, probably because that's a much-used operation. (JimmyB, Oct 5, 2016)
- It's the same with the divide by a power of two taken out. If they can divide by 10 well, they can divide by 5 even better, more accurately if you are using the multiply-by-a-fifth trick: shift, then multiply by 1/5 instead of multiplying by 1/10. If doing long division, then maybe the 10 is faster, sure; have to think about it. (old_timer, Oct 6, 2016)
Ordering of the conditions is the key here. This way the conditions "accumulate" from top to bottom, and each subsequent else if is executed only if all of the previous conditions are false. So you get the (not) AND operation for free.
if (A <= 136)
A = 100;
else if (A < 476)
A = (A * 1.4) - 685;
else if (A <= 524)
A = 500;
else if (A <= 860)
A = (A * 1.4) - 760;
else
A = 1000;
But as a side note: I would not worry here about code optimization, as it is insignificant in this case, but about making the code readable.
- Donald Knuth: premature optimization is the root of all evil. (user16222, Aug 23, 2016)
- This ain't premature optimization. (robert bristow-johnson, Aug 23, 2016)
- @robertbristow-johnson Any optimization that's not proven to be necessary is premature. Also, the cost of the floating point math dwarfs the cost of the flow control, so it's pretty much pointless (not quite the same as premature, but still). (marcelm, Aug 24, 2016)
- @marcelm, this code refactoring is not premature; it should be done immediately. Call it "optimization" if you want, but clean code is always better than shitty code. (robert bristow-johnson, Aug 24, 2016)
- If the motivation is to have clean, readable code, yes. But if the motivation is performance (as the OP said it was), I still maintain it qualifies as useless and/or premature. (marcelm, Aug 24, 2016)
Marcus pretty much tells you what to consider. There are two symmetrical but otherwise identical ways to improve the code.
(I don't think C compilers currently do the optimization needed on their own, though I've seen research papers on compiler theory that would take care of your situation... I just don't know of any compilers that implement the research.)
These are:
if ( A <= 136 ) A= 100;
else if ( A < 476 ) A= (A * 1.4) - 685;
else if ( A <= 524 ) A= 500;
else if ( A <= 860 ) A= (A * 1.4) - 760;
else A= 1000;
and,
if ( A > 860 ) A= 1000;
else if ( A > 524 ) A= (A * 1.4) - 760;
else if ( A >= 476 ) A= 500;
else if ( A > 136 ) A= (A * 1.4) - 685;
else A= 100;
Fewer explicit comparisons by ordering things. You get to choose what makes more sense to read.
You may, or may not, find that a compiler will do something different with:
A *= 1.4, A -= 760;
than it does with,
A= (A * 1.4) - 760;
You could experiment. I've seen dumb things done either way and smart things done either way, too. So what the heck.
Because of the multiplication by 1.4, the compiler is going to promote things to floating point and back again if your variable A is an integer. You would be MUCH BETTER off working this out using integer methods. In fact, if A is declared as an integer type of some kind, then it is the floating point conversions and operations that are killing you. Not the logic. You seriously need to reconsider that multiplication, if so.
So... tell me: is A an integer? If so, I'll edit my answer and give you a better approach, one that is much, much faster.
EDIT: Just as a sampler (and this is by no means all that can be done), if A is an integer, your two floating point calculations can be replaced by:
- \$A = \left(1.4 \cdot A \right) - 760 \rightarrow \frac{7\cdot A - 3800}{5}\$
- \$A = \left(1.4 \cdot A \right) - 685 \rightarrow \frac{7\cdot A - 3425}{5}\$
Both of these examples make the assumption that you can actually compute the numerator within the integer range, of course. But as your calculations are already ranged well, I am pretty sure that 16-bit integer values will succeed quite well with the above transformations. (Rounding questions do come up at this point, but that is a very minor detail to resolve and it doesn't materially change the above.)
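In C, dropped into the ordered tree, that might look like this (a sketch only; it assumes 16-bit int arithmetic is enough, which holds since 7*A stays in range for every value that reaches the math):
if ( A <= 136 ) A= 100;
else if ( A < 476 ) A= (7 * A - 3425) / 5;   /* integer form of (A * 1.4) - 685 */
else if ( A <= 524 ) A= 500;
else if ( A <= 860 ) A= (7 * A - 3800) / 5;  /* integer form of (A * 1.4) - 760 */
else A= 1000;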
But even that isn't the final word. More improvements can be found, as well.
- I think you got the 685 and the 760 swapped in both of your ordered code blocks above. (robert bristow-johnson, Aug 23, 2016)
- @robertbristow-johnson: Thanks. I probably did. Will fix. (jonk, Aug 23, 2016)
On a typical low-end MCU like the AVR, even up to a low-end Cortex-M0, flash and SRAM are similar in speed and can be accessed in one or two cycles; there is no cache or SDRAM latency to complicate the system.
In these MCUs, if you really want that to go as fast as practical, and have expensive operations that can be pre-calculated at compile time, then calculate everything that can be calculated at compile time, store the values in an array, and use A as the index of the array.
A float multiply is very expensive, so doing that at compile time will have an enormous impact on run time. Even in this specific case, where * 1.4 can be replaced by * 7 / 5, it is still expensive compared to array indexing. Of course, in the general case, it may be difficult to replace a float calculation with a cheaper integer calculation, so using look-up tables is always worth considering.
In this case, the code will use only two if statements to protect the array indexing, for example:
#include <avr/pgmspace.h>
#define calc_137_475(a) ((a * 1.4) - 685) // more expensive than array indexing
#define calc_525_860(a) ((a * 1.4) - 760) // more expensive than array indexing
const PROGMEM int16_t results[] = {
100, 100, 100, // ... 137 of these
calc_137_475(137), calc_137_475(138), calc_137_475(139), // ... etc
500, 500, 500, // .. etc
calc_525_860(525), calc_525_860(526), calc_525_860(527), // ... the other values ...
};
void setup() {
// put your setup code here, to run once:
}
void loop() {
// other stuff
int A = analogRead(A1); // for example
if (A>860) {
A = 1000;
} else if (A >= 0) {
A = pgm_read_word_near(results + A);
}
}
This will consume about 1722 bytes of flash, and none of the MCU's precious RAM. It will be quicker for all cases that involve any of those calculations. Actual gains depend on the complexity of the calculations and on the distribution of initial values of A.
Further, if A could be guaranteed to be positive by making A unsigned, one if could be replaced with if (A <= 136) A = 100, which would increase the ratio of calculated to constant values. The value assigned to A there is constant, and hence gets no benefit from pre-calculation.
Edit:
void loop() {
// other stuff
unsigned int A = analogRead(A1); // note *unsigned*
if (A>860) {
A = 1000;
} else if (A <= 136) {
A = 100;
} else {
A = pgm_read_word_near(results + A);
}
}
So this has a better ratio of cached-calculation to constant values than the 'all cached' version. However, it still uses the same amount of flash to avoid using a more complex calculation, and hence more instructions, for the array offset.
The expense of using an array offset calculation for the constant in if (A >= 476 && A <= 524) A = 500; is offset by removing two tests and replacing them with the address calculation, so it is comparable.
On an AVR, a byte load from program memory is one cycle slower than loading from SRAM. So if the table were small enough, and there were enough spare SRAM, using SRAM instead of flash would give a tiny improvement on table access. However, an SRAM table has to be initialised by copying from program flash, so there is a cost to be balanced.
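If a table really were small enough, copying it from flash into SRAM once at start-up might look roughly like this (a sketch only; the table contents here are placeholders, and memcpy_P is the avr-libc helper for copying out of program memory):
#include <avr/pgmspace.h>

/* placeholder table; the real one would hold the pre-calculated results */
static const int16_t results_flash[] PROGMEM = { 100, 100, 100, 100 };
static int16_t results_ram[sizeof(results_flash) / sizeof(results_flash[0])];

void copyTableToRam(void)
{
    /* one-time start-up cost, paid back by slightly faster table reads */
    memcpy_P(results_ram, results_flash, sizeof(results_ram));
}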
Side Note: Unless this code is almost everything in the 'inner loop', it would be better to measure the whole 'inner loop' rather than focus on a single step.
Typically, programmers are very poor at identifying bottlenecks without detailed measurement. Worse, it is easy to waste time on optimisation: this part might not be the bottleneck, or there may be no bottleneck at all. Worse still, optimised code can become quite fragile, and small changes in the required behaviour can make the optimisations wrong.
So it is always worth getting all of the program working correctly before attempting any optimisations, basing optimisation effort on meaningful measurements of the actual program, and bearing in mind Michael A. Jackson's advice on program optimisation:
The First Rule of Program Optimisation: Don't do it.
The Second Rule of Program Optimisation (for experts only!): Don't do it yet.
Post-note: This specific question may be a great example of programmers looking at the wrong performance issue; the if tests are much less costly than the floating point arithmetic. So the question was focused on the wrong place. I'd take that as an indication that the actual performance issues are likely elsewhere.