Programming Optimization
You will probably notice a large slant towards Intel x86 based optimization techniques, which shouldn't surprise many since that is where my background is strongest. On the other hand, I have used various other architectures and have run profilers and debuggers on a variety of non-x86 UNIX boxes, and I have tried to be as general as possible where I can. However, many of the recommendations and techniques may simply not work for your processor or environment. In that event, I should emphasize that first-hand knowledge is always better than following simplistic mantras. I would also appreciate any feedback you might have regarding other platforms or other optimization techniques you have experience with.
I have written up this page for a few reasons: (1) I have seen almost nothing of reasonable quality or depth elsewhere; (2) I hope to get feedback from others who may know a thing or two that I don't; (3) to enrich the quality of the internet; (4) expensive books on this subject have been getting a bit too much attention; and (5) to get more hits on my web page :o)
"None of us learn in a vacuum; we all stand on the shoulders of giants such as Wirth and Knuth and thousands of others. Lend your shoulders to building the future!"
-Michael Abrash
So without further ado, I present to you Programming Optimization ...
Total time = sum over all i of (time of task_i)
where
time of task_i = (amount of work in task_i) / (rate of execution of task_i)
What is typically seen is that Total time is usually dominated by one or very few of the time of task_i terms, because those tasks either are where there is the most work to do or have algorithms complicated enough to give them a low rate of execution. This effect tends to be recursive in the sense of breaking tasks down into sub-tasks, which has led to the creation and adoption of line-by-line execution profilers.
What one should also notice is that if the strategy for improving the performance of code is to improve the efficiency of the code for one particular task_i, it is possible to reach a point of diminishing returns, because each term cannot contribute less than zero to the total time. So an optimization exercise of this sort need not proceed past the point where the time taken by a given task_i is significantly below the average. This usually means that effort is better spent switching to another task_i to optimize.
This, of course, is fine and a credible way of proceeding; however, by itself it leaves out algorithm analysis, as well as the possibility of trying different task breakdowns. The classic example here is that tweaking bubble sort is not going to be as useful as switching to a better algorithm, such as quicksort or heapsort, if you are sorting a lot of data.
One also has to be careful about what actually is being measured. This is epitomized by the old observation that the Unix kernel typically spends most of its time in the idle loop; that doesn't mean there is any useful advantage in optimizing it.
(1) Data bandwidth performance for PC devices is roughly ordered (slowest to fastest): user input device, tape drives, network, CDROM, hard drive, memory mapped local BUS device (graphics memory), uncached main memory, external cached main memory, local/CPU cached memory, local variables (registers.)
The usual approach to improving data bandwidth performance is to use a faster device to cache the data from slower devices, so that redundant accesses to the slow device can be skipped entirely. For example, to know where you are pointing your mouse, your mouse interface is likely to just look up a pair of coordinates saved to memory, rather than poll the mouse directly every time. This general idea is probably what inspired Terje Mathisen (a well-known programming optimization guru) to say: "All programming is an exercise in caching."
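A minimal sketch of that idea: remember the last answer from a slow source so that redundant requests are served from fast memory instead. The function slow_lookup() is a hypothetical stand-in for a slow device or expensive computation.

/* Cache the last result of a slow lookup so redundant accesses are cheap. */
static int slow_lookup(int key) {
    return key * key;              /* pretend this hits a slow device */
}

int cached_lookup(int key) {
    static int last_key = -1;      /* assumes -1 is never a real key */
    static int last_value;

    if (key != last_key) {         /* miss: go to the slow source once */
        last_value = slow_lookup(key);
        last_key = key;
    }
    return last_value;             /* hit: no slow access at all */
}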
(2) Arithmetic operation performance is ordered roughly: transcendental functions, square root, modulo, divide, multiply, add/subtract/multiply by power of 2/divide by power of 2/modulo by a power of 2.
Nearly all programs have to do some rudimentary calculations of some kind. Simplifying your formulae to use faster functions is a very typical optimization opportunity. The integer versus floating point differences are usually very platform dependent, but in modern microprocessors the performance of each is usually quite similar. Knowing how data bandwidth and calculations perform relative to each other is also very important. For example, using tables to avoid certain recalculations is often a good idea, but you have to weigh data access speed against recalculation; large tables will typically be uncached and may perform even slower than a CPU divide.
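A small sketch of moving down that ordering: for unsigned operands, a divide, modulo or multiply by a power of two reduces to a shift or a mask, the cheap end of the list above. Compilers will usually do this for you, but only if you arrange for the divisor to be a power of two in the first place.

/* Strength reduction: power-of-two divide and modulo become shift and mask. */
unsigned wrap_index(unsigned i) {
    return i & (256u - 1u);        /* same result as i % 256, without the modulo */
}
unsigned scale_down(unsigned x) {
    return x >> 4;                 /* same result as x / 16, without the divide */
}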
(3) Control flow is ordered roughly by: indirect function calls, switch() statements, fixed function calls, if() statements, while() statements.
But a larger additional concern with modern pipelined processors is the "predictability" of a control transfer. Modern processors essentially guess the direction of a control transfer, and back out if and when they realize that they are wrong (throwing away whatever incorrect work they have done.) Incorrect predictions are very costly. As such, one must be aware of arithmetically equivalent ways of performing conditional computation.
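A sketch of such an arithmetically equivalent form: counting the negative entries of an array with no branch in the loop body. This form costs the same whether the signs are predictable or random (it assumes a 32-bit two's complement int).

/* Branch-free counting of negative values. */
int count_negative(const int *a, int n) {
    int count = 0;
    for (int i = 0; i < n; i++)
        count += (int)((unsigned)a[i] >> 31);  /* 1 if a[i] < 0, else 0 */
    return count;
}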
Action video games are multimedia applications. They require graphics, sound, possibly a network connection, and user input. In a PC, the hardware devices that support these features are completely external to the CPU. In fact, the algorithmic issues of artificial intelligence, collision detection, keeping score, and keeping track of time are ordinarily a very low performance hit on a good CPU (like a Pentium.) Understanding your multimedia devices will be far more helpful than knowing how to optimize every cycle out of your CPU. This naturally leads to the recommendation of using a compiler for a high-level language like C, and using flexible device libraries.
Designing the interaction with the hardware devices will be much more important in terms of the quality of your game, in more ways than just speed. The wide variety of target hardware devices on your audience's PCs can make this a complicated problem. However, it is a good bet that using asynchronous APIs will be the most important consideration. For example, using potentially available coprocessors for parallel/accelerated graphics rendering is likely to be a better alternative to CPU based pixel filling. If your audio card can use large buffered sound samples, you will again be better off than hand-holding tiny wave buffers. User input should be handled by a state mechanism rather than by polling until some desired state occurs (DOOM versus NetHack.) Relatively speaking, the network ordinarily runs at such a slow pace as to justify running a separate thread to manage it. Again, state based network management is better than polling based.
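A sketch of state-driven (rather than polled) user input, in the spirit of the DOOM-versus-NetHack remark: an asynchronous handler records key state as events arrive, and the game loop just reads the table. The event hook and the key codes are hypothetical stand-ins for whatever your input API provides.

#include <stdbool.h>

static bool key_down[256];                      /* current state of each key */

void on_key_event(int keycode, bool pressed) {  /* called by the input API */
    key_down[keycode & 0xFF] = pressed;
}

void game_tick(void) {
    if (key_down['W']) { /* move_forward(); */ }
    if (key_down[' ']) { /* fire();         */ }
    /* no blocking, and no polling of the hardware itself */
}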
But what I saw did not impress me. I found I had plenty of room to work on optimizing the original source code before making special assembly language optimizations. I was able to apply many common tricks that the compiler could not find: obvious sub-expression elimination, hoisting (factoring), tail recursion elimination, and impossible condition dead code elimination.
Well, in any event, the program stood to gain from high-level improvements that proved to be a valuable aid to the low-level improvements added afterward via analyzing the compiler output. Had I done this in the reverse order, I would have duplicated massive amounts of inefficiencies that I would have spent much longer isolating, or worse may have missed entirely.
This is what I am talking about:
/* Set our cap according to some variable condition */
lpDRV->dpCAPS |= x;
...
/* This is an important inner loop */
for(i=0; i<n; i++)
{
    if( (lpDRV->dpCAPS) & CLIPPING )
    {
        DoOneThing(i);
    }
    else if( (lpDRV->dpCAPS) & ALREADYCULLED )
    {
        DoSomethingElse(i);
    }
    else
    {
        DoYetAnotherThing(i);
    }
}
Now, assuming that there are no side effects which modify the lpDRV->dpCAPS variable within the DoXXX() functions, doesn't the above code scream at you to do some basic control transformations? I mean, hoist the damn "if" statements to the outside of the loop, for crying out loud!
/* Set our cap according to some variable condition */
lpDRV->dpCAPS |= x;
...
/* This is an important inner loop */
if( (lpDRV->dpCAPS) & CLIPPING )
{
    for(i=0; i<n; i++)
    {
        DoOneThing(i);
    }
}
else if( (lpDRV->dpCAPS) & ALREADYCULLED )
{
    for(i=0; i<n; i++)
    {
        DoSomethingElse(i);
    }
}
else
{
    for(i=0; i<n; i++)
    {
        DoYetAnotherThing(i);
    }
}
Now, let's do the math. How much are the if statements costing us? In the first loop it's n times the overhead for all the if-else logic. In the second loop it's 1 times the overhead for all the if-else logic. So in the cases where n<1 the company in Redmond wins, but in the cases where n>1 I win. This is not rocket science. I swear, if that enormous Redmond-based company only understood really basic elementary kindergarten optimizations like this, the world would be a much better place.
float x[MAX_ENTRIES];
...
if( x[i] ) {
....
}
Well, unfortunately, the compiler doesn't know what I know: that all x86 processors (this was originally written before the Athlon or Pentium 4 processors -- certainly on the Athlon the technique used below will make no difference in performance) are very slow at converting floating point data types into integer data types, including the condition codes which are required for branching.
Fortunately, memory is quickly accessible from both the floating point and integer units. So, relying on x being in memory, what I needed to do was match the bit pattern of each x[i] against 0. There is a slight wrinkle here in that the IEEE specification says that both 0 and -0 can be distinctly represented, but for all computation purposes they must behave identically.
Well, I don't want to get into the details of the IEEE spec here, so I'll just show you the improvement:
float x[MAX_ENTRIES];
long temp;
...
temp = *((long *)&(x[i]));
if( temp + temp ) {
....
}
This idea is non-portable; it's a 32-bit IEEE-specific trick. But for the substantial performance increase it gives, it is impossible to ignore.
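One caveat worth adding for modern compilers: the pointer cast above can fall foul of strict aliasing rules. A sketch of the same bit-pattern test written with memcpy (which compilers turn into a single 32-bit load), still assuming 32-bit IEEE floats:

#include <string.h>
#include <stdint.h>

static int float_is_nonzero(const float *p) {
    uint32_t bits;
    memcpy(&bits, p, sizeof bits);   /* reinterpret the float without an aliasing cast */
    return (bits + bits) != 0;       /* adding the value to itself shifts out the sign
                                        bit, so both +0.0 and -0.0 test as zero */
}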
Here's one sent in to me from "cdr":
Memory in modern computers is most definitely NOT "random access".
(1) If you don't optimize your code, but your competition does, then they end up with a faster solution, independent of hardware improvements. (2) Expectations of hardware performance improvements almost always exceed reality. PC disk performance has remained relatively stable for the past 5 years, and memory technology has been driven more by quantity than performance. (3) Hardware improvements often only target optimal software solutions.
This is not always true. Remember that in recalculating, you have the potential of using parallelism and incremental calculation with the right formulations. Tables that are too large will not fit in your cache and hence may be very slow to access. If your table is multi-indexed, then there is a hidden multiply, which can be costly if the stride is not a power of 2. On a fast Pentium, an uncached memory access can take more time than a divide.
This is something some people can only be convinced of by experience. The cogitations and bizarre operator fumbling that you can do in the C language convince many that they are there for the purposes of optimization. Modern C compilers usually unify C's complex operators, semantics and syntax into a much simpler format before proceeding to the optimization and code generation phase. ANSI C only addresses a subset of your computer's capabilities and is in and of itself too generic in specification to take advantage of all of your processor's nuances. Remember that ANSI C does not know the difference between cached and uncached memory. Also, many arithmetic redundancies allow for usage of processor features that C compilers to date have not yet mastered (e.g., there are clever tricks for host based texture mapping if the stride of the source texture is a power of two, or in particular 256 on an x86.)
Unfortunately, many research-based and workstation programmers, as well as professors of higher education, who might even know better, have taken it upon themselves to impress upon newbie programmers that assembly language programming should be avoided at all costs, all in the name of maintainability and portability. A blatant example of this can be seen in the POV-Ray FAQ, which outright claims that there is no benefit to be had in attempting assembly language optimizations. (I wouldn't be surprised if you couldn't simply low-level optimize POV-Ray, change the interface, and turn around and sell the darn thing!) The fact is, low-level optimization has its place and should only be passed by if there is a conflicting requirement (like portability), there is no need, or there are no resources to do it. For more, see High level vs. Low level below.
Most C compilers come with an "inline assembly" feature that allows you to roll your own opcodes. Most also come with linkers that allow you to link completely external assembly modules. Of course not all C compilers are created equal and the effects of mixing C and assembly will vary depending on the compiler implementation. (Example: WATCOM and DJGPP mix ASM in very smoothly, whereas VC++ and Borland do not.)
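As a tiny illustration, here is a sketch assuming a GCC-compatible compiler on x86 (WATCOM, VC++ and Borland each use their own inline assembly syntax, so treat this as one example of the idea rather than a universal recipe):

#include <stdint.h>

/* Reverse the byte order of a 32-bit word with one instruction. */
static uint32_t byte_swap(uint32_t x) {
    __asm__("bswap %0" : "+r"(x));   /* GCC extended asm: x is read and
                                        written back in a general register */
    return x;
}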
Modern C compilers will do a reasonable job if they are given assistance. I usually try to break my inner loops down into the most basic expressions possible, as close to low-level assembly as I can get, without resorting to inline assembly. Again, your results will vary from compiler to compiler. (The WATCOM C/C++ compiler can be helped significantly with this sort of approach.)
This method replaces each bitmap graphics source data word with a specific CPU instruction to store it straight to graphics memory. The problem with it is that it chews up large amounts of instruction cache space. This is to be compared against a data copying routine which needs to read the source data from memory (and typically caches it.) Both use lots of cache space, but the compiled bitmap method uses far more, since it must encode a CPU store command for each source data word.
Furthermore, CPU performance is usually more sensitive to instruction cache performance than it is to data cache performance. The reason is that data manipulations and resource contentions can be managed by write buffers and modern CPU's ability to execute instructions out of order. With instruction data, if they are not in the cache, they must be prefetched, paying non-overlapping sequential penalties whenever the pre-fetch buffer runs out.
On older x86's this method worked well because the instruction prefetch penalties were paid on a per instruction basis regardless (there was no cache to put them into!) But starting with the 486, this was no longer a sensible solution since short loops paid no instruction prefetch penalties, which rendered the compiled bitmap technique completely useless.
This keyword is a complete placebo in most modern C compilers. Keep in mind that K&R and the ANSI committee did not design the C language to embody all of the performance characteristics of your CPU. The bulk of the burden of optimizing your C source is in the hands of your compiler's optimizer, which will typically have its own ideas about what variables should go where. If you are interested in the kind of optimization available from hand-assigning variables to registers, you are better off going to hand-rolled assembly rather than relying on these kinds of language features.
(Addenda: The only real purpose of "register" is to assert to the compiler that an auto variable is never addressed and therefore can never alias with any pointer. While this might be able to assist the compiler's optimizer, a good optimizing compiler is more than capable of deducing this feature of a local by itself.)
Most modern C compilers will alias local variables to your CPU's registers or to SRAM. Furthermore, if all variables in a given scope are local, then an optimizing compiler can forgo maintaining the variables outside that scope, and therefore has more simplification opportunities than with globals. So, in fact, you should find that the opposite tends to be true more of the time.
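A small sketch of why this is so: a global (or anything the compiler must assume can be aliased) has to be re-read and re-written conservatively, while a local copy can simply live in a register for the duration of the loop. The names here are hypothetical.

extern int error_count;                  /* global: the compiler must be cautious */

void scan(const int *data, int n) {
    int errors = error_count;            /* work on a local copy ...             */
    for (int i = 0; i < n; i++) {
        if (data[i] < 0) errors++;       /* ... which can stay in a register     */
    }
    error_count = errors;                /* write the global back just once      */
}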
The original reason int was put into the C language was so that the fastest data type on each platform remained abstracted away from the programmer himself. On modern 32 and 64 bit platforms, small data types like chars and shorts actually incur extra overhead when converting to and from the default machine word sized data type.
On the other hand, one must be wary of cache usage. Using packed data (and in this vein, small structure fields) for large data objects may pay larger dividends in global cache coherence, than local algorithmic optimization issues.
Most modern CPUs have a separate floating point unit that will execute in parallel to the main/integer unit. This means that you can simultaneously do floating point and integer calculations. While many processors can perform high throughput multiplies (the Pentium being an exception) general divides and modulos that are not a power of two are slow to execute (from Cray Super Computers right on down to 6502's; nobody has a really good algorithm to perform them in general.) Parallelism (via the usually undertaxed concurrent floating point units in many processors) and redundancy are often better bets than going to fixed point.
On the redundancy front, if you are dividing or calculating a modulo and if you know the divisor is fixed, or one of only a few possible fixed values there are ways to exploit fast integer (aka fixed point) methods.
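For example, division by a known constant can be turned into a multiply by a scaled reciprocal followed by a shift. A sketch: the constant 52429 is the rounded-up value of 2^19/10, and the identity below is exact for 16-bit unsigned inputs (where the intermediate product still fits in 32 bits).

#include <stdint.h>

static uint32_t div10(uint16_t n) {
    return ((uint32_t)n * 52429u) >> 19;   /* exact n/10 for 0 <= n <= 65535 */
}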
On the Pentium, the biggest concern is moving data around to and from the FPU and the main integer unit. Optimizing FPU usage takes careful programming; no x86 compiler I have seen does a particularly good job of this. To exploit maximum optimization potential, you are likely going to have to go to assembly language. As a rule of thumb: if you need many simple results as fast as possible, use fixed point; if you need only a few complicated results, use floating point. See Pentium Optimizations by Agner Fog for more information.
With the introduction of AMD's 3DNOW! SIMD floating point technology, these older rules about floating point performance have been turned upside down. Approximate (14/15 bits) divides, or reciprocal square roots can be computed in a single clock. Two multiplies and two adds can also be computed per clock allowing better than 1 gigaflop of peak performance. We are now at a point in the industry where floating point performance is truly matching the integer performance. With such technologies the right answer is to use the data type format that most closely matches its intended meaning.
You usually get a lot more mileage out of optimizing your code at a high level (not meaning to imply that you need a HLL to do this) first. At the very least, changes to the high level source will tend to affect more target code at one time than what you will be able to do in assembly language with the same effort. In more extreme cases, such as exponential (usually highly recursive) algorithms, thorough hand optimization often buys you significantly less than good up front design.
The cycle counts given in processor instruction lists are usually misleading about the real cycle expenditure of your code. They usually ignore the wait states required to access memory, or other devices (that usually have their own independent clocks.) They are also typically misleading with regards to hidden side effects of branch targets, pipelining and parallelism.
The Pentium can take up to 39 clocks to perform a floating point divide. However, the instruction can be issued in one clock, 39 clocks of integer calculations can then be done in parallel, and the result of the divide retrieved in another clock. That is about 41 clocks in total, but only two of those clocks are actually spent issuing and retrieving the results of the divide itself.
In fact, all modern x86 processors have internal clock timers that can be used to assist in getting real timing results, and Intel recommends that programmers use them to get accurate timing results. (See the RDTSC instruction as documented in Intel's processor architectural manuals.)
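A minimal sketch of using it, assuming a GCC or Clang style compiler on x86 (the __rdtsc() intrinsic comes from <x86intrin.h>; MSVC provides the same intrinsic via <intrin.h>). Out-of-order execution and clock throttling add noise, so take the best of several runs.

#include <stdint.h>
#include <x86intrin.h>

extern void code_under_test(void);       /* hypothetical routine to measure */

uint64_t cycles_for_code(void) {
    uint64_t start = __rdtsc();          /* read the time stamp counter */
    code_under_test();
    return __rdtsc() - start;            /* elapsed counter ticks */
}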
The benefits of assembly over C are the same under Windows or Linux as they are under DOS. This delusion doesn't have anything close to logic backing it up, and therefore doesn't deserve much comment.
See Iczelion's Win32 Assembly Home Page if you don't believe me.
This is not a technical belief -- it's a marketing one. It's one often heard from the folks that live in Redmond, WA. Even to the degree that it is true (in a very large software project, for example), it ignores the fact that optimal performance can be approached asymptotically with a finite, and usually acceptable, amount of effort. Using proper profiling and benchmarking, one can iteratively grab the "low hanging fruit", which will get most of the available performance.
Absolute optimization is also not a completely unattainable goal. Understanding the nature of your task, and bounding it by its input/output performance and the best possible algorithm in the middle, is in many cases quite doable.
For example, reading a file from disk, sorting its contents, and writing the result back out ought to be a very doable performance optimization exercise. (The input/output performance is known, and the algorithm in the middle is approachable by considering the nature of the input and going with a standard algorithm such as heap sort or radix sort.)
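A sketch of that exercise, assuming one integer per line on standard input, with error handling stripped down so the shape stays visible:

#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void) {
    int *v = NULL, n = 0, cap = 0, x;
    while (scanf("%d", &x) == 1) {          /* read */
        if (n == cap) {
            cap = cap ? cap * 2 : 1024;
            v = realloc(v, cap * sizeof *v);
        }
        v[n++] = x;
    }
    qsort(v, n, sizeof *v, cmp_int);        /* sort: O(n log n) */
    for (int i = 0; i < n; i++)
        printf("%d\n", v[i]);               /* write */
    free(v);
    return 0;
}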
Of course, the degree of analysis that you can apply to your specific problem will vary greatly depending on its nature. My main objection to this misconception is just that it cannot be applied globally.
Before:
if( Condition ) {
    Case A;
} else {
    Case B;
}
After:
Case B;
if( Condition ) {
    Undo Case B;
    Case A;
}
Before:
for(i=0;i<10;i++) {
    printf("%d\n",i*10);
}
After:
for(i=0;i<100;i+=10) {
    printf("%d\n",i);
}
Before:
char playfield[80][25];
After:
char playfield[80][32];
Before:
for(i=0;i<100;i++) {
    map[i].visited = 0;
}
After:
i=99;
do {
    map[i].visited = 0;
    i--;
} while(i>=0);
Before:
char x; int y; y = x;
After:
int x, y; y = x;
Before:
void swap(int *x, int *y) {
    int t;
    t = *y;
    *y = *x;
    *x = t;
}
After:
static void swap(int *x, int *y) {
    int t;
    t = *y;
    *y = *x;
    *x = t;
}
Before:
float f[100],sum;
...
/* Back to back dependent
   adds force the throughput
   to be equal to the total
   FADD latency. */
sum = 0;
for(i=0;i<100;i++) {
    sum += f[i];
}
After:
float f[100],sum0,sum1;
float sum2,sum3,sum;
...
/* Optimized for a 4-cycle
   latency fully pipelined
   FADD unit. The throughput
   is one add per clock. */
sum0 = sum1 = sum2 = sum3 = 0;
for(i=0;i<100;i+=4) {
    sum0 += f[i];
    sum1 += f[i+1];
    sum2 += f[i+2];
    sum3 += f[i+3];
}
sum = (sum0+sum1) + (sum2+sum3);
Also, in theory, a SIMD enabled compiler could direct the above to a SIMD instruction set (such as 3DNow!, SSE or AltiVec.) I definitely have not seen this yet.
Strictly for beginners
Techniques you might not be aware of if you have not been programming for the past 15 years of your life:
vincent@realistix.com says...
> Hi....I've got a question on optimizing critical loops.....
> In the context of C/C++, which is more expensive? IF statements or
> function calls? I need to choose between the two because of the
> rendering options that I pass to the rendering loop.
The ordering (fastest first) is roughly:
1. Well predicted if() statements (that means comparisons that follow either a very simple short pattern, or are heavily weighted, i.e., 90% or more, to one result over the other.)
2. Fixed function calls. The address does not change. A "static" function call will usually be faster, and the fewer parameters the better (though if declared static, sometimes this doesn't matter.)
3. Indirect function calls that have a high probability of going to the same address that they went to last time. Modern CPUs will predict that the indirect address will be the same as last time.
...
4. Non-predictable if() statements. This is when the pattern of the results of the comparison in the if() statement is not cyclical, and sways back and forth in a non-simplistic pattern. The penalty for this is *MUCH* higher than anything listed to this point so far. A CPU like the Athlon or P-!!! throws out about 10-18 clocks of work every time it guesses wrong (which you can expect to be about 50% of the time).
5. Changing indirect function calls. This is when the indirect function call target changes on nearly every call. The penalty is basically the same as a non-predictable branch, plus a few extra clocks (since it takes longer to fetch an address than to compute a comparison.)
---
The first three have relatively low clock counts and allow the CPU to perform parallel work if there are any opportunities. The last two lead the CPU to speculatively perform wrong work that needs to be undone (this is usually just a few clocks of additional overhead). The major point is that the CPU waits roughly the length of its pipeline before it can perform real work.
One way to work around a situation such as 4/5 is that if there are only two cases, you actually perform the work for both cases, and use a predicate mask to select between the answers. For example, to implement max, min and abs functions on an x86 compiler:
static int int_max(int a, int b) {
    b = a-b;
    a -= b & (b>>31);
    return a;
}
static int int_abs(int a) {
    return a - ((a+a) & (a>>31));
}
static int int_min(int a, int b) {
    b = b-a;
    a += b & (b>>31);
    return a;
}
Notice that there is no branching. Thus the predictability of whether one result or the other is chosen does not factor into the performance (no branch calculation is performed.)
--
Paul Hsieh
http://www.pobox.com/~qed/optimize.html
Chris Lomont wrote:
: Where do you get your ideas about experienced coders? On RISC chips
: and many of the pipelined, superscalar CPU's a good compiler beats out
: most ASM programmers due to the complex issues for instruction
: pairing ...
Most assembly programmers are neither knowledgeable nor good, but those who are knowledgeable and good have little trouble beating compilers, even on true RISC CPUs. A friend had an algorithm reimplemented (I'll redundantly stress that the algorithm was not changed) in PowerPC assembly by an Apple engineer. The result was a pixel doubling blitter that ran 3 to 5 times faster. I've personally seen a graphics intensive game gain 5 more frames per second (25, up from 20, on a P5-100) by replacing low level graphics code compiled by VC++ 5 with handwritten assembly. Again, same algorithm, different implementation.
: ... Your hand crafted 486 asm will probably run dog slow on later
: processors with different pairing rules, while a simple recompile on a
: decent PII compiler will make that C code surpass your old asm ...
You overstate the risk. Coincidentally, I recently recompiled some code with VC++ 5, optimizing for both Pentium and PentiumPro. Comparing some 3D point transformations coded in C and in Pentium assembly, the assembly code ran 30% faster than the Pentium optimized C code on a P5-100 and 12% faster than the PentiumPro optimized C code on a PII-300. While the older architecture's assembly code showed less of an improvement on the newer architecture, it was still faster, and using such old code would not be the liability you suggest. Perhaps when spanning more than two architectural generations, as in your 486 to PentiumPro example, there would be such a problem. But given the lifespan of typical products, especially games, that is not worth worrying about.
BTW, I'm not advocating that everything should be done in assembly. Just that claiming C compilers are hard to beat is naive, and that assembly is often a good solution at times.
For more, see Randall Hyde's Great Debate page. © Copyright 1996-2016, Paul Hsieh All Rights Reserved.