Programming Optimization
You will probably notice a large slant towards Intel x86 based optimization techniques, which shouldn't surprise many since that is where my background is strongest. On the other hand, I have used various other architectures and have run profilers and debuggers on a variety of non-x86 UNIX boxes, and I have tried to be as general as possible where I can. However, many of the recommendations and techniques may simply not work for your processor or environment. In that event, I should emphasize that first-hand knowledge is always better than following simplistic mantras. I would also appreciate any feedback you might have regarding other platforms or other optimization techniques you have experience with.
I have written up this page for a few reasons: (1) I have seen almost nothing of reasonable quality or depth elsewhere; (2) I hope to get feedback from others who may know a thing or two that I don't; (3) to enrich the quality of the internet; (4) expensive books on this subject have been getting a bit too much attention; and (5) to get more hits on my web page :o)
"None of us learn in a vacuum; we all stand on the shoulders of giants such as Wirth and Knuth and thousands of others. Lend your shoulders to building the future!"
-Michael Abrash
So without further ado, I present to you Programming Optimization ...
Total time = sum over all i of (time of task_i)
where
time of task_i = (amount of work in task_i) / (rate of execution of task_i)
What is typically seen is that Total time is usually dominated by one or very few of the time of task_i terms, because those tasks either are where there is the most work to do or have algorithms complicated enough to give them a low rate of execution. This effect tends to be recursive in the sense of breaking tasks down into sub-tasks, which has led to the creation and adoption of line-by-line execution profilers.
What one should also notice is that if the strategy for improving the performance of code is to improve the efficiency of the code for one particular task_i, it is possible to reach a point of diminishing returns, because each term cannot contribute less than zero to the total time. So an optimization exercise of this sort need not proceed past the point where the time taken by a given task_i is significantly below the average. This usually means that effort is better spent switching to another task_i to optimize.
This, of course, is fine and a credible way of proceeding; however, by itself it leaves out algorithm analysis, as well as the possibility of trying different task breakdowns. The classic example here is that tweaking bubble sort is not going to be as useful as switching to a better algorithm, such as quicksort or heapsort, if you are sorting a lot of data.
One also has to be careful about what actually is being measured. This is epitomized by the old observation that the Unix kernel typically spends most of its time in the idle loop; that doesn't mean there is any useful advantage in optimizing it.
(1) Data bandwidth performance for PC devices is roughly ordered (slowest to fastest): user input device, tape drives, network, CDROM, hard drive, memory mapped local BUS device (graphics memory), uncached main memory, external cached main memory, local/CPU cached memory, local variables (registers.)
The usual approach to improving data bandwidth performance is to use a faster device to cache the data from slower devices, so that redundant accesses to the slow device can be skipped entirely. For example, to know where you are pointing your mouse, your mouse interface is likely to just look up a pair of coordinates saved to memory, rather than poll the mouse directly every time. This general idea is probably what inspired Terje Mathisen (a well-known programming optimization guru) to say: "All programming is an exercise in caching."
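A minimal sketch of that idea: remember the last answer from a slow source so that redundant requests are served from fast memory instead. The function slow_lookup() is a hypothetical stand-in for a slow device or expensive computation.

/* Cache the last result of a slow lookup so redundant accesses are cheap. */
static int slow_lookup(int key) {
    return key * key;              /* pretend this hits a slow device */
}

int cached_lookup(int key) {
    static int last_key = -1;      /* assumes -1 is never a real key */
    static int last_value;

    if (key != last_key) {         /* miss: go to the slow source once */
        last_value = slow_lookup(key);
        last_key = key;
    }
    return last_value;             /* hit: no slow access at all */
}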
(2) Arithmetic operation performance is ordered roughly: transcendental functions, square root, modulo, divide, multiply, add/subtract/multiply by power of 2/divide by power of 2/modulo by a power of 2.
Nearly all programs have to do some rudimentary calculations of some kind. Simplifying your formulae to use faster functions is a very typical optimization opportunity. The integer versus floating point differences are usually very platform dependent, but in modern microprocessors the performance of each is usually quite similar. Knowing how data bandwidth and calculations perform relative to each other is also very important. For example, using tables to avoid certain recalculations is often a good idea, but you have to weigh data access speed against recalculation; large tables will typically be uncached and may perform even slower than a CPU divide.
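A small sketch of moving down that ordering: for unsigned operands, a divide, modulo or multiply by a power of two reduces to a shift or a mask, the cheap end of the list above. Compilers will usually do this for you, but only if you arrange for the divisor to be a power of two in the first place.

/* Strength reduction: power-of-two divide and modulo become shift and mask. */
unsigned wrap_index(unsigned i) {
    return i & (256u - 1u);        /* same result as i % 256, without the modulo */
}
unsigned scale_down(unsigned x) {
    return x >> 4;                 /* same result as x / 16, without the divide */
}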
(3) Control flow is ordered roughly by: indirect function calls, switch() statements, fixed function calls, if() statements, while() statements.
But a larger additional concern with modern pipelined processors is the "predictability" of a control transfer. Modern processors essentially guess the direction of a control transfer, and back out if and when they realize that they are wrong (throwing away whatever incorrect work they have done.) Incorrect predictions are very costly. As such, one must be aware of arithmetically equivalent ways of performing conditional computation.
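A sketch of such an arithmetically equivalent form: counting the negative entries of an array with no branch in the loop body. This form costs the same whether the signs are predictable or random (it assumes a 32-bit two's complement int).

/* Branch-free counting of negative values. */
int count_negative(const int *a, int n) {
    int count = 0;
    for (int i = 0; i < n; i++)
        count += (int)((unsigned)a[i] >> 31);  /* 1 if a[i] < 0, else 0 */
    return count;
}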
Action video games are multimedia applications. They require graphics, sound, possibly a network connection, and user input. In a PC, the hardware devices that support these features are completely external to the CPU. In fact, the algorithmic issues of artificial intelligence, collision detection, keeping score, and keeping track of time are ordinarily a very low performance hit on a good CPU (like a Pentium.) Understanding your multimedia devices will be far more helpful than knowing how to optimize every cycle out of your CPU. This naturally leads to the recommendation of using a compiler for a high-level language like C, and using flexible device libraries.
Designing the interaction with the hardware devices will be much more important in terms of the quality of your game, in more ways than just speed. The wide variety of target hardware devices on your audience's PCs can make this a complicated problem. However, it is a good bet that using asynchronous APIs will be the most important consideration. For example, using potentially available coprocessors for parallel/accelerated graphics rendering is likely to be a better alternative to CPU based pixel filling. If your audio card can use large buffered sound samples, you will again be better off than hand-holding tiny wave buffers. User input should be handled by a state mechanism rather than by polling until some desired state occurs (DOOM versus NetHack.) Relatively speaking, the network ordinarily runs at such a slow pace as to justify running a separate thread to manage it. Again, state based network management is better than polling based.
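A sketch of state-driven (rather than polled) user input, in the spirit of the DOOM-versus-NetHack remark: an asynchronous handler records key state as events arrive, and the game loop just reads the table. The event hook and the key codes are hypothetical stand-ins for whatever your input API provides.

#include <stdbool.h>

static bool key_down[256];                      /* current state of each key */

void on_key_event(int keycode, bool pressed) {  /* called by the input API */
    key_down[keycode & 0xFF] = pressed;
}

void game_tick(void) {
    if (key_down['W']) { /* move_forward(); */ }
    if (key_down[' ']) { /* fire();         */ }
    /* no blocking, and no polling of the hardware itself */
}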
But what I saw did not impress me. I found I had plenty of room to work on optimizing the original source code before making special assembly language optimizations. I was able to apply many common tricks that the compiler could not find: obvious sub-expression elimination, hoisting (factoring), tail recursion elimination, and impossible condition dead code elimination.
Well, in any event, the program stood to gain from high-level improvements that proved to be a valuable aid to the low-level improvements added afterward via analyzing the compiler output. Had I done this in the reverse order, I would have duplicated massive amounts of inefficiencies that I would have spent much longer isolating, or worse may have missed entirely.
This is what I am talking about:
/* Set our cap according to some variable condition */
lpDRV->dpCAPS |= x;
...
/* This is an important inner loop */
for(i=0; i<n; i++)
{
    if( (lpDRV->dpCAPS) & CLIPPING )
    {
        DoOneThing(i);
    }
    else if( (lpDRV->dpCAPS) & ALREADYCULLED )
    {
        DoSomethingElse(i);
    }
    else
    {
        DoYetAnotherThing(i);
    }
}
Now, assuming that there are no side effects which modify the lpDRV->dpCAPS variable within the DoXXX() functions, doesn't the above code scream at you to do some basic control transformations? I mean, hoist the damn "if" statements to the outside of the loop, for crying out loud!
/* Set our cap according to some variable condition */
lpDRV->dpCAPS |= x;
...
/* This is an important inner loop */
if( (lpDRV->dpCAPS) & CLIPPING )
{
    for(i=0; i<n; i++)
    {
        DoOneThing(i);
    }
}
else if( (lpDRV->dpCAPS) & ALREADYCULLED )
{
    for(i=0; i<n; i++)
    {
        DoSomethingElse(i);
    }
}
else
{
    for(i=0; i<n; i++)
    {
        DoYetAnotherThing(i);
    }
}
Now, let's do the math. How much are the if statements costing us? In the first loop it's n times the overhead for all the if-else logic. In the second loop it's 1 times the overhead for all the if-else logic. So in the cases where n<1 the company in Redmond wins, but in the cases where n>1 I win. This is not rocket science. I swear, if that enormous Redmond-based company only understood really basic elementary kindergarten optimizations like this, the world would be a much better place.
float x[MAX_ENTRIES];
...
if( x[i] ) {
....
}
Well, unfortunately, the compiler doesn't know what I know: that all x86 processors (this was originally written before the Athlon or Pentium 4 processors -- certainly on the Athlon the technique used below will make no difference in performance) are very slow at converting floating point data types into integer data types, including the condition codes which are required for branching.
Fortunately, memory is quickly accessible from both the floating point and integer units. So, relying on x being in memory, what I needed to do was match the bit pattern of each x[i] against 0. There is a slight wrinkle here in that the IEEE specification says that both 0 and -0 can be distinctly represented, but for all computation purposes they must behave identically.
Well, I don't want to get into the details of the IEEE spec here, so I'll just show you the improvement:
float x[MAX_ENTRIES];
long temp;
...
temp = *((long *)&(x[i]));
if( temp + temp ) {
....
}
This idea is non-portable; it's a 32-bit IEEE-specific trick. But for the substantial performance increase it gives, it is impossible to ignore.
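One caveat worth adding for modern compilers: the pointer cast above can fall foul of strict aliasing rules. A sketch of the same bit-pattern test written with memcpy (which compilers turn into a single 32-bit load), still assuming 32-bit IEEE floats:

#include <string.h>
#include <stdint.h>

static int float_is_nonzero(const float *p) {
    uint32_t bits;
    memcpy(&bits, p, sizeof bits);   /* reinterpret the float without an aliasing cast */
    return (bits + bits) != 0;       /* adding the value to itself shifts out the sign
                                        bit, so both +0.0 and -0.0 test as zero */
}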
Here's one sent in to me from "cdr":
Memory in modern computers is most definitely NOT "random access".
(1) If you don't optimize your code, but your competition does, then they end up with a faster solution, independent of hardware improvements. (2) Expectations of hardware performance improvements almost always exceed reality. PC disk performance has remained relatively stable for the past 5 years, and memory technology has been driven more by quantity than performance. (3) Hardware improvements often only target optimal software solutions.
This is not always true. Remember that in recalculating, you have the potential of using parallelism and incremental calculation with the right formulations. Tables that are too large will not fit in your cache and hence may be very slow to access. If your table is multi-indexed, then there is a hidden multiply, which can be costly if the stride is not a power of 2. On a fast Pentium, an uncached memory access can take more time than a divide.
This is something some people can only be convinced of by experience. The cogitations and bizarre operator fumbling that you can do in the C language convince many that they are there for the purposes of optimization. Modern C compilers usually unify C's complex operators, semantics and syntax into a much simpler format before proceeding to the optimization and code generation phase. ANSI C only addresses a subset of your computer's capabilities and is in and of itself too generic in specification to take advantage of all of your processor's nuances. Remember that ANSI C does not know the difference between cached and uncached memory. Also, many arithmetic redundancies allow for usage of processor features that C compilers to date have not yet mastered (e.g., there are clever tricks for host based texture mapping if the stride of the source texture is a power of two, or in particular 256 on an x86.)
Unfortunately, many research-based and workstation programmers, as well as professors of higher education, who might even know better, have taken it upon themselves to impress upon newbie programmers that assembly language programming should be avoided at all costs, all in the name of maintainability and portability. A blatant example of this can be seen in the POV-Ray FAQ, which outright claims that there is no benefit to be had in attempting assembly language optimizations. (I wouldn't be surprised if you couldn't simply low-level optimize POV-Ray, change the interface, and turn around and sell the darn thing!) The fact is, low-level optimization has its place and should only be passed by if there is a conflicting requirement (like portability), there is no need, or there are no resources to do it. For more, see High level vs. Low level below.
Most C compilers come with an "inline assembly" feature that allows you to roll your own opcodes. Most also come with linkers that allow you to link completely external assembly modules. Of course not all C compilers are created equal and the effects of mixing C and assembly will vary depending on the compiler implementation. (Example: WATCOM and DJGPP mix ASM in very smoothly, whereas VC++ and Borland do not.)
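As a tiny illustration, here is a sketch assuming a GCC-compatible compiler on x86 (WATCOM, VC++ and Borland each use their own inline assembly syntax, so treat this as one example of the idea rather than a universal recipe):

#include <stdint.h>

/* Reverse the byte order of a 32-bit word with one instruction. */
static uint32_t byte_swap(uint32_t x) {
    __asm__("bswap %0" : "+r"(x));   /* GCC extended asm: x is read and
                                        written back in a general register */
    return x;
}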
Modern C compilers will do a reasonable job if they are given assistance. I usually try to break my inner loops down into the most basic expressions possible, as close to low-level assembly as I can get, without resorting to inline assembly. Again, your results will vary from compiler to compiler. (The WATCOM C/C++ compiler can be helped significantly with this sort of approach.)
This method replaces each bitmap graphics source data word with a specific CPU instruction to store it straight to graphics memory. The problem with it is that it chews up large amounts of instruction cache space. This is to be compared against a data copying routine which needs to read the source data from memory (and typically caches it.) Both use lots of cache space, but the compiled bitmap method uses far more, since it must encode a CPU store command for each source data word.
Furthermore, CPU performance is usually more sensitive to instruction cache performance than it is to data cache performance. The reason is that data manipulations and resource contentions can be managed by write buffers and modern CPU's ability to execute instructions out of order. With instruction data, if they are not in the cache, they must be prefetched, paying non-overlapping sequential penalties whenever the pre-fetch buffer runs out.
On older x86's this method worked well because the instruction prefetch penalties were paid on a per instruction basis regardless (there was no cache to put them into!) But starting with the 486, this was no longer a sensible solution since short loops paid no instruction prefetch penalties, which rendered the compiled bitmap technique completely useless.
This keyword is a complete placebo in most modern C compilers. Keep in mind that K&R and the ANSI committee did not design the C language to embody all of the performance characteristics of your CPU. The bulk of the burden of optimizing your C source is in the hands of your compiler's optimizer, which will typically have its own ideas about what variables should go where. If you are interested in the kind of optimization available from hand-assigning variables to registers, you are better off going to hand-rolled assembly rather than relying on these kinds of language features.
(Addenda: The only real purpose of "register" is to assert to the compiler that an auto variable is never addressed and therefore can never alias with any pointer. While this might be able to assist the compiler's optimizer, a good optimizing compiler is more than capable of deducing this feature of a local by itself.)
Most modern C compilers will alias local variables to your CPU's registers or to SRAM. Furthermore, if all variables in a given scope are local, then an optimizing compiler can forgo maintaining the variables outside that scope, and therefore has more simplification opportunities than with globals. So, in fact, you should find that the opposite tends to be true more of the time.
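A small sketch of why this is so: a global (or anything the compiler must assume can be aliased) has to be re-read and re-written conservatively, while a local copy can simply live in a register for the duration of the loop. The names here are hypothetical.

extern int error_count;                  /* global: the compiler must be cautious */

void scan(const int *data, int n) {
    int errors = error_count;            /* work on a local copy ...             */
    for (int i = 0; i < n; i++) {
        if (data[i] < 0) errors++;       /* ... which can stay in a register     */
    }
    error_count = errors;                /* write the global back just once      */
}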
The original reason int was put into the C language was so that the fastest data type on each platform remained abstracted away from the programmer himself. On modern 32 and 64 bit platforms, small data types like chars and shorts actually incur extra overhead when converting to and from the default machine word sized data type.
On the other hand, one must be wary of cache usage. Using packed data (and in this vein, small structure fields) for large data objects may pay larger dividends in global cache coherence, than local algorithmic optimization issues.
Most modern CPUs have a separate floating point unit that will execute in parallel to the main/integer unit. This means that you can simultaneously do floating point and integer calculations. While many processors can perform high throughput multiplies (the Pentium being an exception) general divides and modulos that are not a power of two are slow to execute (from Cray Super Computers right on down to 6502's; nobody has a really good algorithm to perform them in general.) Parallelism (via the usually undertaxed concurrent floating point units in many processors) and redundancy are often better bets than going to fixed point.
On the redundancy front, if you are dividing or calculating a modulo and if you know the divisor is fixed, or one of only a few possible fixed values there are ways to exploit fast integer (aka fixed point) methods.
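For example, division by a known constant can be turned into a multiply by a scaled reciprocal followed by a shift. A sketch: the constant 52429 is the rounded-up value of 2^19/10, and the identity below is exact for 16-bit unsigned inputs (where the intermediate product still fits in 32 bits).

#include <stdint.h>

static uint32_t div10(uint16_t n) {
    return ((uint32_t)n * 52429u) >> 19;   /* exact n/10 for 0 <= n <= 65535 */
}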
On the Pentium, the biggest concern is moving data around to and from the FPU and the main integer unit. Optimizing FPU usage takes careful programming; no x86 compiler I have seen does a particularly good job of this. To exploit maximum optimization potential, you are likely going to have to go to assembly language. As a rule of thumb: if you need many simple results as fast as possible, use fixed point; if you need only a few complicated results, use floating point. See Pentium Optimizations by Agner Fog for more information.
With the introduction of AMD's 3DNOW! SIMD floating point technology, these older rules about floating point performance have been turned upside down. Approximate (14/15 bits) divides, or reciprocal square roots can be computed in a single clock. Two multiplies and two adds can also be computed per clock allowing better than 1 gigaflop of peak performance. We are now at a point in the industry where floating point performance is truly matching the integer performance. With such technologies the right answer is to use the data type format that most closely matches its intended meaning.
You usually get a lot more mileage out of optimizing your code at a high level (not meaning to imply that you need a HLL to do this) first. At the very least, changes to the high level source will tend to affect more target code at one time than what you will be able to do in assembly language with the same effort. In more extreme cases, such as exponential (usually highly recursive) algorithms, thorough hand optimization often buys you significantly less than good up front design.
The cycle counts given in processor instruction lists are usually misleading about the real cycle expenditure of your code. They usually ignore the wait states required to access memory, or other devices (that usually have their own independent clocks.) They are also typically misleading with regards to hidden side effects of branch targets, pipelining and parallelism.
The Pentium can take up to 39 clocks to perform a floating point divide. However, the instruction can be issued in one clock, 39 clocks of integer calculations can then be done in parallel, and the result of the divide retrieved in another clock. That is about 41 clocks in total, but only two of those clocks are actually spent issuing and retrieving the results of the divide itself.
In fact, all modern x86 processors have internal clock timers that can be used to assist in getting real timing results, and Intel recommends that programmers use them to get accurate timing results. (See the RDTSC instruction as documented in Intel's processor architectural manuals.)
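A minimal sketch of using it, assuming a GCC or Clang style compiler on x86 (the __rdtsc() intrinsic comes from <x86intrin.h>; MSVC provides the same intrinsic via <intrin.h>). Out-of-order execution and clock throttling add noise, so take the best of several runs.

#include <stdint.h>
#include <x86intrin.h>

extern void code_under_test(void);       /* hypothetical routine to measure */

uint64_t cycles_for_code(void) {
    uint64_t start = __rdtsc();          /* read the time stamp counter */
    code_under_test();
    return __rdtsc() - start;            /* elapsed counter ticks */
}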
The benefits of assembly over C are the same under Windows or Linux as they are under DOS. This delusion doesn't have anything close to logic backing it up, and therefore doesn't deserve much comment.
See Iczelion's Win32 Assembly Home Page if you don't believe me.
This is not a technical belief -- it's a marketing one. It's one often heard from the folks that live in Redmond, WA. Even to the degree that it is true (in a very large software project, for example), it ignores the fact that optimal performance can be approached asymptotically with a finite, and usually acceptable, amount of effort. Using proper profiling and benchmarking, one can iteratively grab the "low hanging fruit", which will get most of the available performance.
Absolute optimization is also not a completely unattainable goal. Understanding the nature of your task, and bounding it by its input/output performance and the best possible algorithm in the middle, is in many cases quite doable.
For example, reading a file from disk, sorting its contents, and writing the result back out ought to be a very doable performance optimization exercise. (The input/output performance is known, and the algorithm in the middle is approachable by considering the nature of the input and going with a standard algorithm such as heap sort or radix sort.)
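A sketch of that exercise, assuming one integer per line on standard input, with error handling stripped down so the shape stays visible:

#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

int main(void) {
    int *v = NULL, n = 0, cap = 0, x;
    while (scanf("%d", &x) == 1) {          /* read */
        if (n == cap) {
            cap = cap ? cap * 2 : 1024;
            v = realloc(v, cap * sizeof *v);
        }
        v[n++] = x;
    }
    qsort(v, n, sizeof *v, cmp_int);        /* sort: O(n log n) */
    for (int i = 0; i < n; i++)
        printf("%d\n", v[i]);               /* write */
    free(v);
    return 0;
}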
Of course, the degree of analysis that you can apply to your specific problem will vary greatly depending on its nature. My main objection to this misconception is just that it cannot be applied globally.
Before:
if( Condition ) {
    Case A;
} else {
    Case B;
}
After:
Case B;
if( Condition ) {
    Undo Case B;
    Case A;
}
Before:
for(i=0;i<10;i++) {
    printf("%d\n",i*10);
}
After:
for(i=0;i<100;i+=10) {
    printf("%d\n",i);
}
Before:
char playfield[80][25];
After:
char playfield[80][32];
Before:
for(i=0;i<100;i++) {
    map[i].visited = 0;
}
After:
i=99;
do {
    map[i].visited = 0;
    i--;
} while(i>=0);
Before:
char x; int y; y = x;
After:
int x, y; y = x;
Before:
void swap(int *x, int *y) {
    int t;
    t = *y;
    *y = *x;
    *x = t;
}
After:
static void swap(int *x, int *y) {
    int t;
    t = *y;
    *y = *x;
    *x = t;
}
Before:
float f[100],sum;
...
/* Back to back dependent
   adds force the throughput
   to be equal to the total
   FADD latency. */
sum = 0;
for(i=0;i<100;i++) {
    sum += f[i];
}
After:
float f[100],sum0,sum1;
float sum2,sum3,sum;
...
/* Optimized for a 4-cycle
   latency fully pipelined
   FADD unit. The throughput
   is one add per clock. */
sum0 = sum1 = sum2 = sum3 = 0;
for(i=0;i<100;i+=4) {
    sum0 += f[i];
    sum1 += f[i+1];
    sum2 += f[i+2];
    sum3 += f[i+3];
}
sum = (sum0+sum1) + (sum2+sum3);
Also, in theory, a SIMD enabled compiler could direct the above to a SIMD instruction set (such as 3DNow!, SSE or AltiVec.) I definitely have not seen this yet.
Strictly for beginners
Techniques you might not be aware of if you have not been programming for the past 15 years of your life:
vincent@realistix.com says...
> Hi....I've got a question on optimizing critical loops.....
> In the context of C/C++, which is more expensive? IF statements or
> function calls? I need to choose between the two because of the
> rendering options that I pass to the rendering loop.
The ordering (fastest first) is roughly:
1. Well predicted if() statements (that means comparisons that follow either a very simple short pattern, or are heavily weighted, i.e., 90% or more, to one result over the other.)
2. Fixed function calls. The address does not change. A "static" function call will usually be faster, and the fewer parameters the better (though if declared static, sometimes this doesn't matter.)
3. Indirect function calls that have a high probability of going to the same address that they went to last time. Modern CPUs will predict that the indirect address will be the same as last time.
...
4. Non-predictable if() statements. This is when the pattern of the results of the comparison in the if() statement is not cyclical, and sways back and forth in a non-simplistic pattern. The penalty for this is *MUCH* higher than anything listed to this point so far. A CPU like the Athlon or P-!!! throws out about 10-18 clocks of work every time it guesses wrong (which you can expect to be about 50% of the time).
5. Changing indirect function calls. This is when the indirect function call target changes on nearly every call. The penalty is basically the same as a non-predictable branch, plus a few extra clocks (since it takes longer to fetch an address than to compute a comparison.)
---
The first three have relatively low clock counts and allow the CPU to perform parallel work if there are any opportunities. The last two lead the CPU to speculatively perform wrong work that needs to be undone (this is usually just a few clocks of additional overhead). The major point is that the CPU waits roughly the length of its pipeline before it can perform real work.
One way to work around a situation such as 4/5 is that if there are only two cases, you actually perform the work for both cases, and use a predicate mask to select between the answers. For example, to implement max, min and abs functions on an x86 compiler:
static int int_max(int a, int b) {
    b = a-b;
    a -= b & (b>>31);
    return a;
}
static int int_abs(int a) {
    return a - ((a+a) & (a>>31));
}
static int int_min(int a, int b) {
    b = b-a;
    a += b & (b>>31);
    return a;
}
Notice that there is no branching. Thus the predictability of whether one result or the other is chosen does not factor into the performance (no branch calculation is performed.)
--
Paul Hsieh
http://www.pobox.com/~qed/optimize.html
Chris Lomont wrote:
: Where do you get your ideas about experienced coders? On RISC chips
: and many of the pipelined, superscalar CPU's a good compiler beats out
: most ASM programmers due to the complex issues for instruction
: pairing ...
Most assembly programmers are neither knowledgeable nor good, but those who are knowledgeable and good have little trouble beating compilers, even on true RISC CPUs. A friend had an algorithm reimplemented (I'll redundantly stress that the algorithm was not changed) in PowerPC assembly by an Apple engineer. The result was a pixel doubling blitter that ran 3 to 5 times faster. I've personally seen a graphics intensive game gain 5 more frames per second (25, up from 20, on a P5-100) by replacing low level graphics code compiled by VC++ 5 with handwritten assembly. Again, same algorithm, different implementation.
: ... Your hand crafted 486 asm will probably run dog slow on later
: processors with different pairing rules, while a simple recompile on a
: decent PII compiler will make that C code surpass your old asm ...
You overstate the risk. Coincidentally, I recently recompiled some code with VC++ 5, optimizing for both Pentium and PentiumPro. Comparing some 3D point transformations coded in C and in Pentium assembly, the assembly code ran 30% faster than the Pentium optimized C code on a P5-100 and 12% faster than the PentiumPro optimized C code on a PII-300. While the older architecture's assembly code showed less of an improvement on the newer architecture, it was still faster, and using such old code would not be the liability you suggest. Perhaps when spanning more than two architectural generations, as in your 486 to PentiumPro example, there would be such a problem. But given the lifespan of typical products, especially games, that is not worth worrying about.
BTW, I'm not advocating that everything should be done in assembly. Just that claiming C compilers are hard to beat is naive, and that assembly is often a good solution at times.
For more, see Randall Hyde's Great Debate page. © Copyright 1996-2016, Paul Hsieh All Rights Reserved.