
I've read that loop unrolling can improve CPU performance by reducing the overhead of loop control and enabling more instruction-level parallelism. When unrolling a loop, we perform multiple operations per iteration, which increases the number of instructions per iteration but reduces the number of loop control checks.

I understand that excessive unrolling can lead to instruction cache misses, so the unroll factor should take this into account, and that the unroll factor should ideally divide the total number of iterations without a remainder (I'm not sure how the compiler handles the loop when it doesn't divide evenly).

So my question is: in cases where the number of loop iterations is not known until runtime (i.e., it's stored in a variable determined during execution), how does the compiler decide whether to unroll the loop? It seems unlikely that it would decide not to unroll, since if the number of iterations turns out to be high, CPU performance would be much worse by comparison.
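To make it concrete, this is the kind of transformation I have in mind: a hypothetical summing loop, unrolled 4x by hand, with a scalar remainder loop for the leftover iterations when the runtime count isn't a multiple of 4 (the function names are just for illustration):

```c
#include <stddef.h>

/* Plain loop: one add plus one loop-control check per element. */
long sum_simple(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* 4x unrolled by hand: four adds per loop-control check, plus a
 * scalar remainder loop for the 0-3 leftover elements when n is
 * not a multiple of 4. */
long sum_unrolled(const long *a, size_t n) {
    long s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        s += a[i] + a[i + 1] + a[i + 2] + a[i + 3];
    for (; i < n; i++)  /* remainder: handles the uneven part */
        s += a[i];
    return s;
}
```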

asked Jul 6 at 18:12
  • Possibly helpful Stack Overflow questions: (a) How can GCC unroll a loop if its number of iterations is unknown at compile time? (b) How do optimizing compilers decide when and how much to unroll a loop? Commented Jul 6 at 18:17
  • Dareen: I specialized for a time in aspects of compiler optimization. I can refer you to some worthwhile ideas about data locality. For example, the old Bulldog compiler thesis discusses some attempts at laying out data more optimally. (This may not necessarily mean laying out an array or matrix in the usual ways, but instead laying out the data to optimize access time, which is a different problem.) I am not going to write an answer; I'd need to regain comprehensiveness, and that's time I can't afford to offer. Commented Jul 6 at 19:12
  • Dareen: I'd recommend starting by getting familiar with the early thoughts from Bulldog. While glancing through it, some details are directly applicable even today. For more modern treatments you may want to start here. It's worth some time to go through it. Somewhat less directly valuable, there is this, too, to consider. And these two will also provide you with other excellent and highly related search phrases to use in your own investigations. Commented Jul 6 at 19:12
  • @dareen note that periblepsis' reference is historically very interesting! For any commercial CPU built in the last 30 years, it's not even close to describing what the hardware is actually doing, so I'd take it with a grain of salt. VLIW machines as sketched by that paper have failed spectacularly in the real world (read up on the Itanium disaster), and we have superscalar, predictive, reordering CPU cores now. The reordering is what breaks the core assumptions of that (interesting!) paper. So, forget it if you want to learn about machines with complex instruction words (SIMD, conditional loads). Commented Jul 6 at 20:43
  • Although there are a number of people on this Stack Exchange who are both willing and able to answer this question, I really believe it is off-topic here and belongs on softwareengineering.stackexchange.com, and I hope an administrator will take note and move it. Commented Jul 7 at 17:44

3 Answers


I've read that loop unrolling can improve CPU performance by reducing the overhead of loop control and enabling more instruction-level parallelism.

Yeah, sure, but aside from thrashing the instruction cache and reducing the chances that speculative execution is applicable, unrolling also increases memory bandwidth usage, makes it harder for the compiler to reason about variable interdependencies, and makes it harder for a CPU that internally renames registers to work out a good renaming scheme.

Note that you're making assumptions here: Namely, that the upsides of unrolling dominate the downsides.

In cases where the number of loop iterations is not known until runtime (i.e., it's stored in a variable determined during execution), how does the compiler decide whether to unroll the loop?

Unless there's a very obvious case where the unrolling helps a lot for a loop length of 1, 2, or maybe 4, it usually just doesn't. Instead, compilers can auto-vectorize to some degree, and can combine multiple loop iterations into a sequence of SIMD operations. This can be a complex optimization pass – your mileage might vary, a lot, if you're not careful about the program you're writing.

(Yes, at least GCC and LLVM/clang on x86_64 do a good job at vectorizing, and then, if the loop count is not actually a multiple of the SIMD width, adding dynamically invoked "remainder" operations. Optimization really goes a long way, these days!)
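To make that "vector body plus remainder" shape concrete, here is a hand-written C sketch of the structure a vectorizer typically emits; groups of 4 scalar operations stand in for one 4-wide SIMD instruction (the real codegen uses vector registers and is target-dependent, so treat this as an illustration, not actual compiler output):

```c
#include <stddef.h>

/* What the programmer writes: */
void scale(float *x, size_t n, float k) {
    for (size_t i = 0; i < n; i++)
        x[i] *= k;
}

/* The shape a vectorizer typically produces: a main body handling
 * 4 elements per step (standing in for one 4-wide SIMD multiply)
 * plus a scalar remainder loop, entered at runtime, for the last
 * n % 4 elements. */
void scale_vectorized_shape(float *x, size_t n, float k) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        x[i]     *= k;  /* one SIMD operation in real codegen */
        x[i + 1] *= k;
        x[i + 2] *= k;
        x[i + 3] *= k;
    }
    for (; i < n; i++)  /* dynamically invoked "remainder" part */
        x[i] *= k;
}
```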

it seems unlikely not to decide to unroll

I think the opposite is the case: on modern CPU cores with deep pipelines and a large register file (i.e., the ones that would actually benefit from unrolling), small conditional jumps are cheap, because they're highly optimized. In essence, "is A still smaller than B? Then jump back 200 bytes" gets optimized away, because the control flow has been established: if A was smaller than B 10,000 times, then on the 10,001st time it still is. Sure, at some point that prediction will be wrong, but the longer the loop, the cheaper that misprediction is, amortized per iteration. At the point of the check, the first instructions of the next iteration are already being computed.

That puts the "oh no, I need to check the loop condition" overhead, especially for numerical operations, deep in the land of "not as scary as you think".

since if the number of iterations is high then the cpu performance will be much worse in comparison

No, the opposite is the case: Unrolling large loops is really bad for performance, at least on the superscalar, deep-pipeline CPUs of today where you'd want to unroll in the first place.

answered Jul 6 at 18:41
  • This is really helpful; in our course we did not touch on any downsides of unrolling. Commented Jul 6 at 18:43
  • So I was mostly thinking positively about it, though I thought unrolling more than 8x or so might have disadvantages. Commented Jul 6 at 18:44
  • Maybe a follow-up question: why would unrolling large loops be bad? I assumed that as the unrolled size increases we spend fewer cycles on control instructions. Commented Jul 6 at 18:46
  • @dareen rule of thumb: simply write a program that computes exactly what you need, and let the compiler handle optimization. Unless you're an expert in your target architecture (or can spend time looking at the generated assembly and reasoning about what the compiler does), chances are you're not going to write faster code than a straightforward implementation. Commented Jul 6 at 18:47
  • For large iteration counts, you're basically making no use of the instruction cache if you unroll. That means your CPU can easily spend a large multiple of its compute time waiting for new instructions to be fetched from memory. A terrible deal. Also, as my answer explains: for large iteration counts, the one-time mispredicted conditional jump at the end of the loop is amortized across all iterations, and for all but the last iteration that jump simply takes no time, because the CPU already starts the next iteration before the check has finished. Commented Jul 6 at 18:49

As many have said, don't unroll loops unless you know what you and your compiler and CPU are doing.

It might help on old/simple computer architectures like the PDP-11. Even on 8-bit MCUs like the AVR, a tight single-cycle loop body costs 2 extra cycles for the loop-control jump, so two thirds of the CPU time is wasted on looping.

This is why a programmer may come up with the idea of unrolling the loop manually, doing more useful work per iteration and reducing the loop overhead.

See Duff's Device for a C implementation.
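For the curious, here is a minimal sketch of such an implementation (my own illustration, copying `n` chars, 8 per trip around the loop; the `switch` jumping into the middle of the `do`/`while` disposes of the `n % 8` remainder on the first, partial trip, so no separate remainder loop is needed):

```c
#include <stddef.h>

/* Minimal Duff's Device: copies n chars, 8 per trip around the
 * loop. The switch jumps into the middle of the do/while body to
 * handle the n % 8 remainder up front. */
void duff_copy(char *to, const char *from, size_t n) {
    if (n == 0)
        return;
    size_t trips = (n + 7) / 8;  /* total trips, rounded up */
    switch (n % 8) {
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--trips > 0);
    }
}
```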

answered Jul 6 at 19:16
  • @Justme there isn't any. Case labels may appear anywhere within a switch statement (at any degree of nesting), jumping into loops is specifically allowed (6.8.6.1.3 isn't restricted to goto), and the loop is still required to loop. It's easy to read it as something invalid because of the typography, and because we forget that case labels are just labels, not anything like a scope or a compound statement. Commented Jul 7 at 5:01
  • @hobbs Justme, you're right; I had it in my mind that C89 forbade control structures interspersed with switch/case labels, but that's not the case. Sorry! (It's still ugly code, a statement I'm not going to apologize for, considering how ugly it is.) Commented Jul 7 at 8:20
  • @Justme maybe it's interesting for your answer: I sat down and proved to myself that I was wrong, by writing a minimal Duff's Device implementation (doubles the unsigned chars in an array) and compiling it for an AVR using GCC: gcc.godbolt.org/z/sehGhEa84 . The takeaway is that for the tightest loop I could think of (do { *(array++) = (*array) + (*array); } while( length--);), the normal loop has five instructions of "loop overhead": add, adc, cp, cpc and finally brne. Each of them is 1 cycle, no memory access. Comparing that to the assembly of Duff's Device doing the same, Commented Jul 7 at 8:46
  • note that we do get savings, I think, but I'm having a non-trivial time reading the assembly. At the end of each iteration, we get 5 extra instructions to calculate the jump address in R30:R31/Z, but the thing is that I really don't know how expensive ijmp is on AVR – it's an indirect jump, so that'd be 2 cycles. And, for some reason I don't quite follow, AVR-GCC inserts an update involving R30:R31 and R26:R27 before every ld,add,st "payload" sequence, so per iteration we save 5 cycles, lose 2, and add, err, I think 8 cycles every 8 iterations? Not sure Commented Jul 7 at 8:52
  • about the loop tail/head there. Which would be quite solid – saving 2 cycles per iteration on average. Commented Jul 7 at 8:53

On some processors, processing a loop as though it were written:

 check loop condition
 if satisfied, goto exit
loop:
 do stuff
 check loop condition
 if satisfied, goto exit
 do stuff
 check loop condition
 if satisfied, goto exit
 do stuff
 check loop condition
 if satisfied, goto exit
 do stuff
 check loop condition
 if not satisfied, goto loop
exit:

may save a little bit of time on three out of every four loop iterations, generally without any performance downside beyond the increase in code size. The performance benefit will be minor. Even though clang will sometimes unroll loops dozens of times using this pattern, the majority of the potential benefit achievable this way is already captured by 2x unrolling, and a majority of what remains by 4x unrolling.
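The same shape can be sketched in C without goto: a hypothetical summing loop, 4x unrolled with a check before every body, so any runtime count works without a separate remainder loop. The three inner breaks are the mostly-not-taken forward branches, and the while back edge is taken only once per four iterations:

```c
#include <stddef.h>

/* Sums a[0..n-1], 4x unrolled with a check ahead of each body.
 * The three breaks are forward branches that are rarely taken;
 * the while's backward branch is taken once per four iterations. */
long sum_unrolled_checked(const long *a, size_t n) {
    long s = 0;
    size_t i = 0;
    while (i < n) {
        s += a[i++]; if (i >= n) break;
        s += a[i++]; if (i >= n) break;
        s += a[i++]; if (i >= n) break;
        s += a[i++];
    }
    return s;
}
```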

answered Jul 7 at 15:36
