I wrote this simple loop in C++ (Visual Studio 2019, Release mode) that walks through one million 32-bit integers and copies each element into the previous one:

 for ( int i = 1; i < Iters; i++ )
 ints32[i - 1] = ints32[i];

And here's a loop that's the same except it uses +=:

 for ( int i = 1; i < Iters; i++ )
 ints32[i - 1] += ints32[i];

I'm measuring both of these functions by running them 20 times, discarding the 5 slowest and 5 fastest runs, and averaging the rest. (Before each measurement, I fill the array with random integers.) I find that the assign-only loop takes around 460 microseconds, but the loop with addition takes 320 microseconds. These measurements are consistent across multiple runs (they vary a bit, but the addition is always faster), even when I change the order in which I measure the two functions.

Why is the loop with addition faster than the loop without addition? I would think the addition would make it take longer, if anything.

Here is the disassembly. You can see that the two functions are essentially the same, except that the addition loop does more work (and keeps eax pointing at ints32[i - 1] instead of ints32[i]).

 for ( int i = 1; i < Iters; i++ )
00541670 B8 1C 87 54 00 mov eax,54871Ch 
 ints32[i - 1] = ints32[i];
00541675 8B 08 mov ecx,dword ptr [eax] 
00541677 89 48 FC mov dword ptr [eax-4],ecx 
0054167A 83 C0 04 add eax,4 
0054167D 3D 18 90 91 00 cmp eax,offset floats32 (0919018h) 
00541682 7C F1 jl IntAssignment32+5h (0541675h) 
}
00541684 C3 ret 
 for ( int i = 1; i < Iters; i++ )
00541700 B8 18 87 54 00 mov eax,offset ints32 (0548718h) 
 ints32[i - 1] += ints32[i];
00541705 8B 08 mov ecx,dword ptr [eax] 
00541707 03 48 04 add ecx,dword ptr [eax+4] 
0054170A 89 08 mov dword ptr [eax],ecx 
0054170C 83 C0 04 add eax,4 
0054170F 3D 14 90 91 00 cmp eax,919014h 
00541714 7C EF jl IntAddition32+5h (0541705h) 
}
00541716 C3 ret 

(The int array is volatile because I didn't want the compiler to optimize the loops away or vectorize them, and indeed the disassembly it produces is exactly what I wanted to measure.)


Edit: I'm noticing that when I change seemingly unrelated things about the program, the assignment version gets faster, and then slower again. I suspect it may be related to the alignment of the function's code, maybe?
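One quick way to check that (a sketch only; the helper name is made up and it is not part of the measured program) is to print where the two loop functions and the array land in memory, and compare those addresses between builds that time differently. Casting a function pointer to void* isn't strictly portable, but it works on MSVC:

 #include <cstdio>

 // Hypothetical helper: dump the addresses of the two loop functions and of the
 // array, so their placement/alignment can be compared between builds.
 void PrintPlacement()
 {
     printf( "IntAssignment32: %p\n", (void *)&IntAssignment32 );
     printf( "IntAddition32:   %p\n", (void *)&IntAddition32 );
     printf( "ints32:          %p\n", (void *)&ints32[0] );
 }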

I'm using the default Visual Studio 2019 Win32 Release compiler options (copied from the project properties):

/permissive- /ifcOutput "Release\" /GS /GL /analyze- /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"Release\vc142.pdb" /Zc:inline /fp:precise /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /Gd /Oy- /Oi /MD /FC /Fa"Release\" /EHsc /nologo /Fo"Release\" /Fp"Release\profile.pch" /diagnostics:column 

Compiler version: Microsoft Visual Studio Community 2019 Version 16.11.15

Here's the complete code:

#include <iostream>
#include <cstdio>
#include <cstdint>
#include <chrono>
#include <random>
#include <algorithm>

const int ValueRange = 100000000;
std::default_random_engine generator;
std::uniform_int_distribution< int > distribution( 1, ValueRange - 1 );

const int Iters = 1000000;
volatile int ints32[Iters]; // volatile so the compiler doesn't optimize the loops away or vectorize them

void InitArrayInt32()
{
    for ( int i = 0; i < Iters; i++ )
        ints32[i] = distribution( generator );
}

const int SampleCount = 20;
const int KeepSampleCount = SampleCount - 2 * (SampleCount / 4);

// Runs func() SampleCount times, drops the fastest and slowest quarter of the
// samples, and returns the average of the rest, in microseconds.
float ProfileFunction( void(*setup)(), void(*func)() )
{
    uint64_t times[SampleCount];
    for ( int i = 0; i < SampleCount; i++ )
    {
        setup();
        auto startTime = std::chrono::steady_clock::now();
        func();
        auto endTime = std::chrono::steady_clock::now();
        times[i] = std::chrono::duration_cast<std::chrono::microseconds>( endTime - startTime ).count();
    }
    std::sort( times, times + SampleCount );
    uint64_t total = 0;
    for ( int i = SampleCount / 4; i < SampleCount - SampleCount / 4; i++ )
        total += times[i];
    return total * (1.0f / KeepSampleCount);
}

void IntAssignment32()
{
    for ( int i = 1; i < Iters; i++ )
        ints32[i - 1] = ints32[i];
}

void IntAddition32()
{
    for ( int i = 1; i < Iters; i++ )
        ints32[i - 1] += ints32[i];
}

int main()
{
    float assignment = ProfileFunction( InitArrayInt32, IntAssignment32 );
    float addition = ProfileFunction( InitArrayInt32, IntAddition32 );
    printf( "assignment: %g\n", assignment );
    printf( "addition: %g\n", addition );
    return 0;
}
asked Jul 3, 2022 at 22:15
  • "I'm measuring both of these functions by running them 20 times": how exactly do you measure the loops? Also, what compiler and options do you use here? Commented Jul 3, 2022 at 22:28
  • @TheDreamsWind OK, I overlooked that. Still, I'd vote for checking a cache effect: the first loop touches [eax] and then [eax-4], so the cache line is likely brought in by the access at [eax]; whereas in the second loop the accesses are [eax] and then [eax+4], i.e., in order. Commented Jul 3, 2022 at 22:31
  • When microbenchmarking, a full minimal reproducible example is needed, because there are thousands of ways to mess up a microbenchmark, and by far the most likely reason you get the results you do is an artifact of how the benchmark was done. Most people assume they did it right, and most people are wrong; the odds that you are the one person who didn't mess up a microbenchmark should seem low even to you. Commented Jul 3, 2022 at 22:48
  • While trying to write an answer to some of these questions, I noticed that changing seemingly unrelated things about the program (like the printf that shows me the results) changed the results. The assignment loop got faster (280 microseconds), then switched back to slow, then fast again, as I changed things. I suspect it might be related to the alignment of the function in memory; note that in the disassembly I posted, the assignment function straddles a 128-byte boundary, so maybe that's the problem. Is there a way to make VS align functions better? Commented Jul 3, 2022 at 23:01
  • Your program runs pretty quickly for a benchmark. The frequency of modern processors is dynamic, so it is better to put a loop in the main function, something like >= 5 iterations, at least to see whether the timings are stable within a given execution and between executions, and to understand possible sources of variation. On my i5-9600KF PC with GCC (-O3) I get consistent results: the addition is always slower (10-15%). Commented Jul 3, 2022 at 23:38

1 Answer


Edit: I'm noticing that when I change seemingly unrelated things about the program, the assignment version gets faster, and then slower again. I suspect it may be related to the alignment of the function's code, maybe?

The reason why one is faster than the other is luck. The variance in speed caused by the alignment of the code and the data has far more influence than the minor difference in the generated code.

Even though you measure multiple times, every measurement sees the same alignment (give or take address space layout randomization). That removes the noise from the scheduler and from interrupts, but not the noise from alignment changes.

But when you change something in the code, you change the alignment: you shift the array by 4 bytes, say, or shift the code by 1 byte. That can actually have a larger impact on speed than the difference between -O0 and -O2 in gcc. There are plenty of papers on this and related effects.
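A minimal sketch (my own illustration, reusing the names from the question's program) of how to make this less of a lottery: pin the array's alignment explicitly and repeat the whole comparison several times in main, so that a single lucky code/data placement does not decide the result.

 #include <cstdio>

 // Pin the data alignment instead of leaving it to the linker.
 alignas( 64 ) volatile int ints32[Iters];

 int main()
 {
     // Repeat the comparison: if the ordering only changes when the binary
     // changes (not between runs of the same binary), code/data placement is
     // the likely culprit rather than scheduling noise.
     for ( int run = 0; run < 5; run++ )
     {
         float assignment = ProfileFunction( InitArrayInt32, IntAssignment32 );
         float addition   = ProfileFunction( InitArrayInt32, IntAddition32 );
         printf( "run %d: assignment %g us, addition %g us\n", run, assignment, addition );
     }
     return 0;
 }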

answered Jul 3, 2022 at 23:47