I wrote this simple loop in C++ (Visual Studio 2019, Release mode) that walks through one million 32-bit integers and copies each element into the previous one:

 for ( int i = 1; i < Iters; i++ )
 ints32[i - 1] = ints32[i];

And here's a loop that's the same except it uses +=:

 for ( int i = 1; i < Iters; i++ )
 ints32[i - 1] += ints32[i];

I'm measuring both of these functions by running them 20 times, discarding the 5 slowest and 5 fastest runs, and averaging the rest. (Before each measurement, I fill the array with random integers.) I find that the assign-only loop takes around 460 microseconds, but the loop with addition takes 320 microseconds. These measurements are consistent across multiple runs (they vary a bit, but the addition is always faster), even when I change the order in which I measure the two functions.

Why is the loop with addition faster than the loop without addition? I would think the addition would make it take longer, if anything.

Here is the disassembly. You can see that the two functions are essentially the same, except that the addition loop does more work (and keeps eax pointing at ints32[i - 1] instead of ints32[i]).

 for ( int i = 1; i < Iters; i++ )
00541670 B8 1C 87 54 00 mov eax,54871Ch 
 ints32[i - 1] = ints32[i];
00541675 8B 08 mov ecx,dword ptr [eax] 
00541677 89 48 FC mov dword ptr [eax-4],ecx 
0054167A 83 C0 04 add eax,4 
0054167D 3D 18 90 91 00 cmp eax,offset floats32 (0919018h) 
00541682 7C F1 jl IntAssignment32+5h (0541675h) 
}
00541684 C3 ret 
 for ( int i = 1; i < Iters; i++ )
00541700 B8 18 87 54 00 mov eax,offset ints32 (0548718h) 
 ints32[i - 1] += ints32[i];
00541705 8B 08 mov ecx,dword ptr [eax] 
00541707 03 48 04 add ecx,dword ptr [eax+4] 
0054170A 89 08 mov dword ptr [eax],ecx 
0054170C 83 C0 04 add eax,4 
0054170F 3D 14 90 91 00 cmp eax,919014h 
00541714 7C EF jl IntAddition32+5h (0541705h) 
}
00541716 C3 ret 

(The int array is volatile because I didn't want the compiler to optimize the loops away or vectorize them, and indeed the disassembly it produces is exactly what I wanted to measure.)


Edit: I'm noticing that when I change seemingly unrelated things about the program, the assignment version gets faster, and then slower again. I suspect it may be related to the alignment of the function's code, maybe?
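One quick way to check that (a sketch only; the helper name is made up and it is not part of the measured program) is to print where the two loop functions and the array land in memory, and compare those addresses between builds that time differently. Casting a function pointer to void* isn't strictly portable, but it works on MSVC:

 #include <cstdio>

 // Hypothetical helper: dump the addresses of the two loop functions and of the
 // array, so their placement/alignment can be compared between builds.
 void PrintPlacement()
 {
     printf( "IntAssignment32: %p\n", (void *)&IntAssignment32 );
     printf( "IntAddition32:   %p\n", (void *)&IntAddition32 );
     printf( "ints32:          %p\n", (void *)&ints32[0] );
 }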

I'm using the default Visual Studio 2019 Win32 Release compiler options (copied from the project properties):

/permissive- /ifcOutput "Release\" /GS /GL /analyze- /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /sdl /Fd"Release\vc142.pdb" /Zc:inline /fp:precise /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /Gd /Oy- /Oi /MD /FC /Fa"Release\" /EHsc /nologo /Fo"Release\" /Fp"Release\profile.pch" /diagnostics:column 

Compiler version: Microsoft Visual Studio Community 2019 Version 16.11.15

Here's the complete code:

#include <iostream>
#include <cstdio>
#include <cstdint>
#include <chrono>
#include <random>
#include <algorithm>

const int ValueRange = 100000000;
std::default_random_engine generator;
std::uniform_int_distribution< int > distribution( 1, ValueRange - 1 );

const int Iters = 1000000;
volatile int ints32[Iters]; // volatile so the compiler doesn't optimize the loops away or vectorize them

void InitArrayInt32()
{
    for ( int i = 0; i < Iters; i++ )
        ints32[i] = distribution( generator );
}

const int SampleCount = 20;
const int KeepSampleCount = SampleCount - 2 * (SampleCount / 4);

// Runs func() SampleCount times, drops the fastest and slowest quarter of the
// samples, and returns the average of the rest, in microseconds.
float ProfileFunction( void(*setup)(), void(*func)() )
{
    uint64_t times[SampleCount];
    for ( int i = 0; i < SampleCount; i++ )
    {
        setup();
        auto startTime = std::chrono::steady_clock::now();
        func();
        auto endTime = std::chrono::steady_clock::now();
        times[i] = std::chrono::duration_cast<std::chrono::microseconds>( endTime - startTime ).count();
    }
    std::sort( times, times + SampleCount );
    uint64_t total = 0;
    for ( int i = SampleCount / 4; i < SampleCount - SampleCount / 4; i++ )
        total += times[i];
    return total * (1.0f / KeepSampleCount);
}

void IntAssignment32()
{
    for ( int i = 1; i < Iters; i++ )
        ints32[i - 1] = ints32[i];
}

void IntAddition32()
{
    for ( int i = 1; i < Iters; i++ )
        ints32[i - 1] += ints32[i];
}

int main()
{
    float assignment = ProfileFunction( InitArrayInt32, IntAssignment32 );
    float addition = ProfileFunction( InitArrayInt32, IntAddition32 );
    printf( "assignment: %g\n", assignment );
    printf( "addition: %g\n", addition );
    return 0;
}
asked Jul 3, 2022 at 22:15
  • "I'm measuring both of these functions by running them 20 times": how exactly do you measure the loops? Also, what compiler and options do you use here? Commented Jul 3, 2022 at 22:28
  • @TheDreamsWind OK, I overlooked that. Still, I'd vote for checking a cache effect: the first loop touches [eax] and then [eax-4], so the cache line is likely brought in by the access at [eax]; whereas in the second loop the accesses are [eax] and then [eax+4], i.e., in order. Commented Jul 3, 2022 at 22:31
  • When microbenchmarking, a full minimal reproducible example is needed, because there are thousands of ways to mess up a microbenchmark, and by far the most likely reason you get the results you do is an artifact of how the benchmark was done. Most people assume they did it right, and most people are wrong; the odds that you are the one person who didn't mess up a microbenchmark should seem low even to you. Commented Jul 3, 2022 at 22:48
  • While trying to write an answer to some of these questions, I noticed that changing seemingly unrelated things about the program (like the printf that shows me the results) changed the results. The assignment loop got faster (280 microseconds), then switched back to slow, then fast again, as I changed things. I suspect it might be related to the alignment of the function in memory; note that in the disassembly I posted, the assignment function straddles a 128-byte boundary, so maybe that's the problem. Is there a way to make VS align functions better? Commented Jul 3, 2022 at 23:01
  • Your program runs pretty quickly for a benchmark. The frequency of modern processors is dynamic, so it is better to put a loop in the main function, something like >= 5 iterations, at least to see whether the timings are stable within a given execution and between executions, and to understand possible sources of variation. On my i5-9600KF PC with GCC (-O3) I get consistent results: the addition is always slower (10-15%). Commented Jul 3, 2022 at 23:38

1 Answer


Edit: I'm noticing that when I change seemingly unrelated things about the program, the assignment version gets faster, and then slower again. I suspect it may be related to the alignment of the function's code, maybe?

The reason why one is faster than the other is luck. The variance in speed caused by the alignment of the code and the data has far more influence than the minor difference in the generated code.

Even though you measure multiple times, every measurement sees the same alignment (give or take address space layout randomization). That removes the noise from the scheduler and from interrupts, but not the noise from alignment changes.

But when you change something in the code, you change the alignment: you shift the array by 4 bytes, say, or shift the code by 1 byte. That can actually have a larger impact on speed than the difference between -O0 and -O2 in gcc. There are plenty of papers on this and related effects.
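A minimal sketch (my own illustration, reusing the names from the question's program) of how to make this less of a lottery: pin the array's alignment explicitly and repeat the whole comparison several times in main, so that a single lucky code/data placement does not decide the result.

 #include <cstdio>

 // Pin the data alignment instead of leaving it to the linker.
 alignas( 64 ) volatile int ints32[Iters];

 int main()
 {
     // Repeat the comparison: if the ordering only changes when the binary
     // changes (not between runs of the same binary), code/data placement is
     // the likely culprit rather than scheduling noise.
     for ( int run = 0; run < 5; run++ )
     {
         float assignment = ProfileFunction( InitArrayInt32, IntAssignment32 );
         float addition   = ProfileFunction( InitArrayInt32, IntAddition32 );
         printf( "run %d: assignment %g us, addition %g us\n", run, assignment, addition );
     }
     return 0;
 }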

answered Jul 3, 2022 at 23:47