reason why custom loop is faster? bad compiler? unsafe custom code? luck?(lucky cache hits)

Question 1

i just started learning assembly and making some custom loop for swapping two variables using C++ 's asm{} body with Digital-Mars compiler in C-Free 5.0

Enabled the -o (optimization)

And got the results:

 time of for-loop(cycles) 844
 time of while-loop(cycles) 735
 time of custom-loop-1(cycles) 562
 time of custom-loop-2(cycles) 469

i couldnt find Digital-Mars compiler "asm output" option to compare. There is no other optimisation options in the build options. Should i change my compiler? if yes, which one? Can you look at the codes below and tell me why custom loops are faster?

here is the standard for loop:

t1=clock(); 
for(int i=0;i<200000000;i++)
{
 temp=a;//instruction 1
 a=b;//instruction 2
 b=temp;//3 instructions total 
} 
t2=clock();
printf("\n time of for-loop(increasing) %i \n",(t2-t1));

here is the standard while loop:

t1=clock();
while(j<200000000)
{
 temp=a;//again it is three instructions
 a=b;
 b=temp; 
 j++;
}
t2=clock();
printf("\n time of while-loop(cycles) %i \n",(t2-t1));

here is my custom loop 1:

t1=clock();
j=200000000;//setting the count
 __asm
 {
 pushf //backup
 push eax //backup
 push ebx //backup
 push ecx //backup
 push edx //backup
 mov ecx,0 //init of loop range(0 to 200000000)
 mov edx,j
 do_it_again: //begin to loop
 mov eax,a //basic swap steps between cpu and mem(cache)
 mov ebx,b 
 mov b,eax 
 mov a,ebx //four instructions total
 inc ecx // j++
 cmp ecx,edx //i<200000000 ?
 jb do_it_again // end of loop block
 pop edx //rolling back to history 
 pop ecx 
 pop ebx 
 pop eax 
 popf 
 }
t2=clock();
printf("\n time of custom-loop-1(cycles) %i \n",(t2-t1));

here is my second custom loop:

t1=clock();
j=200000000;//setting the count
 __asm
 {
 pushf //backup
 push eax 
 push ebx 
 push ecx 
 push edx 
 mov ecx,0 //init of loop range(0 to 200000000)
 mov edx,j
 mov eax,a //getting variables to registers
 mov ebx,b
 do_it_again2: //begin to loop
 //swapping with using only 2 variables(only in cpu)
 sub eax,ebx //a is now a-b
 add ebx,eax //b is now a
 sub eax,ebx //a is now -b
 xor eax,80000000h //a is now b and four instructions total
 inc ecx // j++
 cmp ecx,edx //i<200000000 ?
 jb do_it_again2 // end of loop block
 pop edx //rollback
 pop ecx 
 pop ebx 
 pop eax 
 popf 
 }
t2=clock();
printf("\n time of custom-loop-2(cycles) %i \n",(t2-t1));

full code:

#include<stdio.h>
#include<stdlib.h>
#include<time.h>
int main()
{
int j=0;
int a=0,b=0,temp=0;
srand(time(0));
time_t t1=0;
time_t t2=0;
t1=clock(); 
for(int i=0;i<200000000;i++)
{
 temp=a;//instruction 1
 a=b;//instruction 2
 b=temp;//3 instructions total 
} 
t2=clock();
printf("\n time of for-loop(cycles) %i \n",(t2-t1));
t1=clock();
while(j<200000000)
{
 temp=a;//again it is three instructions
 a=b;
 b=temp; 
 j++;
}
t2=clock();
printf("\n time of while-loop(cycles) %i \n",(t2-t1));
t1=clock();
j=200000000;//setting the count
 __asm
 {
 pushf //backup
 push eax //backup
 push ebx //backup
 push ecx //backup
 push edx //backup
 mov ecx,0 //init of loop range(0 to 200000000)
 mov edx,j
 do_it_again: //begin to loop
 mov eax,a //basic swap steps between cpu and mem(cache)
 mov ebx,b 
 mov b,eax 
 mov a,ebx //four instructions total
 inc ecx // j++
 cmp ecx,edx //i<200000000 ?
 jb do_it_again // end of loop block
 pop edx //rolling back to history 
 pop ecx 
 pop ebx 
 pop eax 
 popf 
 }
t2=clock();
printf("\n time of custom-loop-1(cycles) %i \n",(t2-t1));
t1=clock();
j=200000000;//setting the count
 __asm
 {
 pushf //backup
 push eax 
 push ebx 
 push ecx 
 push edx 
 mov ecx,0 //init of loop range(0 to 200000000)
 mov edx,j
 mov eax,a //getting variables to registers
 mov ebx,b
 do_it_again2: //begin to loop
 //swapping with using only 2 variables(only in cpu)
 sub eax,ebx //a is now a-b
 add ebx,eax //b is now a
 sub eax,ebx //a is now -b
 xor eax,80000000h //a is now b and four instructions total
 inc ecx // j++
 cmp ecx,edx //i<200000000 ?
 jb do_it_again2 // end of loop block
 pop edx //rollback
 pop ecx 
 pop ebx 
 pop eax 
 popf 
 }
t2=clock();
printf("\n time of custom-loop-2(cycles) %i \n",(t2-t1));
return 0;
}

i am just learning c++ and assembly and wondered how things going on. Thank you

windows xp, pentium 4 (2 GHz) Digital-Mars in C-Free

Question 2

Did you disassemble the code the compiler generated? That'd be at least a hint as to what it's spending its cycles on that your code isn't.

Question 3

Do you really build with optimizations on? I once did similar expirements with memcmp and expirienced that the compiled code was slightly faster besides calls etc.. Such questions are barely answereable and always specific for the concrete case.

Question 4

BTW, in this case, the while and for loops could conceivably be optimized away, and replaced with j=200000000; temp=b;, since the variables involved aren't objects or volatile and the iteration count is constant. Not sure why it isn't, other than maybe there being too much code still out there that likes to busy-wait...

Question 5

You might want to try passing -o+all to dmc to see if it makes a difference. The -cod option is supposed to generate an assembly file, but it appears to need the obj2asm tool which isn't in the free package as far as I can tell. You need to be careful with simple benchmarks - a good optimizer can eliminate the C loops altogether since there's no visible change in the variable states. For example, GCC gets rid of the loops entirely with -O2.

Question 6

@DanielKamilKozar: Why truncate a great quote. "We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil," Donald Knuth, Structured Programming with Goto Statements

Question 7

The code generated by that compiler is pretty horrible. After disassembling the object file with objconv, here's what I got in regards to the first for loop.

?_001: cmp dword [ebp-4H], 200000000 ; 0053 _ 81. 7D, FC, 0BEBC200
 jge ?_002 ; 005A _ 7D, 17
 inc dword [ebp-4H] ; 005C _ FF. 45, FC
 mov eax, dword [ebp-18H] ; 005F _ 8B. 45, E8
 mov dword [ebp-10H], eax ; 0062 _ 89. 45, F0
 mov eax, dword [ebp-14H] ; 0065 _ 8B. 45, EC
 mov dword [ebp-18H], eax ; 0068 _ 89. 45, E8
 mov eax, dword [ebp-10H] ; 006B _ 8B. 45, F0
 mov dword [ebp-14H], eax ; 006E _ 89. 45, EC
 jmp ?_001 ; 0071 _ EB, E0

The issues should be clear to anybody who ever looked at some assembly.

The loop is very tightly dependent on the value that is put in eax. This makes any out-of-order execution practically impossible due to dependencies created on that register by every next instruction.
There are six general-purpose registers available (since ebp and esp aren't really general-purpose in most of the setups), but your compiler uses none of them, falling back to using the local stack. This is absolutely unacceptable when speed is the optimization goal. We can even see that the current loop index is stored at [ebp-4H], while it could've been easily stored in a register.
The cmp instruction uses a memory and an immediate operand. This is the slowest possible mix of operands and should never be used when performance is at stake.
And don't get me started on the code size. Half of those instructions are just unnecessary.

All in all, the first thing I'd do is ditch that compiler at the earliest possible chance. But then again, seeing that it offers "memory models" as one of its options, one can't really seem to have much hope.

Question 8

is accessing to memory points which are not a multiple of 4, slower?

Question 9

Yes, alignment can be an issue - but in this case, the problem lies in a crappy compiler which doesn't seem to know any other registers than the accumulator.

Question 10

i have pentium m of a laptop. how many registers can i have? ax bx cx dx ex fx? or ei bi ci di like?

Question 11

Please read the Intel Basic Architecture Manual before running your compiler/assembler again.

Question 12

It's a bit hard to guess what your compiler may be doing without seeing the assembly language result it creates. With VC++ 10, I get the following results:

time of for-loop(cycles) 155
time of while-loop(cycles) 158
time of custom-loop-1(cycles) 369
time of custom-loop-2(cycles) 314

I didn't look at the output, but my immediate guess would be that the difference between the for and while loops is just noise. Both are obviously quite a bit faster than your hand-written assembly code though.

Edit: looking at the assembly code, I was right -- the code for the for and the while is identical. It looks like this:

 call _clock
 mov ecx, DWORD PTR _a$[ebp]
 cdq
 mov ebx, edx
 mov edx, DWORD PTR _b$[ebp]
 mov edi, eax
 mov esi, 200000000
$LL2@main:
; Line 28
 dec esi
; Line 30
 mov eax, ecx
; Line 31
 mov ecx, edx
; Line 32
 mov edx, eax
 jne SHORT $LL2@main
 mov DWORD PTR _b$[ebp], edx
 mov DWORD PTR _a$[ebp], ecx
; Line 35
 call _clock

While arguably less "clever" than your second loop, modern CPUs tend to do best with simple code. It also just has fewer instructions inside the loop (and doesn't reference memory inside the loop at all). Those aren't the sole measures of efficiency by any means, but with this simple of a loop, they're fairly indicative.

Edit 2:

Just for fun, I wrote a new version that adds the triple-XOR swap, as well as one using the CPU's xchg instruction (just because that's how I'd probably write it by hand if I didn't care much about speed, etc.) Though Intel/AMD generally recommend against the more complex instructions, it doesn't seem to cause a problem -- it seems to be coming out at least as fast as anything else:

 time of for-loop(cycles) 156
 time of while-loop(cycles) 160
 time swap between register and cache 284
 time to swap using add/sub: 308
 time to swap using xchg: 155
 time to swap using triple-xor 233

Source:

// Note: updated source -- it was just too ugly to live. Same results though.
#include<stdlib.h>
#include<time.h>
#include <iostream>
#include <string>
#include <iomanip>
#include <sstream>
namespace { 
 int a, b;
 const int loops = 200000000;
}
template <class swapper>
struct timer {
 timer(std::string const &label) { 
 clock_t t1 = clock();
 swapper()();
 clock_t t2 = clock();
 std::ostringstream buffer;
 buffer << "Time for swap using " << label;
 std::cout << std::left << std::setw(30) << buffer.str() << " = " << (t2-t1) << "\n";
 }
};
struct for_loop {
 void operator()() {
 int temp;
 for(int i=0;i<loops;i++) {
 temp=a;//instruction 1
 a=b;//instruction 2
 b=temp;//3 instructions total 
 }
 }
};
struct while_loop {
 void operator()() { 
 int j = 0;
 int temp;
 while(j<loops) {
 temp=a;//again it is three instructions
 a=b;
 b=temp; 
 j++;
 }
 }
};
struct reg_mem {
 void operator()() {
 int j=loops;//setting the count
 __asm {
 mov ecx,0 //init of loop range(0 to 200000000)
 mov edx,j
 do_it_again: //begin to loop
 mov eax,a //basic swap steps between cpu and mem(cache)
 mov ebx,b 
 mov b,eax 
 mov a,ebx //four instructions total
 inc ecx // j++
 cmp ecx,edx //i<200000000 ?
 jb do_it_again // end of loop block
 }
 }
};
struct add_sub {
 void operator()() { 
 int j=loops;//setting the count
 __asm {
 mov ecx,0 //init of loop range(0 to 200000000)
 mov edx,j
 mov eax,a //getting variables to registers
 mov ebx,b
 do_it_again2: //begin to loop
 //swapping with using only 2 variables(only in cpu)
 sub eax,ebx //a is now a-b
 add ebx,eax //b is now a
 sub eax,ebx //a is now -b
 xor eax,80000000h //a is now b and four instructions total
 inc ecx // j++
 cmp ecx,edx //i<200000000 ?
 jb do_it_again2 // end of loop block
 mov a, eax
 mov b, ebx
 }
 }
};
struct xchg {
 void operator()() {
 __asm {
 mov ecx, loops
 mov eax, a
 mov ebx, b
 do_it_again3:
 dec ecx
 xchg eax, ebx
 jne do_it_again3
 mov a, eax
 mov b, ebx
 }
 }
};
struct xor3 {
 void operator()() { 
 _asm { 
 mov ecx, loops
 mov eax, a
 mov edx, b
 do_swap4:
 xor eax, edx
 xor edx, eax
 xor eax, edx
 dec ecx
 jnz do_swap4
 mov a, eax
 mov b, edx
 }
 }
};
int main() {
 timer<for_loop>("for loop");
 timer<while_loop>("while loop");
 timer<reg_mem>("reg<->mem");
 timer<add_sub>("add/sub");
 timer<xchg>("xchg");
 timer<xor3>("triple xor");
 return 0;
}

Bottom line: at least for this trivial of a task, you're not going to beat a decent compiler by enough to care about (and probably not at all, except possibly in terms of minutely smaller code).

Question 13

then i will switch to vc++ 10. what is its size? several gigabytes or some megabytes?

Question 14

@tuğrulbüyükışık: offhand I don't remember it's size, but at least you can download it for free.

Question 15

vc++10 is 146 mb download but over 2 GB when extracts

Question 16

It's likely due the fact that the compiler fails to make it register-operands, working on indirect (address) operands instead.

Switch compilers <-- this is your best optimization.

Update I have gone through the trouble of translating the the same program gcc intel inline assembly: test.c . It clearly shows how the for-loop and and-while loop are vastly superior to the handwritten assembly.

That said, with Digital Mars, the following is faster:

__asm
{
 xor ecx,j //init of loop range(200000000 to 0)
 mov eax,a //getting variables to registers
 mov ebx,b
do_it_again3: //begin to loop
 //swapping with xor idiom
 xor eax,ebx
 xor ebx,eax 
 xor eax,ebx 
 mov a,eax
 mov b,ebx
 dec ecx // j--
 jnz do_it_again3 // end of loop block
}

using

the XOR swap idiom
descending loop
implicit comparison flags (with dec ecx)

My benchmark with Digital Mars Compiler Version 8.42n results in:

time of for-loop(cycles) 572 
time of while-loop(cycles) 566 
time of custom-loop-1(cycles) 355 
time of custom-loop-2(cycles) 317 
time of custom-loop-3(cycles) 234

Full listing:

#include<stdio.h>
#include<stdlib.h>
#include<time.h>
int main()
{
 int j=0;
 int a=0,b=0,temp=0;
 srand(time(0));
 time_t t1=0;
 time_t t2=0;
 t1=clock();
 for(int i=0; i<200000000; i++)
 {
 temp=a;//instruction 1
 a=b;//instruction 2
 b=temp;//3 instructions total
 }
 t2=clock();
 printf("\n time of for-loop(cycles) %i \n",(t2-t1));
 t1=clock();
 while(j<200000000)
 {
 temp=a;//again it is three instructions
 a=b;
 b=temp;
 j++;
 }
 t2=clock();
 printf("\n time of while-loop(cycles) %i \n",(t2-t1));
 t1=clock();
 j=200000000;//setting the count
 __asm
 {
 pushf //backup
 push eax //backup
 push ebx //backup
 push ecx //backup
 push edx //backup
 mov ecx,0 //init of loop range(0 to 200000000)
 mov edx,j
 do_it_again: //begin to loop
 mov eax,a //basic swap steps between cpu and mem(cache)
 mov ebx,b
 mov b,eax
 mov a,ebx //four instructions total
 inc ecx // j++
 cmp ecx,edx //i<200000000 ?
 jb do_it_again // end of loop block
 pop edx //rolling back to history
 pop ecx
 pop ebx
 pop eax
 popf
 }
 t2=clock();
 printf("\n time of custom-loop-1(cycles) %i \n",(t2-t1));
 t1=clock();
 j=200000000;//setting the count
 __asm
 {
 pushf //backup
 push eax 
 push ebx 
 push ecx 
 push edx 
 mov ecx,0 //init of loop range(0 to 200000000)
 mov edx,j
 mov eax,a //getting variables to registers
 mov ebx,b
 do_it_again2: //begin to loop
 //swapping with using only 2 variables(only in cpu)
 sub eax,ebx //a is now a-b
 add ebx,eax //b is now a
 sub eax,ebx //a is now -b
 xor eax,80000000h //a is now b and four instructions total
 inc ecx // j++
 cmp ecx,edx //i<200000000 ?
 jb do_it_again2 // end of loop block
 pop edx //rollback
 pop ecx 
 pop ebx 
 pop eax 
 popf 
 }
 t2=clock();
 printf("\n time of custom-loop-2(cycles) %i \n",(t2-t1));
 t1=clock();
 j=200000000;//setting the count
 __asm
 {
 xor ecx,j //init of loop range(200000000 to 0)
 mov eax,a //getting variables to registers
 mov ebx,b
 do_it_again3: //begin to loop
 //swapping with using only 2 variables(only in cpu)
 xor eax,ebx
 xor ebx,eax 
 xor eax,ebx 
 mov a,eax
 mov b,ebx
 dec ecx // j--
 jnz do_it_again3 // end of loop block
 }
 t2=clock();
 printf("\n time of custom-loop-3(cycles) %i \n",(t2-t1));
 return 0;
}

Question 17

Just use gcc. Here is the same program translated to gcc intel inline assembly: gist.github.com/3131989#file_test.c It clearly shows how the for-loop and and-while loop are vastly superior to the handwritten assembly

Question 18

@tuğrulbüyükışık You can safely assume about 90% of all CPUs run on ~3GHz as it has been the manufacturing limit for quite a while now (excluding mobile CPUs)

Question 19

i am in that %10 f<3GHz so %90 threw their 2GHz to trash bin?

Question 20

Woha whatever that compiler is doing, that's quite horrendous. The xor idiom was nice ten years ago, with the pipelining today all those dependencies should be much slower than the straight forward code with register renaming.

Question 21

I'm surprised that any of you guys got anything but zero cycles from the C code. Here, with gcc 4.6.3 and -O2, the loop vanishes away as there is no side-effect from it. Everything except the asm block is removed. I would be surprised if Digital Mars can't do such a trivial optimization; I bet you can try different optimization switches that will remove the C code, at which point such trivial comparison becomes impossible.

Your toy example is useless to compare compiler optimizations with hand-crafted assembly. Statistically speaking, compilers can consistently write better machine code than humans.

Question 22

Up-voted despite condescending tone, because this is a valid point. Have several times encountered this when toying with some benchmarking, GCC notices that code in fact has no side-effect and the fastest code is code not run.

Question 23

That's normal and changing the compiler wouldn't solve this "problem". Assembler is extremely low-level and you have control of everything. Your C++ compiler always does more than it needs. Calling a function would take more time than it would take in assembly, because the compiler protects the stack (for example). And in loop that's the same thing: Declare a new variable takes more time, add values also etc...

This question should be interesting for some more information: When is assembler faster than C?

Question 24

A straight C++ -> assembly translation actually does do far more than it strictly needs to, but because of that fact, it's pretty much only done in debug builds. For release, optimizations are almost invariably turned on...and C++ compilers typically optimize very, very well. Results can vary, but assembly language is not inherently faster just because it's low-level -- even somewhat-decent-but-not-optimal assembly language will often be slower than C++ that's been fully optimized by a good compiler.

Daniel Kamil Kozar 19.5k5 gold badges53 silver badges70 bronze badges · Accepted Answer · 2012-07-17 20:29:12Z

The code generated by that compiler is pretty horrible. After disassembling the object file with objconv, here's what I got in regards to the first for loop.

?_001: cmp dword [ebp-4H], 200000000 ; 0053 _ 81. 7D, FC, 0BEBC200
 jge ?_002 ; 005A _ 7D, 17
 inc dword [ebp-4H] ; 005C _ FF. 45, FC
 mov eax, dword [ebp-18H] ; 005F _ 8B. 45, E8
 mov dword [ebp-10H], eax ; 0062 _ 89. 45, F0
 mov eax, dword [ebp-14H] ; 0065 _ 8B. 45, EC
 mov dword [ebp-18H], eax ; 0068 _ 89. 45, E8
 mov eax, dword [ebp-10H] ; 006B _ 8B. 45, F0
 mov dword [ebp-14H], eax ; 006E _ 89. 45, EC
 jmp ?_001 ; 0071 _ EB, E0

The issues should be clear to anybody who ever looked at some assembly.

The loop is very tightly dependent on the value that is put in eax. This makes any out-of-order execution practically impossible due to dependencies created on that register by every next instruction.
There are six general-purpose registers available (since ebp and esp aren't really general-purpose in most of the setups), but your compiler uses none of them, falling back to using the local stack. This is absolutely unacceptable when speed is the optimization goal. We can even see that the current loop index is stored at [ebp-4H], while it could've been easily stored in a register.
The cmp instruction uses a memory and an immediate operand. This is the slowest possible mix of operands and should never be used when performance is at stake.
And don't get me started on the code size. Half of those instructions are just unnecessary.

All in all, the first thing I'd do is ditch that compiler at the earliest possible chance. But then again, seeing that it offers "memory models" as one of its options, one can't really seem to have much hope.

is accessing to memory points which are not a multiple of 4, slower?
Yes, alignment can be an issue - but in this case, the problem lies in a crappy compiler which doesn't seem to know any other registers than the accumulator.
i have pentium m of a laptop. how many registers can i have? ax bx cx dx ex fx? or ei bi ci di like?
Please read the Intel Basic Architecture Manual before running your compiler/assembler again.

CollectivesTM on Stack Overflow

reason why custom loop is faster? bad compiler? unsafe custom code? luck?(lucky cache hits)

5 Answers 5

4 Comments

3 Comments

4 Comments

1 Comment

1 Comment

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

CollectivesTM on Stack Overflow

5 Answers 5

4 Comments

3 Comments

4 Comments

1 Comment

1 Comment

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related