Calculating the harmonic average

Question 1

In this program I was required to calculate the Harmonic Average using an assembly program with a C driver.

Harmonic mean is defined as:

$$ \frac{n}{\dfrac{1}{x_1} + \dfrac{1}{x_2} + \cdots + \dfrac{1}{x_n}}$$

where \$n = \$ number of inputs and \$x_i = \$ a real number.

I would like to know if:

My code follows common practices.
My code is formatted correctly.
If comments need to be improved.
My code is straightforward; that is, I have achieved the goal with the least amount of work.
There any improvements I should make in my code.

Attached below is my program in Assembly:

;Assembly function that computs the harmonic mean
;of an array of 64-bit floating-point numbers.
;Retrieves input using a C program.
;
;Harmonic mean is defined as Sum(n/((1/x1) + (1/x2) + ... + (1/xn)))
;
; expects:
; RDI - address of array
; RSI - length of the array
; returns
; XMMO - the harmonic average of array's values
global harmonicMean
section .data
 Zero dd 0.0
 One dd 1.0
section .text
 harmonicMean:
 push rbp
 mov rbp, rsp ;C prologue
 movss xmm10, [Zero] ;Holds tally of denominator
 cvtsi2ss xmm0, rsi ;Take length and put it into xmm0 register
 .whileLoop:
 cmp rsi, 0 ;Is the length of array 0?
 je .endwhile
 call addDen ;Compute a denominator value and add it to sum
 add rdi, 4 ;Add size of float to address
 dec rsi ;Decrease the length
 jmp .whileLoop
 .endwhile:
 divss xmm0, xmm10
 leave
 ret
 ;Calculates a number in the denominator
 addDen:
 push rdi
 movss xmm8, [One]
 movss xmm9, [rdi]
 divss xmm8, xmm9
 addss xmm10, xmm8
 pop rdi
 ret

Question 2

Can you use SSE2 instructions or only SSE?

Question 3

I am fairly new to Assembly. I have necer heard of SSE2 so I'm going to assume only SSE?

Question 4

We are were using linux 64-bit machines if that's of any use.

Question 5

Here are some things that may help you improve your code.

Omit C prolog and epilog if possible

The part you have correctly labeled "C prologue" in your code is used by the compiler to be able to access local variables. However both inputs and outputs to this assembly routine are in registers, so no stack manipulations are needed. For that reason, we can eliminate both in this code.

Don't `push`/`pop` registers you don't alter

In the addDen routine, the rdi register is pushed and then popped off the stack, but not altered by the routine. As a result, neither push nor pop are required and both could be omitted.

Avoid subroutine calls

Subroutine calls are not good for fast assembly code. Your addDen code doesn't need to be a subroutine -- it can be placed inline, avoiding the overhead of a subroutine call.

Reduce jumps per loop iteration

The code as currently structured has both a comparison and conditional jump and also an unconditional jmp .whileLoop that both occur within each loop iteration. Better practice is to minimize the number of jumps within each loop; ideally one should strive to reduce it to exactly one.

Use direct memory access for instructions

The current code uses this instruction sequence

 movss xmm8, [One]
 movss xmm9, [rdi]
 divss xmm8, xmm9

However, the divss instruction can use memory directly as a source, so we can shorten this to:

 movss xmm8, [One]
 divss xmm8, [rdi]

This saves us a register (xmm9) and some time.

Avoid compare instructions if practical

The current code uses cmp rsi, 0 but that's not really needed since the dec esi at the end of the loop already sets up the zero flag. Putting all of the previous hints together, a revised routine looks like this:

section .data
 harmonicMean2:
 cvtsi2ss xmm0, rsi ; xmm0 = len = numerator
 movss xmm10, [Zero] ; xmm10 = denominator
 inc rsi
 jmp .testzero
 .more:
 movss xmm8, [One]
 divss xmm8, [rdi]
 add rdi, 4
 addss xmm10, xmm8
 .testzero:
 dec rsi
 jnz .more
 divss xmm0, xmm10
 ret

Take advantage of SIMD where practical

SIMD stands for Single Instruction, Multiple Data which allows the CPU to execute multiple operations in parallel, speeding up the routine. Because your code calculate reciprocals of each data item in the list, we can do that calculation in parallel (4 at a time using SSE) and save some time. If there aren't 4 data items, we can drop back to a simple serial approach as with what you've already written. Here's one way to write such a routine:

section .data
 One dd 1.0, 1.0, 1.0, 1.0
section .data
 harmonicMeanP:
 cvtsi2ss xmm0, rsi ; xmm0 = len = numerator
 xorps xmm10, xmm10 ; xmm10 = denominator = 0
 mov rax, rsi
 shr rax, 2
 jz .single
 movups xmm9, [One] ; Loads series of 1s in 
 .doquad:
 movaps xmm8, xmm9 ; copy those ones
 divps xmm8, [rdi] ; fetch 4 floats
 add rdi, 16
 addps xmm10, xmm8
 dec rax
 jnz .doquad
 movaps xmm8, xmm10 ; copy xmm10
 shufps xmm8, xmm10, 04ドルE ; swap low and high halves
 addps xmm8, xmm10 ; sum now in xmm8
 movaps xmm10, xmm8 ; copy xmm8
 shufps xmm10, xmm8, 0ドルB1 ; swap pairs into xmm10
 addps xmm10, xmm8 ; now have 4 copies of sum in xmm10
 and rsi, 3 ; less than 4 left
 .single:
 inc rsi
 jmp .testzero
 .more:
 movss xmm8, [One]
 divss xmm8, [rdi]
 add rdi, 4
 addss xmm10, xmm8
 .testzero:
 dec rsi
 jnz .more
 divss xmm0, xmm10
 ret

Test your results

With assembly language code like this, as with much code, the two things one generally looks for are correctness and speed.

Correctness

Although the algorithm is essentially identical in all versions, there is subtle difference that is inherent with the addition of floating point numbers. Specifically, with floating point numbers, adding a large number to a small number can lead to loss of precision. In extreme cases, if a is very large and b is very small, we can have a + b = a which is obviously incorrect for non-zero values of b. (See the oft-cited and still excellent What Every Computer Scientist Should Know About Floating-Point Arithmetic, by David Goldberg for a readable technical discussion for why this is so.) This can affect the harmonic mean calculations. For example, if we calculate the harmonic mean of the numbers \1ドル \cdots 100000\,ドル in that order, the correct value is approximately 8271.198621. However, the original code reports 8270.716797 which is off by a bit. The reason is that by the time we get to 100000, we're adding a very small number (1e-5) to a relatively much bigger number of about 12.09. One can use doubles instead of floats here, but while that helps, it only postpones the inevitable when dealing with floating point addition. Numerical analysis is important with these kinds of calculations to assure that the answers are as accurate as they need to be.

Speed

The desire for speed can sometimes conflict with the other goal of correctness, but in this case, we can serve both interests. When the calculation is parallelized, the effect is that we calculate 4 reciprocals and add 4 results at a time to 4 separate sums. Because the SSE instructions take about the same amount of time to do one division as to do four in parallel (or 8 in parallel if you can use SSE2 or newer) the results are obtained about 4x as quickly (or 8x with SSE2). This is where the real speed dividends are, and there's also potentially a small increase in accuracy because each sum is only (roughly) 25% of the total sum (given randomly distributed numbers as input), the loss of precision is slightly reduced.
Here is the C++ program I used to test all variations, including a pure C++ variant that uses doubles for calculation instead of float.

drivemean.cpp

The following code was used to evaluate the various versions for both speed and accuracy. The stopwatch.h code from this question and was designed for exactly this kind of purpose.

 #include <algorithm>
 #include <iostream>
 #include <iomanip>
 #include "stopwatch.h"
 extern "C" float harmonicMean(float *arr, unsigned len);
 extern "C" float harmonicMean2(float *arr, unsigned len);
 extern "C" float harmonicMeanP(float *arr, unsigned len);
 float hmean(float *arr, unsigned len)
 {
 double sum = 0;
 double n = len;
 for ( ; len; --len, ++arr)
 sum += 1.0/(*arr);
 return n/sum;
 }
 struct MeanTest
 {
 float (*calc)(float *, unsigned);
 std::string name;
 };
 static MeanTest test[]{ 
 {harmonicMean, "original" },
 {harmonicMean2, "tweaked" },
 {harmonicMeanP, "parallel" },
 {hmean, "C++ double" },
 {nullptr, ""},
 };
 int main()
 {
 const unsigned arrlen{100000};
 const unsigned iterations{1000};
 float nums[arrlen];
 float hm;
 std::iota(nums, nums+arrlen, 1);
 Stopwatch<std::chrono::milliseconds> sw;
 for (MeanTest *t = test; t->calc != nullptr; ++t) {
 sw.start();
 for (unsigned i = iterations; i; --i)
 hm = t->calc(nums, arrlen);
 sw.stop();
 std::cout << std::setprecision(10) << hm << ", " 
 << t->name << " time = " << sw.elapsed() << " ms\n";
 }
 }

Results

 8270.716797, original time = 496 ms
 8270.716797, tweaked time = 431 ms
 8271.209961, parallel time = 108 ms
 8271.198242, C++ double time = 675 ms

As you can see, the C++ version was slowest, but most accurate (only off by about 0.0004) but the parallel version was second most accurate but 4x faster than the next fastest.

Using SSE2 instructions would allow one to approach 8x faster and would also allow for some simplification in the use of the shufps instruction. I'll leave that fun for you.

Question 6

This is a well put together answer and I greatly appreciate it. I have put most of your suggestions into my program (aside from SIMD usage) and will keep your suggestions in mind in any following Assembly programs! I thank you again for your input.

Edward Edward 67.2k4 gold badges120 silver badges284 bronze badges · Accepted Answer · 2014-12-04 16:14:13Z

Here are some things that may help you improve your code.

Omit C prolog and epilog if possible

The part you have correctly labeled "C prologue" in your code is used by the compiler to be able to access local variables. However both inputs and outputs to this assembly routine are in registers, so no stack manipulations are needed. For that reason, we can eliminate both in this code.

Don't `push`/`pop` registers you don't alter

In the addDen routine, the rdi register is pushed and then popped off the stack, but not altered by the routine. As a result, neither push nor pop are required and both could be omitted.

Avoid subroutine calls

Subroutine calls are not good for fast assembly code. Your addDen code doesn't need to be a subroutine -- it can be placed inline, avoiding the overhead of a subroutine call.

Reduce jumps per loop iteration

The code as currently structured has both a comparison and conditional jump and also an unconditional jmp .whileLoop that both occur within each loop iteration. Better practice is to minimize the number of jumps within each loop; ideally one should strive to reduce it to exactly one.

Use direct memory access for instructions

The current code uses this instruction sequence

 movss xmm8, [One]
 movss xmm9, [rdi]
 divss xmm8, xmm9

However, the divss instruction can use memory directly as a source, so we can shorten this to:

 movss xmm8, [One]
 divss xmm8, [rdi]

This saves us a register (xmm9) and some time.

Avoid compare instructions if practical

The current code uses cmp rsi, 0 but that's not really needed since the dec esi at the end of the loop already sets up the zero flag. Putting all of the previous hints together, a revised routine looks like this:

section .data
 harmonicMean2:
 cvtsi2ss xmm0, rsi ; xmm0 = len = numerator
 movss xmm10, [Zero] ; xmm10 = denominator
 inc rsi
 jmp .testzero
 .more:
 movss xmm8, [One]
 divss xmm8, [rdi]
 add rdi, 4
 addss xmm10, xmm8
 .testzero:
 dec rsi
 jnz .more
 divss xmm0, xmm10
 ret

Take advantage of SIMD where practical

SIMD stands for Single Instruction, Multiple Data which allows the CPU to execute multiple operations in parallel, speeding up the routine. Because your code calculate reciprocals of each data item in the list, we can do that calculation in parallel (4 at a time using SSE) and save some time. If there aren't 4 data items, we can drop back to a simple serial approach as with what you've already written. Here's one way to write such a routine:

section .data
 One dd 1.0, 1.0, 1.0, 1.0
section .data
 harmonicMeanP:
 cvtsi2ss xmm0, rsi ; xmm0 = len = numerator
 xorps xmm10, xmm10 ; xmm10 = denominator = 0
 mov rax, rsi
 shr rax, 2
 jz .single
 movups xmm9, [One] ; Loads series of 1s in 
 .doquad:
 movaps xmm8, xmm9 ; copy those ones
 divps xmm8, [rdi] ; fetch 4 floats
 add rdi, 16
 addps xmm10, xmm8
 dec rax
 jnz .doquad
 movaps xmm8, xmm10 ; copy xmm10
 shufps xmm8, xmm10, 04ドルE ; swap low and high halves
 addps xmm8, xmm10 ; sum now in xmm8
 movaps xmm10, xmm8 ; copy xmm8
 shufps xmm10, xmm8, 0ドルB1 ; swap pairs into xmm10
 addps xmm10, xmm8 ; now have 4 copies of sum in xmm10
 and rsi, 3 ; less than 4 left
 .single:
 inc rsi
 jmp .testzero
 .more:
 movss xmm8, [One]
 divss xmm8, [rdi]
 add rdi, 4
 addss xmm10, xmm8
 .testzero:
 dec rsi
 jnz .more
 divss xmm0, xmm10
 ret

Test your results

With assembly language code like this, as with much code, the two things one generally looks for are correctness and speed.

Correctness

Although the algorithm is essentially identical in all versions, there is subtle difference that is inherent with the addition of floating point numbers. Specifically, with floating point numbers, adding a large number to a small number can lead to loss of precision. In extreme cases, if a is very large and b is very small, we can have a + b = a which is obviously incorrect for non-zero values of b. (See the oft-cited and still excellent What Every Computer Scientist Should Know About Floating-Point Arithmetic, by David Goldberg for a readable technical discussion for why this is so.) This can affect the harmonic mean calculations. For example, if we calculate the harmonic mean of the numbers \1ドル \cdots 100000\,ドル in that order, the correct value is approximately 8271.198621. However, the original code reports 8270.716797 which is off by a bit. The reason is that by the time we get to 100000, we're adding a very small number (1e-5) to a relatively much bigger number of about 12.09. One can use doubles instead of floats here, but while that helps, it only postpones the inevitable when dealing with floating point addition. Numerical analysis is important with these kinds of calculations to assure that the answers are as accurate as they need to be.

Speed

The desire for speed can sometimes conflict with the other goal of correctness, but in this case, we can serve both interests. When the calculation is parallelized, the effect is that we calculate 4 reciprocals and add 4 results at a time to 4 separate sums. Because the SSE instructions take about the same amount of time to do one division as to do four in parallel (or 8 in parallel if you can use SSE2 or newer) the results are obtained about 4x as quickly (or 8x with SSE2). This is where the real speed dividends are, and there's also potentially a small increase in accuracy because each sum is only (roughly) 25% of the total sum (given randomly distributed numbers as input), the loss of precision is slightly reduced.
Here is the C++ program I used to test all variations, including a pure C++ variant that uses doubles for calculation instead of float.

drivemean.cpp

The following code was used to evaluate the various versions for both speed and accuracy. The stopwatch.h code from this question and was designed for exactly this kind of purpose.

 #include <algorithm>
 #include <iostream>
 #include <iomanip>
 #include "stopwatch.h"
 extern "C" float harmonicMean(float *arr, unsigned len);
 extern "C" float harmonicMean2(float *arr, unsigned len);
 extern "C" float harmonicMeanP(float *arr, unsigned len);
 float hmean(float *arr, unsigned len)
 {
 double sum = 0;
 double n = len;
 for ( ; len; --len, ++arr)
 sum += 1.0/(*arr);
 return n/sum;
 }
 struct MeanTest
 {
 float (*calc)(float *, unsigned);
 std::string name;
 };
 static MeanTest test[]{ 
 {harmonicMean, "original" },
 {harmonicMean2, "tweaked" },
 {harmonicMeanP, "parallel" },
 {hmean, "C++ double" },
 {nullptr, ""},
 };
 int main()
 {
 const unsigned arrlen{100000};
 const unsigned iterations{1000};
 float nums[arrlen];
 float hm;
 std::iota(nums, nums+arrlen, 1);
 Stopwatch<std::chrono::milliseconds> sw;
 for (MeanTest *t = test; t->calc != nullptr; ++t) {
 sw.start();
 for (unsigned i = iterations; i; --i)
 hm = t->calc(nums, arrlen);
 sw.stop();
 std::cout << std::setprecision(10) << hm << ", " 
 << t->name << " time = " << sw.elapsed() << " ms\n";
 }
 }

Results

 8270.716797, original time = 496 ms
 8270.716797, tweaked time = 431 ms
 8271.209961, parallel time = 108 ms
 8271.198242, C++ double time = 675 ms

As you can see, the C++ version was slowest, but most accurate (only off by about 0.0004) but the parallel version was second most accurate but 4x faster than the next fastest.

Using SSE2 instructions would allow one to approach 8x faster and would also allow for some simplification in the use of the shufps instruction. I'll leave that fun for you.

This is a well put together answer and I greatly appreciate it. I have put most of your suggestions into my program (aside from SIMD usage) and will keep your suggestions in mind in any following Assembly programs! I thank you again for your input.

Stack Exchange Network

Calculating the harmonic average

1 Answer 1

Omit C prolog and epilog if possible

Don't `push`/`pop` registers you don't alter

Avoid subroutine calls

Reduce jumps per loop iteration

Use direct memory access for instructions

Avoid compare instructions if practical

Take advantage of SIMD where practical

Test your results

Correctness

Speed

drivemean.cpp

Results

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Calculating the harmonic average

1 Answer 1

Omit C prolog and epilog if possible

Don't push/pop registers you don't alter

Avoid subroutine calls

Reduce jumps per loop iteration

Use direct memory access for instructions

Avoid compare instructions if practical

Take advantage of SIMD where practical

Test your results

Correctness

Speed

drivemean.cpp

Results

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions

Don't `push`/`pop` registers you don't alter