Integer square root in x86 assembly (NASM)

Question 1

This program calculates the square root of an unsigned 32-bit number with some bit fiddling to get a very close approximation, then one iteration of the Babylonian method to get an exact value. I don't know if this could be faster, but hope it can.

I have trouble with clearly commenting my code, so I will gladly explain unclear portions of code.

[global sqrti]
sqrti:
 mov eax, [esp+4]
 xor ecx, ecx ; clear register to put our approximation in
 ; make the lower 16 bits of CX equal the even-numbered bits of EAX
 shr eax, 3
 rcr cx, 1
%rep 14
 shr eax, 2
 rcr cx, 1
%endrep
 ; do one iteration of the Babylonian method
 or ecx, 1 ; prevent a division by zero
 mov eax, [esp+4]
 xor edx, edx ; clear top half of DIV's implied operand
 div ecx
 add eax, ecx
 shr eax, 1
 ret

Question 2

The first thing which strikes me as odd is how you calculate your initial estimation. If you only take every 2nd bit into account that means a value like 0b10101010101010 would yield a 0 as your initial estimation (or 1 after the or ecx, 1 adjustment) - not very good. Also it's not clear to me why you skip the first 3 bits, that makes even your best estimates off by a factor of 2.

But I can see what you try to do: just take the "half" of the bits of the initial value. I think you can fix that by first or-ing the even with the uneven bits together and then apply the shifting and roring:

 mov eax, [esp+4]
 test eax, eax,
 jnz positive
 ret ; Return immediately when input is 0
positive:
 mov ecx, eax
 shl ecx
 or eax, ecx
 bsr ecx, eax ; Get highest bit for later adjustment
 xor edx, edx
loop:
 shr eax, 2
 rcr dx
 jnz loop
 shr cl ; Adjust for skipped bits
 inc cl
 rol dx, cl
 mov ecx, edx ; Use ecx to store estimation

Note that shr modifies the zero flag, but rcr doesn't so you can jump out of the loop as soon as eax turns out to be 0.

Furthermore note that 1 is an implied parameter for shifting operations if none is given. So you can safe a byte each time you leave it out (and possibly a cycle).

The other thing which is strange is that you only do one iteration. You should repeat it till your estimated value varies only by 0 or 1 between iterations, otherwise it's hard to see how your sqrti function will be useful.

And I don't think you need the or ecx, 1 statement. Just make sure to return immediately if [esp+4] turns out to be 0. And if its positive your ecx register should never become 0 if you get the initial estimation right.

Question 3

Thanks for your answer, David. Looking back, it does seem extremely odd for me to initially shift EAX right by 3. Looking at that number you've given, that could be very bad. Another possibility could be an AND with 0x55555555, shifting that right, and adding it to the original value. I only did one iteration because, in my trials, the estimate was usually off by only a few bits in the last place. In retrospect, I should probably check better.

Question 4

I don't think adding would be good, because that might also zero out the even bits. I noticed however that my solution also has problems, because by short cutting the loop the inital cx value might be too high... Maybe you have to use the fixed rep after all, but I'm thinking about a better solution.

Question 5

Ok, I think I found a solution: bsf to the rescue (that finds the highest set bit and is available since 386). You can use that to adjust the final estimation. I edited my answer accordingly. Note that I had to replace ecx with edx in the loop, because shift operations only allow cl as a register parameter.

Question 6

@DavidOngaro: I believe you want bsr. bsf finds the least significant bit that's set, not the most significant bit.

Question 7

I suppose the first question to consider here is how accurate of an estimate of the square root you need/want. As it stands right now, your routine is quite fast, but accuracy is quite poor.

I did a quick test summing the computed square roots of a sample of numbers and compared the result against that from the sqrt in the standard library:

for (int i = 0; i < 0x2fffff; i += 13)
 sum += f(i);

The results I got were:

fsqrt: 285997356
sqrti: 3567533576

Your result is wrong by factor of about 12. Since this is the sum, your individual results may even be further off than that.

I have a difficult time imagining an application for which this level of (in)accuracy would be acceptable, but if it's even close to acceptable, I'd use an even (much) simpler approximation:

approx proc num:dword
 mov eax, num
 bsr ecx, eax
 mov eax, 1
 shr ecx, 1
 shl eax, cl
 ret
approx endp

This produces a result very quickly, and it's at least sort of in the right general range, anyway:

fsqrt: 285997356
approx: 200588695

That's obviously still not very accurate, but at least it's sort of close. If we want to improve that a little, we can add one more iteration of the same basic idea: remove the MSB that's set in the input, approximate the square root of what's left, and add that to our result:

approx proc num:dword
 mov eax, num
 bsr ecx, eax
 mov ebx, 1
 push ecx
 shr ecx, 1
 shl ebx, cl
 mov esi, 1
 pop ecx
 shl esi, cl
 sub eax, esi
 bsr ecx, eax
 shr ecx, 1
 mov edi, 1
 shl edi, cl
 or ebx, edi
 mov eax, ebx
 ret
approx endp

For the test run, the result now looks like this:

fsqrt: 285997356
approx: 281530962

This obviously isn't very accurate yet, but at least I can imagine cases where it might easily be accurate enough, and it is pretty fast. If it's not accurate enough for your purposes, the next step is probably to just use fsqrt. Although (at least by my timing) this routine is faster than fsqrt, it already takes about two thirds the time that fsqrt does, so adding even one more iteration would put it about even with fsqrt.

If you want a routine that's accurate and nearly competitive with fsqrt, you might consider the binary reducing algorithm instead:

isqrt proc num:dword
 mov eax, num
 xor ebx, ebx
 bsr ecx, eax
 and cl, 0feh
 mov edx, 1
 shl edx, cl
refine:
 mov esi, ebx
 add esi, edx
 cmp esi, eax
 ja @f
 sub eax, esi
 shr ebx, 1
 add ebx, edx
 jmp next
@@:
 shr ebx, 1
next :
 shr edx, 2
 jnz refine
 mov eax, ebx
 ret
isqrt endp

Depending upon circumstances, calling conventions, etc., this can be marginally faster than the sqrt in the C standard library (but it's often a little slower), and at least in my testing it produces identical results:

fsqrt: 285997356
isqrt: 285997356

If you want to improve speed and can tolerate some inaccuracy, you could probably do a bit to loosen the exit condition from that, and come up with something that produced reasonably accurate results a little more quickly. I'll leave that as the dreaded "exercise for the reader."

Rereading, I notice what I've written above isn't quite accurate: the references to fsqrt aren't actually comparing to using the fsqrt instruction directly--they're to calling the sqrt in the C standard library. I haven't looked though--it's possible the compiler is generating code for that inline, so the comparison really is to the fsqrt instruction without little (or nothing) else involved.

Question 8

I made a few experiments and tested them by timing one billion calculations of sqrt(400000000). With your implementation, it takes 12 seconds on my PC.

I had a few ideas, but none of the simple ones improves the run time.

It would be possible to avoid loading the parameter from memory twice (mov eax, [esp+4]) by storing it into ebx, but this doesn't change anything.

Another attempt I made was to split the bits extraction part into two: copy eax into ebx, shift ebx by 2, and then, extract one bit from eax, shift by 4, extract one bit from ebx, shift by 4, etc. %rep 14 becomes %rep 7, of course. I thought this might make room for some parallel execution of the two halves, but somehow, it didn't (maybe because of the contention on cx ?).

Now, for something that actually works :-) Are you open to the modern additions to the x86 instruction set ? Haswell processors support the BMI2 instruction set, part of AVX2, which contains a neat new instruction named PEXT, for parallel extract. This has been available for some time on MMX registers via SSE4 (introduced on Nehalem), now it has come to regular registers.

PEXT dest, src, mask extracts bits from src, keeping those that are set in mask, and puts the result in dest. In the present case, this part:

; make the lower 16 bits of CX equal the even-numbered bits of EAX
 shr eax, 3
 rcr cx, 1
%rep 14
 shr eax, 2
 rcr cx, 1
%endrep

becomes

mov edx, 0x55555555 ; odd-numbered bits = 0, even-numbered bits = 1
pext ecx, eax, edx ; extract the even-numbered bits of eax into ecx

So, what does it change ? Now, the billion calculations take 4,7 seconds, so I guess this speaks for itself. Of course, the program now requires a Haswell processor.

Question 9

My processor isn't a Haswell, so I couldn't try the PEXT instruction. However, I found the SSE4 instruction you were talking about: EXTRQ. It does the same thing, but with an XMM register. I'll see if I can use that instead.

David Ongaro David Ongaro 5394 silver badges15 bronze badges · Answer 1 · 2014-05-03 09:31:12Z

The first thing which strikes me as odd is how you calculate your initial estimation. If you only take every 2nd bit into account that means a value like 0b10101010101010 would yield a 0 as your initial estimation (or 1 after the or ecx, 1 adjustment) - not very good. Also it's not clear to me why you skip the first 3 bits, that makes even your best estimates off by a factor of 2.

But I can see what you try to do: just take the "half" of the bits of the initial value. I think you can fix that by first or-ing the even with the uneven bits together and then apply the shifting and roring:

 mov eax, [esp+4]
 test eax, eax,
 jnz positive
 ret ; Return immediately when input is 0
positive:
 mov ecx, eax
 shl ecx
 or eax, ecx
 bsr ecx, eax ; Get highest bit for later adjustment
 xor edx, edx
loop:
 shr eax, 2
 rcr dx
 jnz loop
 shr cl ; Adjust for skipped bits
 inc cl
 rol dx, cl
 mov ecx, edx ; Use ecx to store estimation

Note that shr modifies the zero flag, but rcr doesn't so you can jump out of the loop as soon as eax turns out to be 0.

Furthermore note that 1 is an implied parameter for shifting operations if none is given. So you can safe a byte each time you leave it out (and possibly a cycle).

The other thing which is strange is that you only do one iteration. You should repeat it till your estimated value varies only by 0 or 1 between iterations, otherwise it's hard to see how your sqrti function will be useful.

And I don't think you need the or ecx, 1 statement. Just make sure to return immediately if [esp+4] turns out to be 0. And if its positive your ecx register should never become 0 if you get the initial estimation right.

Thanks for your answer, David. Looking back, it does seem extremely odd for me to initially shift EAX right by 3. Looking at that number you've given, that could be very bad. Another possibility could be an AND with 0x55555555, shifting that right, and adding it to the original value. I only did one iteration because, in my trials, the estimate was usually off by only a few bits in the last place. In retrospect, I should probably check better.
I don't think adding would be good, because that might also zero out the even bits. I noticed however that my solution also has problems, because by short cutting the loop the inital cx value might be too high... Maybe you have to use the fixed rep after all, but I'm thinking about a better solution.
Ok, I think I found a solution: bsf to the rescue (that finds the highest set bit and is available since 386). You can use that to adjust the final estimation. I edited my answer accordingly. Note that I had to replace ecx with edx in the loop, because shift operations only allow cl as a register parameter.
@DavidOngaro: I believe you want bsr. bsf finds the least significant bit that's set, not the most significant bit.

Jerry Coffin Jerry Coffin 34.1k4 gold badges77 silver badges145 bronze badges · Answer 2 · 2014-05-03 23:59:04Z

I suppose the first question to consider here is how accurate of an estimate of the square root you need/want. As it stands right now, your routine is quite fast, but accuracy is quite poor.

I did a quick test summing the computed square roots of a sample of numbers and compared the result against that from the sqrt in the standard library:

for (int i = 0; i < 0x2fffff; i += 13)
 sum += f(i);

The results I got were:

fsqrt: 285997356
sqrti: 3567533576

Your result is wrong by factor of about 12. Since this is the sum, your individual results may even be further off than that.

I have a difficult time imagining an application for which this level of (in)accuracy would be acceptable, but if it's even close to acceptable, I'd use an even (much) simpler approximation:

approx proc num:dword
 mov eax, num
 bsr ecx, eax
 mov eax, 1
 shr ecx, 1
 shl eax, cl
 ret
approx endp

This produces a result very quickly, and it's at least sort of in the right general range, anyway:

fsqrt: 285997356
approx: 200588695

That's obviously still not very accurate, but at least it's sort of close. If we want to improve that a little, we can add one more iteration of the same basic idea: remove the MSB that's set in the input, approximate the square root of what's left, and add that to our result:

approx proc num:dword
 mov eax, num
 bsr ecx, eax
 mov ebx, 1
 push ecx
 shr ecx, 1
 shl ebx, cl
 mov esi, 1
 pop ecx
 shl esi, cl
 sub eax, esi
 bsr ecx, eax
 shr ecx, 1
 mov edi, 1
 shl edi, cl
 or ebx, edi
 mov eax, ebx
 ret
approx endp

For the test run, the result now looks like this:

fsqrt: 285997356
approx: 281530962

This obviously isn't very accurate yet, but at least I can imagine cases where it might easily be accurate enough, and it is pretty fast. If it's not accurate enough for your purposes, the next step is probably to just use fsqrt. Although (at least by my timing) this routine is faster than fsqrt, it already takes about two thirds the time that fsqrt does, so adding even one more iteration would put it about even with fsqrt.

If you want a routine that's accurate and nearly competitive with fsqrt, you might consider the binary reducing algorithm instead:

isqrt proc num:dword
 mov eax, num
 xor ebx, ebx
 bsr ecx, eax
 and cl, 0feh
 mov edx, 1
 shl edx, cl
refine:
 mov esi, ebx
 add esi, edx
 cmp esi, eax
 ja @f
 sub eax, esi
 shr ebx, 1
 add ebx, edx
 jmp next
@@:
 shr ebx, 1
next :
 shr edx, 2
 jnz refine
 mov eax, ebx
 ret
isqrt endp

Depending upon circumstances, calling conventions, etc., this can be marginally faster than the sqrt in the C standard library (but it's often a little slower), and at least in my testing it produces identical results:

fsqrt: 285997356
isqrt: 285997356

If you want to improve speed and can tolerate some inaccuracy, you could probably do a bit to loosen the exit condition from that, and come up with something that produced reasonably accurate results a little more quickly. I'll leave that as the dreaded "exercise for the reader."

Rereading, I notice what I've written above isn't quite accurate: the references to fsqrt aren't actually comparing to using the fsqrt instruction directly--they're to calling the sqrt in the C standard library. I haven't looked though--it's possible the compiler is generating code for that inline, so the comparison really is to the fsqrt instruction without little (or nothing) else involved.

Zoyd Zoyd 4573 silver badges9 bronze badges · Answer 3 · 2014-05-03 09:45:28Z

I made a few experiments and tested them by timing one billion calculations of sqrt(400000000). With your implementation, it takes 12 seconds on my PC.

I had a few ideas, but none of the simple ones improves the run time.

It would be possible to avoid loading the parameter from memory twice (mov eax, [esp+4]) by storing it into ebx, but this doesn't change anything.

Another attempt I made was to split the bits extraction part into two: copy eax into ebx, shift ebx by 2, and then, extract one bit from eax, shift by 4, extract one bit from ebx, shift by 4, etc. %rep 14 becomes %rep 7, of course. I thought this might make room for some parallel execution of the two halves, but somehow, it didn't (maybe because of the contention on cx ?).

Now, for something that actually works :-) Are you open to the modern additions to the x86 instruction set ? Haswell processors support the BMI2 instruction set, part of AVX2, which contains a neat new instruction named PEXT, for parallel extract. This has been available for some time on MMX registers via SSE4 (introduced on Nehalem), now it has come to regular registers.

PEXT dest, src, mask extracts bits from src, keeping those that are set in mask, and puts the result in dest. In the present case, this part:

; make the lower 16 bits of CX equal the even-numbered bits of EAX
 shr eax, 3
 rcr cx, 1
%rep 14
 shr eax, 2
 rcr cx, 1
%endrep

becomes

mov edx, 0x55555555 ; odd-numbered bits = 0, even-numbered bits = 1
pext ecx, eax, edx ; extract the even-numbered bits of eax into ecx

So, what does it change ? Now, the billion calculations take 4,7 seconds, so I guess this speaks for itself. Of course, the program now requires a Haswell processor.

My processor isn't a Haswell, so I couldn't try the PEXT instruction. However, I found the SSE4 instruction you were talking about: EXTRQ. It does the same thing, but with an XMM register. I'll see if I can use that instead.

Stack Exchange Network

Integer square root in x86 assembly (NASM)

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Integer square root in x86 assembly (NASM)

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions