This program calculates the square root of an unsigned 32-bit number with some bit fiddling to get a very close approximation, then one iteration of the Babylonian method to get an exact value. I don't know if this could be faster, but hope it can.
I have trouble with clearly commenting my code, so I will gladly explain unclear portions of code.
[global sqrti]
sqrti:
mov eax, [esp+4]
xor ecx, ecx ; clear register to put our approximation in
; make the lower 16 bits of CX equal the even-numbered bits of EAX
shr eax, 3
rcr cx, 1
%rep 14
shr eax, 2
rcr cx, 1
%endrep
; do one iteration of the Babylonian method
or ecx, 1 ; prevent a division by zero
mov eax, [esp+4]
xor edx, edx ; clear top half of DIV's implied operand
div ecx
add eax, ecx
shr eax, 1
ret
3 Answers 3
The first thing which strikes me as odd is how you calculate your initial estimation. If you only take every 2nd bit into account that means a value like 0b10101010101010
would yield a 0
as your initial estimation (or 1
after the or ecx, 1
adjustment) - not very good. Also it's not clear to me why you skip the first 3 bits, that makes even your best estimates off by a factor of 2.
But I can see what you try to do: just take the "half" of the bits of the initial value. I think you can fix that by first or-ing the even with the uneven bits together and then apply the shifting and roring:
mov eax, [esp+4]
test eax, eax,
jnz positive
ret ; Return immediately when input is 0
positive:
mov ecx, eax
shl ecx
or eax, ecx
bsr ecx, eax ; Get highest bit for later adjustment
xor edx, edx
loop:
shr eax, 2
rcr dx
jnz loop
shr cl ; Adjust for skipped bits
inc cl
rol dx, cl
mov ecx, edx ; Use ecx to store estimation
Note that shr
modifies the zero flag, but rcr
doesn't so you can jump out of the loop as soon as eax
turns out to be 0.
Furthermore note that 1
is an implied parameter for shifting operations if none is given. So you can safe a byte each time you leave it out (and possibly a cycle).
The other thing which is strange is that you only do one iteration. You should repeat it till your estimated value varies only by 0 or 1 between iterations, otherwise it's hard to see how your sqrti function will be useful.
And I don't think you need the or ecx, 1
statement. Just make sure to return immediately if [esp+4]
turns out to be 0. And if its positive your ecx
register should never become 0 if you get the initial estimation right.
-
\$\begingroup\$ Thanks for your answer, David. Looking back, it does seem extremely odd for me to initially shift EAX right by 3. Looking at that number you've given, that could be very bad. Another possibility could be an AND with 0x55555555, shifting that right, and adding it to the original value. I only did one iteration because, in my trials, the estimate was usually off by only a few bits in the last place. In retrospect, I should probably check better. \$\endgroup\$Jintonation– Jintonation2014年05月03日 17:30:32 +00:00Commented May 3, 2014 at 17:30
-
\$\begingroup\$ I don't think adding would be good, because that might also zero out the even bits. I noticed however that my solution also has problems, because by short cutting the loop the inital cx value might be too high... Maybe you have to use the fixed rep after all, but I'm thinking about a better solution. \$\endgroup\$David Ongaro– David Ongaro2014年05月03日 18:34:30 +00:00Commented May 3, 2014 at 18:34
-
\$\begingroup\$ Ok, I think I found a solution:
bsf
to the rescue (that finds the highest set bit and is available since 386). You can use that to adjust the final estimation. I edited my answer accordingly. Note that I had to replaceecx
withedx
in the loop, because shift operations only allowcl
as a register parameter. \$\endgroup\$David Ongaro– David Ongaro2014年05月03日 19:24:43 +00:00Commented May 3, 2014 at 19:24 -
1\$\begingroup\$ @DavidOngaro: I believe you want
bsr
.bsf
finds the least significant bit that's set, not the most significant bit. \$\endgroup\$Jerry Coffin– Jerry Coffin2014年05月03日 22:52:09 +00:00Commented May 3, 2014 at 22:52
I suppose the first question to consider here is how accurate of an estimate of the square root you need/want. As it stands right now, your routine is quite fast, but accuracy is quite poor.
I did a quick test summing the computed square roots of a sample of numbers and compared the result against that from the sqrt
in the standard library:
for (int i = 0; i < 0x2fffff; i += 13)
sum += f(i);
The results I got were:
fsqrt: 285997356
sqrti: 3567533576
Your result is wrong by factor of about 12. Since this is the sum, your individual results may even be further off than that.
I have a difficult time imagining an application for which this level of (in)accuracy would be acceptable, but if it's even close to acceptable, I'd use an even (much) simpler approximation:
approx proc num:dword
mov eax, num
bsr ecx, eax
mov eax, 1
shr ecx, 1
shl eax, cl
ret
approx endp
This produces a result very quickly, and it's at least sort of in the right general range, anyway:
fsqrt: 285997356
approx: 200588695
That's obviously still not very accurate, but at least it's sort of close. If we want to improve that a little, we can add one more iteration of the same basic idea: remove the MSB that's set in the input, approximate the square root of what's left, and add that to our result:
approx proc num:dword
mov eax, num
bsr ecx, eax
mov ebx, 1
push ecx
shr ecx, 1
shl ebx, cl
mov esi, 1
pop ecx
shl esi, cl
sub eax, esi
bsr ecx, eax
shr ecx, 1
mov edi, 1
shl edi, cl
or ebx, edi
mov eax, ebx
ret
approx endp
For the test run, the result now looks like this:
fsqrt: 285997356
approx: 281530962
This obviously isn't very accurate yet, but at least I can imagine cases where it might easily be accurate enough, and it is pretty fast. If it's not accurate enough for your purposes, the next step is probably to just use fsqrt
. Although (at least by my timing) this routine is faster than fsqrt
, it already takes about two thirds the time that fsqrt does, so adding even one more iteration would put it about even with fsqrt
.
If you want a routine that's accurate and nearly competitive with fsqrt, you might consider the binary reducing algorithm instead:
isqrt proc num:dword
mov eax, num
xor ebx, ebx
bsr ecx, eax
and cl, 0feh
mov edx, 1
shl edx, cl
refine:
mov esi, ebx
add esi, edx
cmp esi, eax
ja @f
sub eax, esi
shr ebx, 1
add ebx, edx
jmp next
@@:
shr ebx, 1
next :
shr edx, 2
jnz refine
mov eax, ebx
ret
isqrt endp
Depending upon circumstances, calling conventions, etc., this can be marginally faster than the sqrt
in the C standard library (but it's often a little slower), and at least in my testing it produces identical results:
fsqrt: 285997356
isqrt: 285997356
If you want to improve speed and can tolerate some inaccuracy, you could probably do a bit to loosen the exit condition from that, and come up with something that produced reasonably accurate results a little more quickly. I'll leave that as the dreaded "exercise for the reader."
Rereading, I notice what I've written above isn't quite accurate: the references to fsqrt
aren't actually comparing to using the fsqrt
instruction directly--they're to calling the sqrt
in the C standard library. I haven't looked though--it's possible the compiler is generating code for that inline, so the comparison really is to the fsqrt
instruction without little (or nothing) else involved.
I made a few experiments and tested them by timing one billion calculations of sqrt(400000000). With your implementation, it takes 12 seconds on my PC.
I had a few ideas, but none of the simple ones improves the run time.
It would be possible to avoid loading the parameter from memory twice (mov eax, [esp+4]) by storing it into ebx, but this doesn't change anything.
Another attempt I made was to split the bits extraction part into two: copy eax into ebx, shift ebx by 2, and then, extract one bit from eax, shift by 4, extract one bit from ebx, shift by 4, etc. %rep 14 becomes %rep 7, of course. I thought this might make room for some parallel execution of the two halves, but somehow, it didn't (maybe because of the contention on cx ?).
Now, for something that actually works :-) Are you open to the modern additions to the x86 instruction set ? Haswell processors support the BMI2 instruction set, part of AVX2, which contains a neat new instruction named PEXT, for parallel extract. This has been available for some time on MMX registers via SSE4 (introduced on Nehalem), now it has come to regular registers.
PEXT dest, src, mask extracts bits from src, keeping those that are set in mask, and puts the result in dest. In the present case, this part:
; make the lower 16 bits of CX equal the even-numbered bits of EAX
shr eax, 3
rcr cx, 1
%rep 14
shr eax, 2
rcr cx, 1
%endrep
becomes
mov edx, 0x55555555 ; odd-numbered bits = 0, even-numbered bits = 1
pext ecx, eax, edx ; extract the even-numbered bits of eax into ecx
So, what does it change ? Now, the billion calculations take 4,7 seconds, so I guess this speaks for itself. Of course, the program now requires a Haswell processor.
-
\$\begingroup\$ My processor isn't a Haswell, so I couldn't try the PEXT instruction. However, I found the SSE4 instruction you were talking about: EXTRQ. It does the same thing, but with an XMM register. I'll see if I can use that instead. \$\endgroup\$Jintonation– Jintonation2014年05月03日 17:42:41 +00:00Commented May 3, 2014 at 17:42
Explore related questions
See similar questions with these tags.