In the Wikipedia article about the binary search algorithm, one small paragraph in particular caught my attention:
Midpoint and width
A different variation involves abandoning the L and R pointers and using a current position p and a width w. At each iteration, the position p is adjusted and the width is halved. Knuth states, "It is possible to do this, but only if extreme care is paid to the details."
Why would Mr Donald Knuth, author of the 1973 book 'The Art of Computer Programming', express this reservation?
Another Wikipedia article states:
Knuth considers the use of assembly language necessary for the speed and memory usage of algorithms to be judged.
And so, today I present both of these methods for binary searching an ordered list of unsigned bytes, using x86 assembly language. I have placed them side by side for ease of comparison.
On entry, AL has the value to find in the array at DS:BX, which has CX elements; because all of the logic involved is unsigned, the number of elements can be as high as 65535. On exit, a set CF denotes not found; if CF is clear, AX returns the 0-based index of where the value resides in the array.
push cx push cx
push si push si
push di push di
mov si,cx xor si,si
mov cx,1 mov di,cx
mov di,si <+++++ add si,di <++++++++ (6)
cmp di,cx <++++++++ stc +
+++ jb + + (1) +++ jcxz <+++++++++++ +
+ mov si,cx + + + mov di,cx + +
+ add si,di + + + shr cx,1 + +
+ rcr si,1 + + (2) + sub di,cx + +
+ dec si + + (3) + sub si,di + + (7)
+ cmp al,[bx+si] + + (10) + cmp al,[bx+si] + + (10)
+ jb ++++++++++++ + (11) + jb ++++++++++++ + (11)(12)
+ lea cx,[si+2] + (4) + lea cx,[di-1] + (4)
+ ja +++++++++++++++ (11) + ja +++++++++++++++ (11)
+ xchg ax,si (5) + xchg ax,si (5)
++> pop di ++> pop di
pop si pop si
pop cx pop cx
ret ret
Method 1 (on the left) uses Left (CX) and Right (DI) points that gradually move closer to each other. A Mid (SI) point is derived from them through simple averaging.

Method 2 (on the right) uses Left (CX) and Right (DI) widths that, when added together, represent the total width of the current list. The Mid (SI) point is placed at the start of the right partition.
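To make the width bookkeeping of method 2 concrete, here is a worked trace (an illustrative example of mine, not part of the routine itself) of searching for the value 40 in the 4-element array 10, 20, 30, 40:

; Illustrative trace of method 2: search AL=40 in 10,20,30,40 (CX=4).
;
;                             SI (rear)   CX (left)   DI (right)
; after xor si,si/mov di,cx       0           4            4
; pass 1: add si,di               4           4            4
;         halve the widths        4           2            2
;         sub si,di               2   -> AL vs [bx+2]=30: 40>30, go right
;         lea cx,[di-1]           2           1            2
; pass 2: add si,di               4           1            2   (rear restored)
;         halve the widths        4           0            1
;         sub si,di               3   -> AL vs [bx+3]=40: equal, found
; fall through: xchg ax,si -> AX=3, and CF=0 comes from the equal compare.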
(1) By using 1-based indexes instead of the usual 0-based ones, I managed to eliminate an extra test while still retaining the necessary CF.
(2) The previous addition produced a 17-bit result, and so the CF was picked up.
(3) The Mid point is still 0-based.
(4) Very nice to have lea not changing the flags.
(5) This xchg ax,si replaces the longer mov ax,si.
(6) SI becomes the Rear point. It proved more efficient to calculate the Midpoint coming from the rear of the current list rather than from the front.
(7) SI becomes the Midpoint.
This code can easily be adapted:

- To search through an array of words, replace the single cmp al,[bx+si] with push si + shl si,1 + cmp ax,[bx+si] + pop si (see the sketch after this list). It will even make the code run faster!
- To search through an array of dwords, replace the same cmp al,[bx+si] with push si + shl si,2 + cmp eax,[bx+si] + pop si. It will make the code run faster still!
- With an array of signed values, change jb into jl and ja into jg.
- With an array of signed values, make this jump go 1 line higher, to the stc.
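For instance, the word-sized adaptation from the first bullet would look like this in place (a sketch assembled from the instructions above):

; Word-array variant: scale the 0-based index by 2 before the compare.
; SI is saved and restored so the index bookkeeping still counts elements.
        push si
        shl  si, 1          ; byte offset = element index * 2
        cmp  ax, [bx+si]    ; AX holds the 16-bit value to find
        pop  si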
In conclusion:
Method 2 is 3% shorter (1 byte) and 3% faster. Since both code snippets are so similar it's not surprising that they execute in about the same time. I don't attach much importance to method 2 being faster, because binary search typically does very few iterations, and running this code on another computer will certainly show different results. I find both methods elegant, but I've come to prefer the second one.
Would my conclusion still hold using a high level language?
I translated both methods to QBasic functions and saw that method 2 is now longer in lines, 10% slower, and a bit less elegant. This clearly demonstrates that using compound expressions is very beneficial to an interpreted language.
Also noteworthy is the use of the integer divide operator. It revealed a more than 50% speed increase over the usual divide operator.
FUNCTION Find1% (Arr%(), Num%)
  Find1% = -1
  posL% = 1
  posR% = UBOUND(Arr%) + 1
  DO UNTIL posR% < posL%
    posM% = ((posL% + posR%) \ 2) - 1
    IF Num% < Arr%(posM%) THEN
      posR% = posM%
    ELSEIF Num% > Arr%(posM%) THEN
      posL% = posM% + 2
    ELSE
      Find1% = posM%
      EXIT FUNCTION
    END IF
  LOOP
END FUNCTION

FUNCTION Find2% (Arr%(), Num%)
  Find2% = -1
  sizL% = UBOUND(Arr%) + 1
  posM% = sizL%
  DO WHILE sizL%
    sizR% = sizL%
    sizL% = sizL% \ 2
    sizR% = sizR% - sizL%
    posM% = posM% - sizR%
    IF Num% > Arr%(posM%) THEN
      sizL% = sizR% - 1
      posM% = posM% + sizR%
    ELSEIF Num% = Arr%(posM%) THEN
      Find2% = posM%
      EXIT FUNCTION
    END IF
  LOOP
END FUNCTION
1 Answer
There is one obvious reason to prefer method 2, and that is the possibility of overflow when calculating the midpoint in method 1. Adding the left and right positions together first, before dividing, risks the intermediate result of the addition overflowing the destination type.
This is not a problem in the assembly code, because you're explicitly treating the values as unsigned and the semantics are well-defined.
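It's worth spelling out how the assembly dodges it: the sum of the two 16-bit points can carry out into a seventeenth bit, and method 1 recovers that bit by rotating the carry straight back in (this is the author's own notes (1)-(3), shown here in isolation):

; Method 1's overflow-proof midpoint:
        mov si, cx      ; SI = Left (1-based)
        add si, di      ; SI = Left + Right; bit 16 of the sum lands in CF
        rcr si, 1       ; rotate CF back in: SI = (Left + Right) / 2
        dec si          ; make the midpoint 0-based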
It is, however, a problem in the QBasic translation, since QBasic doesn't have unsigned types. I'm not familiar enough with the QBasic language specification to say exactly what happens in the event of an overflow, but I believe (based on what I know about its descendant language, Visual Basic) that overflow checks will be inserted after every arithmetic operation (unless you've disabled them in the compiler options, which won't exist for the interpreted-only QBasic). Specifically, jo instructions will conditionally jump to an error-handling routine in the event of an overflow, which will raise an Error with Basic language semantics. In addition to slowing down execution, you'll have to ensure that you either handle this error (which will slow things down more) or suffer the consequences of code that breaks in edge cases.
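In instruction terms, an overflow-checked addition of the kind I'm describing would look something like this (a hypothetical sketch of what an overflow-checking Basic might emit — I haven't disassembled QBasic to confirm; posL, posR, posM, and RuntimeOverflowError are placeholder names):

; Hypothetical overflow-checked signed add, as an interpreter might emit:
        mov  ax, [posL]
        add  ax, [posR]             ; signed 16-bit add; OF set on overflow
        jo   RuntimeOverflowError   ; branch to the error-raising routine
        mov  [posM], ax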
I agree that your assembly code is elegant, well-written, and clear.
(Although, for what it's worth, I find the arrows showing the flow of control more confusing than the normal labels and branch targets. The similar-looking arrows all blend together when I try to study them. Maybe it's because I'm not a visual learner.)
method 2 is 3% shorter (1 byte) and 3% faster. Since both code snippets are so similar it's not surprising that they execute in about the same time.
I don't know what processor you're running your performance tests on in order to claim this result, but the fact that you've made code shorter does not necessarily make it faster.
(I'm going to talk about a bunch of obsolete architectures because I'm assuming that's the most likely target for 16-bit x86 assembly code! I'll try to throw in some notes on modern processors, too.)
On the 8086, and especially the 8088, it is true that shorter code is almost always better, since instruction prefetching is so slow and there is no cache. Although some instructions are slower than others, you can pretty much substitute a single-byte instruction for a multi-byte instruction and see a speed-up, as long as the cycle counts are comparable.
That was no longer so true on the 286, and it became completely untrue on the 386 and later. Especially on the 386, there are a bunch of "legacy" CISC-style instructions that are very slow. Replacing them with simpler instructions (regardless of whether or not they are shorter in terms of bytes) almost always makes the code execute more quickly. And thus was introduced the classic size vs. speed tradeoff for optimizing compilers.
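A classic illustration of that tradeoff (a sketch; 'Top' is a placeholder label): the compact loop instruction versus the two-instruction equivalent, which costs an extra byte but runs faster on the 386 and later.

; 2-byte CISC form; fine on the 8088, comparatively slow on the 386+:
        loop Top            ; decrement CX, branch to Top while CX != 0
; 3-byte equivalent; larger, yet faster on later processors
; (note: unlike loop, dec does update the flags):
        dec  cx
        jnz  Top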
A case in point is the jcxz instruction, which tests the value in the cx register to see if it is equal to zero and, if so, executes a conditional branch. Notice that this is equivalent to:
test cx, cx ; a faster way to do "cmp cx, 0"
je BranchTarget
Even though jcxz is 2 bytes, compared to 2+2 bytes for test+je, jcxz is slower. On the 286–486, jcxz executes in ~8 clock cycles when the branch is taken and ~5 clock cycles when it is not taken. The alternative costs 1 cycle for the test reg, reg, plus 7 cycles on the 286/386 when je is taken or 3 cycles when not taken (on the 486, je is even faster—3 cycles when taken, 1 cycle otherwise). Thus, in the worst case on the 286/386, test+je is equally as fast as jcxz, while in the best case, it is about twice as fast. On the 486, test+je murders jcxz speed-wise. Intel's optimization manuals at this point in history explicitly recommended using test+je instead of jcxz.
Similarly, the 2-byte xchg reg, reg was generally the fastest way to exchange two enregistered values on the 808x. It executed in 4 cycles, which was the same as two mov reg, reg instructions (2 bytes and 2 cycles each), but it was half the size, so it was quicker to fetch. Things changed on the 286 and on all subsequent processors: xchg reg, reg was now a 3-cycle instruction, but mov reg, reg became a 1-cycle instruction, so doing two movs was faster than a single xchg—assuming, that is, you had an extra register to spare. If not, given the limited register set, you might still have been better off with xchg. The 486 changed that yet again: since xor reg, reg ran in only 1 cycle, you could do the three xors of the classic swap trick as fast as or faster than a single xchg (faster if you could interleave the three xors with other instructions you needed to execute anyway, to reduce dependencies).
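For reference, the classic three-xor swap trick looks like this (a sketch using this routine's registers; it needs no scratch register, at the price of each instruction depending on the previous one):

; Exchange AX and SI without a temporary register:
        xor ax, si
        xor si, ax          ; SI now holds the original AX
        xor ax, si          ; AX now holds the original SI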
The Pentium had dual execution pipelines, known as the U and V pipes. The U pipe was basically a full-featured 486, while the V pipe was more limited and could only execute a subset of instructions. This made instruction pairing a big deal, because if you properly paired your instructions such that one could execute in the U pipe while the other executed in the V pipe, you could effectively double performance. The xor trick becomes even more attractive here than a single xchg, because xor paired in either pipe and could be even more effectively interleaved within a complicated sequence of instructions. The Pentium Pro introduced out-of-order execution, which made perfect instruction pairing less important, but much of the same optimization logic still applies: the CISC-style instructions are very slow on modern processors and should not be used, even when they are fewer bytes. The days of rote cycle-counting are long gone, however.
All of that to say that you can probably improve the performance of method 2 even further by choosing more optimal instruction sequences, even if that results in larger code.
Another way to speed the code up is to replace conditional branches with clever bitwise or arithmetic operations that accomplish the same thing branchlessly. This is a huge performance win on the 808x and extremely significant on any processor that lacks a branch predictor (introduced with the Pentium). Even on modern processors, there are massive penalties if branch prediction is unsuccessful (not strongly biased for taken or not-taken), so rewriting branching code to be branchless can still result in a performance speed-up. You've got a tight loop here, the body of which is a perfect place to apply such optimizations. This is equally true for either method (and, at first glance, it looks like the branches could be more elegantly removed in method 1).
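To give a flavor of the technique (a generic sketch, not a drop-in rewrite of either method): a compare followed by sbb materializes the carry flag as an all-ones/all-zeros mask, which can then select a value without any branch.

; Branchless select: AX = (AX < DX) ? BX : CX  (unsigned compare).
        cmp ax, dx          ; CF = 1 when AX < DX
        sbb si, si          ; SI = 0FFFFh when CF was set, else 0
        and bx, si          ; keep BX only when the condition held
        not si
        and cx, si          ; keep CX only when it did not
        or  bx, cx
        mov ax, bx          ; the selected value (BX and CX are clobbered)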
You could also ditch the relatively slow stc instruction if you replaced jcxz with cmp+je instead of test+je. Normally, you would prefer test for a comparison against zero, but in this case, cmp is likely a better choice, because it sets the carry flag (CF) according to its result, while test always clears CF.
One thing that likely makes a big difference in the relative performance of method 1 and method 2 is method 1's rcr instruction versus method 2's shr instruction—or at least, it may, depending on which processor you're using to do the timing. rcr reg, 1 is extremely slow on the 386—about three times slower than shr reg, 1. The difference isn't there on the 808x, the 286, the Pentium I/II/III, or most AMD processors. However, on more recent Intel processors (starting with the Pentium M, extending through the Core 2, and up to current microarchitectures, including Atom), rcr is once again about three times slower than shr (higher latency and lower throughput).
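And if the data guarantees that the sum of the two points fits in 16 bits (for instance, the array never exceeds 32767 elements), the carry can never be set, so method 1 could use the cheaper shr as well—a sketch under that assumption:

; Midpoint when SI + DI is known not to carry out of 16 bits:
        add si, di
        shr si, 1           ; plain halve; there is no 17th bit to recover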
Finally, you aren't seeing it here because you're doing an apples-to-apples comparison, but 16-bit instructions are one byte longer (because of an operand-size prefix) and noticeably slower when running in 32-bit protected mode. If you're not running in real mode, you could speed up the code "for free" by switching to 32-bit instructions, using movzx (or xor r32, r32 followed by mov r16, r/m16) to initialize 32-bit registers with unsigned 16-bit integer values.
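Concretely, the initialization would look something like this (a sketch; 'count' is a placeholder memory operand):

; Zero-extend a 16-bit element count into a 32-bit register:
        movzx ecx, word ptr [count]
; or, equivalently:
        xor   ecx, ecx      ; also recognized as a dependency-breaking zero
        mov   cx,  word ptr [count]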
A brief detour to consider an especially confounding optimization case—at least, historically speaking…
The following instruction: cmp al, [bx+si] has a register as its destination operand and a memory address as its source operand. On the 286 and the 386, this instruction executes in 6 clocks. Had we swapped the order of the operands, so that the memory address was the destination and the register was the source, it would execute slightly slower on the 286 (7 clocks) but slightly faster on the 386 (5 clocks).
Either form could be used, since cmp doesn't modify its destination operand. You'd just have to adjust the direction of the subsequent branch, since the flags will be set for the opposite comparison. The struggle is: which one do you use? If you're optimizing for the 286, you'd use cmp mem, reg, but if you're optimizing for the 386, you'd use cmp reg, mem. Fortunately, on the 486 and all later processors, the choice is irrelevant because the two forms execute at identical speeds.
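Spelled out with the instruction at hand (a sketch; 'GoLeft' is a placeholder branch target):

; Form 1: register destination -- 6 clocks on the 286, 6 on the 386:
        cmp al, [bx+si]
        jb  GoLeft          ; taken when AL < element
; Form 2: memory destination -- 7 clocks on the 286, 5 on the 386:
        cmp [bx+si], al
        ja  GoLeft          ; same condition, with the comparison reversed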
Would my conclusion still hold using a high level language?
Critically missing here is the word interpreted. The language's level of abstraction does not matter, as long as it is compiled to machine code, preferably by an optimizing compiler. Theoretically, you could write this in any high-level language of your choosing and it would be translated into assembly code that is equivalent to, if not better than, what you've written by hand.
In fact, to demonstrate that, here is a C++ translation of your QBasic code for Find2:
#include <cstdint>

uint32_t Find2(uint32_t* pArr, uint32_t cArr, uint32_t num)
{
    uint32_t result = static_cast<uint32_t>(-1);
    uint32_t sizL   = cArr;
    uint32_t posM   = sizL;
    while (sizL)
    {
        uint32_t sizR = sizL;
        sizL /= 2U;
        sizR -= sizL;
        posM -= sizR;
        if (num == pArr[posM])
        {
            return posM;
        }
        else if (num > pArr[posM])
        {
            sizL = (sizR - 1U);
            posM += sizR;
        }
    }
    return result;
}
And here is the object code generated by MSVC 2010, targeting a modern 32-bit x86 processor:
Find2 PROC
mov eax, DWORD PTR [cArr]
push ebx
push esi
push edi
mov ecx, eax
test eax, eax
je SHORT L3
mov edi, DWORD PTR [num]
mov ebx, DWORD PTR [pArr]
L1:
mov edx, ecx ; the following inner-loop sequence looks *very* familiar!
shr ecx, 1
sub edx, ecx
sub eax, edx
mov esi, DWORD PTR [ebx+eax*4] ; your CMP was split up into two separate load-compare
cmp edi, esi ; instructions b/c this is faster on PPro and later
je SHORT L4
jbe SHORT L2 ; not sure why JBE is used instead of JB; doesn't matter though
lea ecx, DWORD PTR [edx-1]
add eax, edx
L2:
test ecx, ecx ; this seems a poor optimization choice, to branch to a separate
jne SHORT L1 ; block here instead of sharing the instructions preceding the
; loop (possibly forced into this by poor register allocation?)
;
L3:
or eax, -1 ; fast way to set register to -1 on older CPUs, but
; may be slower on modern CPUs due to false dependency
L4:
pop edi
pop esi
pop ebx
ret
Find2 ENDP
Okay, compilers aren't [yet] quite as good as we might hope, but still, there are a lot of similarities between what the compiler generates and your hand-written assembly code, especially in the inner loop. I've annotated these similarities, a few interesting decisions made by the optimizer, and one questionable optimization choice. Another big difference is that the compiler isn't returning a result in the carry flag, and is therefore unable to utilize it as cleverly as your hand-written assembly does. You can see this manifest most visibly in the need to explicitly initialize a register with −1 in order to return it.
Note, however, that the code we're analyzing is just a literal translation of your QBasic code into C++ (which is itself a pretty literal translation of the assembly code into QBasic). I suspect that there is a more clever way to write the C++ code that will generate even better output from the compiler. You obviously spent quite a bit of time and gave quite a bit of thought to tweaking that assembly code. Imagine if you spent the same amount of time thinking, tweaking, and refining the C++ code. I think you could get very close performance-wise. I suspect the same is probably true of QBasic.
I translated both methods to QBasic functions and saw that method 2 is now longer in lines, 10% slower, and a bit less elegant. This clearly demonstrates that using compound expressions is very beneficial to an interpreted language.
I don't think that's a valid conclusion. That's not to say it isn't true that compound expressions execute more quickly in an interpreted language, but I don't think you have "clearly demonstrated" that here. Nor do I think you have proven that method 1 is superior to method 2 in interpreted languages. All you've proven is that you managed to translate method 1 into code that runs more efficiently in QBasic, whereas you weren't able to do the same for method 2.
I can trivially write method 1 using the same number of lines and the same looping construct as method 2—do you think it'll run more slowly now?
FUNCTION Find1% (Arr%(), Num%)
  Find1% = -1
  posL% = 1
  posR% = UBOUND(Arr%) + 1
  DO WHILE posR% >= posL%
    posM% = posL% + posR%
    posM% = posM% \ 2
    posM% = posM% - 1
    IF Num% < Arr%(posM%) THEN
      posR% = posM%
    ELSEIF Num% > Arr%(posM%) THEN
      posL% = posM% + 2
    ELSE
      Find1% = posM%
      EXIT FUNCTION
    END IF
  LOOP
END FUNCTION
This could suggest a more interesting and more fair strategy for performance comparison.
Also noteworthy is the use of the integer divide operator. It revealed a more than 50% speed increase over the usual divide operator.
Noteworthy, perhaps, but not surprising. If you don't use the integer divide operator, you get a floating-point division. Semantically, then, this is really just a bug: you're dividing two integers using the floating-point division operator.
Why is it so much slower to do the division in floating-point? Well, there are a couple of reasons. First, you have to pay for a conversion from the integer value into a floating-point value, and then a conversion from the floating-point result back into an integer value. Aside from the fact that this pointlessly injects a bunch of extra instructions, it also vastly increases the latency. Second, there's the fact that the QBasic system is so old that it can only emit instructions that use the x87 FPU, which is less efficient than SSE/SSE2 instructions and slower to move values into and out of, because values have to round-trip through memory. In fact, it is probably even worse than that. Given its age, QBasic probably doesn't even use FPU instructions. Back in the day, not all systems had FPUs installed, and floating-point operations had to be simulated using routines provided by a runtime library, where you not only had to pay the cost of a function call, but the simulated code was also much slower than a simple FDIV.
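To put that in instruction terms, here is roughly what the two operators come down to (a sketch assuming an x87 FPU is present; 'sum', 'posM', and 'two' are placeholder operands, and on an FPU-less machine each x87 instruction below becomes an even slower emulation-library call):

; posM% = sum% \ 2  -- integer division:
        mov  ax, [sum]
        cwd                       ; sign-extend AX into DX:AX
        mov  cx, 2
        idiv cx                   ; AX = quotient
        mov  [posM], ax
; posM% = sum% / 2  -- float divide plus two conversions:
        fild  word ptr [sum]      ; integer -> x87 float
        fdiv  dword ptr [two]     ; the slow divide ('two' holds 2.0)
        fistp word ptr [posM]     ; x87 float -> integer (also rounds)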
- Huh. I'm using VS2010, but that's not the assembly it's generating for me at all. We must be using different optimization settings? That's not how /FA is formatting my output either. Honestly, that L1 L2 stuff looks more like gcc? I was going to point out an amusing optimization quirk that produced better code in VS2010, but there's not much point since our output is so very different. Still, an excellent analysis on this rather old question. Hopefully OP will come back and accept. — David Wohlferd, Jan 7, 2017 at 4:02
- @david I reformatted the output for readability, including changing the labels, but didn't change the code itself. This is at /O2 optimization, targeting 32-bit. I just checked again to make sure I didn't accidentally use the wrong output. It does get shorter if you use /O1, but in my experience, MSVC does a number of dumb things at that optimization level, so I never use it globally. In this case, /O1 causes it to re-load values from memory inside of the loop, which doesn't make a lot of sense for an inner loop. What is interesting is that VS 2010 seems to produce better code than VS 2015. — Cody Gray, Jan 7, 2017 at 12:43
- Please forgive me my neglect of your answer these past years. It truly is a very interesting review! One observation about replacing stc + jcxz with cmp + je: this could never produce the wanted CF=1 to indicate failure. What will work is cmp cx, 1 + jb. Again, sorry for the long wait. — Sep Roland, Aug 2, 2020 at 21:00
- […] END IF. In general, it seems like any difference between the assembly version and the QBasic version is simply a reflection of the efficiency of the QBasic compiler. Although I'm not familiar with QBasic. Is it an interpreter?