In the Wikipedia article about the binary search algorithm, one small paragraph in particular caught my attention:
Midpoint and width
A different variation involves abandoning the L and R pointers and using a current position p and a width w. At each iteration, the position p is adjusted and the width is halved. Knuth states, "It is possible to do this, but only if extreme care is paid to the details."
Why would Mr Donald Knuth, author of the 1973 book 'The Art of Computer Programming', express this reservation?
Another Wikipedia article states:
Knuth considers the use of assembly language necessary for the speed and memory usage of algorithms to be judged.
And so, today I present both of these methods for binary searching an ordered list of unsigned bytes, using x86 assembly language. I have placed them side by side for ease of comparison.
On entry, AL has the value to find in the array at DS:BX, which has CX elements; because all of the logic involved is unsigned, the number of elements can be as high as 65535. On exit, a set CF denotes not found; if CF is clear, AX returns the 0-based index of where the value resides in the array.
push cx push cx
push si push si
push di push di
mov si,cx xor si,si
mov cx,1 mov di,cx
mov di,si <+++++ add si,di <++++++++ (6)
cmp di,cx <++++++++ stc +
+++ jb + + (1) +++ jcxz <+++++++++++ +
+ mov si,cx + + + mov di,cx + +
+ add si,di + + + shr cx,1 + +
+ rcr si,1 + + (2) + sub di,cx + +
+ dec si + + (3) + sub si,di + + (7)
+ cmp al,[bx+si] + + (10) + cmp al,[bx+si] + + (10)
+ jb ++++++++++++ + (11) + jb ++++++++++++ + (11)(12)
+ lea cx,[si+2] + (4) + lea cx,[di-1] + (4)
+ ja +++++++++++++++ (11) + ja +++++++++++++++ (11)
+ xchg ax,si (5) + xchg ax,si (5)
++> pop di ++> pop di
pop si pop si
pop cx pop cx
ret ret
Method 1 (on the left) uses Left (CX) and Right (DI) points that gradually move closer to each other. A Mid (SI) point is derived from them through simple averaging.

Method 2 (on the right) uses Left (CX) and Right (DI) widths that, when added together, represent the total width of the current list. The Mid (SI) point is placed at the start of the right partition.
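To make the width bookkeeping of method 2 concrete, here is a worked trace (an illustrative example of mine, not part of the routine itself) of searching for the value 40 in the 4-element array 10, 20, 30, 40:

; Illustrative trace of method 2: search AL=40 in 10,20,30,40 (CX=4).
;
;                             SI (rear)   CX (left)   DI (right)
; after xor si,si/mov di,cx       0           4            4
; pass 1: add si,di               4           4            4
;         halve the widths        4           2            2
;         sub si,di               2   -> AL vs [bx+2]=30: 40>30, go right
;         lea cx,[di-1]           2           1            2
; pass 2: add si,di               4           1            2   (rear restored)
;         halve the widths        4           0            1
;         sub si,di               3   -> AL vs [bx+3]=40: equal, found
; fall through: xchg ax,si -> AX=3, and CF=0 comes from the equal compare.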
(1) By using 1-based indexes instead of the usual 0-based ones, I managed to eliminate an extra test while still retaining the necessary CF.
(2) The previous addition produced a 17-bit result, and so the CF was picked up.
(3) The Mid point is still 0-based.
(4) Very nice to have lea not changing the flags.
(5) This xchg ax,si replaces the longer mov ax,si.
(6) SI becomes the Rear point. It proved more efficient to calculate the Midpoint coming from the rear of the current list rather than from the front.
(7) SI becomes the Midpoint.
This code can easily be adapted:

- To search through an array of words, replace the single cmp al,[bx+si] with push si + shl si,1 + cmp ax,[bx+si] + pop si (see the sketch after this list). It will even make the code run faster!
- To search through an array of dwords, replace the same cmp al,[bx+si] with push si + shl si,2 + cmp eax,[bx+si] + pop si. It will make the code run faster still!
- With an array of signed values, change jb into jl and ja into jg.
- With an array of signed values, make this jump go 1 line higher, to the stc.
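For instance, the word-sized adaptation from the first bullet would look like this in place (a sketch assembled from the instructions above):

; Word-array variant: scale the 0-based index by 2 before the compare.
; SI is saved and restored so the index bookkeeping still counts elements.
        push si
        shl  si, 1          ; byte offset = element index * 2
        cmp  ax, [bx+si]    ; AX holds the 16-bit value to find
        pop  si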
In conclusion:
Method 2 is 3% shorter (1 byte) and 3% faster. Since both code snippets are so similar it's not surprising that they execute in about the same time. I don't attach much importance to method 2 being faster, because binary search typically does very few iterations, and running this code on another computer will certainly show different results. I find both methods elegant, but I've come to prefer the second one.
Would my conclusion still hold using a high level language?
I translated both methods to QBasic functions and saw that method 2 is now longer in lines, 10% slower, and a bit less elegant. This clearly demonstrates that using compound expressions is very beneficial to an interpreted language.
Also noteworthy is the use of the integer divide operator. It revealed a more than 50% speed increase over the usual divide operator.
FUNCTION Find1% (Arr%(), Num%)
  Find1% = -1
  posL% = 1
  posR% = UBOUND(Arr%) + 1
  DO UNTIL posR% < posL%
    posM% = ((posL% + posR%) \ 2) - 1
    IF Num% < Arr%(posM%) THEN
      posR% = posM%
    ELSEIF Num% > Arr%(posM%) THEN
      posL% = posM% + 2
    ELSE
      Find1% = posM%
      EXIT FUNCTION
    END IF
  LOOP
END FUNCTION

FUNCTION Find2% (Arr%(), Num%)
  Find2% = -1
  sizL% = UBOUND(Arr%) + 1
  posM% = sizL%
  DO WHILE sizL%
    sizR% = sizL%
    sizL% = sizL% \ 2
    sizR% = sizR% - sizL%
    posM% = posM% - sizR%
    IF Num% > Arr%(posM%) THEN
      sizL% = sizR% - 1
      posM% = posM% + sizR%
    ELSEIF Num% = Arr%(posM%) THEN
      Find2% = posM%
      EXIT FUNCTION
    END IF
  LOOP
END FUNCTION
1 Answer
There is one obvious reason to prefer method 2, and that is the possibility of overflow when calculating the midpoint in method 1. Adding the left and right positions together first, before dividing, risks the intermediate result of the addition overflowing the destination type.
This is not a problem in the assembly code, because you're explicitly treating the values as unsigned and the semantics are well-defined.
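It's worth spelling out how the assembly dodges it: the sum of the two 16-bit points can carry out into a seventeenth bit, and method 1 recovers that bit by rotating the carry straight back in (this is the author's own notes (1)-(3), shown here in isolation):

; Method 1's overflow-proof midpoint:
        mov si, cx      ; SI = Left (1-based)
        add si, di      ; SI = Left + Right; bit 16 of the sum lands in CF
        rcr si, 1       ; rotate CF back in: SI = (Left + Right) / 2
        dec si          ; make the midpoint 0-based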
It is, however, a problem in the QBasic translation, since QBasic doesn't have unsigned types. I'm not familiar enough with the QBasic language specification to say exactly what happens in the event of an overflow, but I believe (based on what I know about its descendant language, Visual Basic) that overflow checks will be inserted after every arithmetic operation (unless you've disabled them in the compiler options, which won't exist for the interpreted-only QBasic). Specifically, jo instructions will conditionally jump to an error-handling routine in the event of an overflow, which will raise an Error with Basic language semantics. In addition to slowing down execution, you'll have to ensure that you either handle this error (which will slow things down more) or suffer the consequences of code that breaks in edge cases.
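In instruction terms, an overflow-checked addition of the kind I'm describing would look something like this (a hypothetical sketch of what an overflow-checking Basic might emit — I haven't disassembled QBasic to confirm; posL, posR, posM, and RuntimeOverflowError are placeholder names):

; Hypothetical overflow-checked signed add, as an interpreter might emit:
        mov  ax, [posL]
        add  ax, [posR]             ; signed 16-bit add; OF set on overflow
        jo   RuntimeOverflowError   ; branch to the error-raising routine
        mov  [posM], ax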
I agree that your assembly code is elegant, well-written, and clear.
(Although, for what it's worth, I find the arrows showing the flow of control more confusing than the normal labels and branch targets. The similar-looking arrows all blend together when I try to study them. Maybe it's because I'm not a visual learner.)
method 2 is 3% shorter (1 byte) and 3% faster. Since both code snippets are so similar it's not surprising that they execute in about the same time.
I don't know what processor you're running your performance tests on in order to claim this result, but the fact that you've made code shorter does not necessarily make it faster.
(I'm going to talk about a bunch of obsolete architectures because I'm assuming that's the most likely target for 16-bit x86 assembly code! I'll try to throw in some notes on modern processors, too.)
On the 8086, and especially the 8088, it is true that shorter code is almost always better, since instruction prefetching is so slow and there is no cache. Although some instructions are slower than others, you can pretty much substitute a single-byte instruction for a multi-byte instruction and see a speed-up, as long as the cycle counts are comparable.
That was no longer so true on the 286, and it became completely untrue on the 386 and later. Especially on the 386, there are a bunch of "legacy" CISC-style instructions that are very slow. Replacing them with simpler instructions (regardless of whether or not they are shorter in terms of bytes) almost always makes the code execute more quickly. And thus was introduced the classic size vs. speed tradeoff for optimizing compilers.
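A classic illustration of that tradeoff (a sketch; 'Top' is a placeholder label): the compact loop instruction versus the two-instruction equivalent, which costs an extra byte but runs faster on the 386 and later.

; 2-byte CISC form; fine on the 8088, comparatively slow on the 386+:
        loop Top            ; decrement CX, branch to Top while CX != 0
; 3-byte equivalent; larger, yet faster on later processors
; (note: unlike loop, dec does update the flags):
        dec  cx
        jnz  Top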
A case in point is the jcxz instruction, which tests the value in the cx register to see if it is equal to zero and, if so, executes a conditional branch. Notice that this is equivalent to:
test cx, cx ; a faster way to do "cmp cx, 0"
je BranchTarget
Even though jcxz is 2 bytes, compared to 2+2 bytes for test+je, jcxz is slower. On the 286–486, jcxz executes in ~8 clock cycles when the branch is taken and ~5 clock cycles when it is not taken. The alternative costs 1 cycle for the test reg, reg, plus 7 cycles on the 286/386 when je is taken or 3 cycles when not taken (on the 486, je is even faster—3 cycles when taken, 1 cycle otherwise). Thus, in the worst case on the 286/386, test+je is equally as fast as jcxz, while in the best case, it is about twice as fast. On the 486, test+je murders jcxz speed-wise. Intel's optimization manuals at this point in history explicitly recommended using test+je instead of jcxz.
Similarly, the 2-byte xchg reg, reg was generally the fastest way to exchange two enregistered values on the 808x. It executed in 4 cycles, which was the same as two mov reg, reg instructions (2 bytes and 2 cycles each), but it was half the size, so it was quicker to fetch. Things changed on the 286 and on all subsequent processors: xchg reg, reg was now a 3-cycle instruction, but mov reg, reg became a 1-cycle instruction, so doing two movs was faster than a single xchg—assuming, that is, you had an extra register to spare. If not, given the limited register set, you might still have been better off with xchg. The 486 changed that yet again: since xor reg, reg ran in only 1 cycle, you could do the three xors of the classic swap trick as fast as or faster than a single xchg (faster if you could interleave the three xors with other instructions you needed to execute anyway, to reduce dependencies).
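For reference, the classic three-xor swap trick looks like this (a sketch using this routine's registers; it needs no scratch register, at the price of each instruction depending on the previous one):

; Exchange AX and SI without a temporary register:
        xor ax, si
        xor si, ax          ; SI now holds the original AX
        xor ax, si          ; AX now holds the original SI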
The Pentium had dual execution pipelines, known as the U and V pipes. The U pipe was basically a full-featured 486, while the V pipe was more limited and could only execute a subset of instructions. This made instruction pairing a big deal, because if you properly paired your instructions such that one could execute in the U pipe while the other executed in the V pipe, you could effectively double performance. The xor trick becomes even more attractive here than a single xchg, because xor paired in either pipe and could be even more effectively interleaved within a complicated sequence of instructions. The Pentium Pro introduced out-of-order execution, which made perfect instruction pairing less important, but much of the same optimization logic still applies: the CISC-style instructions are very slow on modern processors and should not be used, even when they are fewer bytes. The days of rote cycle-counting are long gone, however.
All of that to say that you can probably improve the performance of method 2 even further by choosing more optimal instruction sequences, even if that results in larger code.
Another way to speed the code up is to replace conditional branches with clever bitwise or arithmetic operations that accomplish the same thing branchlessly. This is a huge performance win on the 808x and extremely significant on any processor that lacks a branch predictor (introduced with the Pentium). Even on modern processors, there are massive penalties if branch prediction is unsuccessful (not strongly biased for taken or not-taken), so rewriting branching code to be branchless can still result in a performance speed-up. You've got a tight loop here, the body of which is a perfect place to apply such optimizations. This is equally true for either method (and, at first glance, it looks like the branches could be more elegantly removed in method 1).
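To give a flavor of the technique (a generic sketch, not a drop-in rewrite of either method): a compare followed by sbb materializes the carry flag as an all-ones/all-zeros mask, which can then select a value without any branch.

; Branchless select: AX = (AX < DX) ? BX : CX  (unsigned compare).
        cmp ax, dx          ; CF = 1 when AX < DX
        sbb si, si          ; SI = 0FFFFh when CF was set, else 0
        and bx, si          ; keep BX only when the condition held
        not si
        and cx, si          ; keep CX only when it did not
        or  bx, cx
        mov ax, bx          ; the selected value (BX and CX are clobbered)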
You could also ditch the relatively slow stc instruction if you replaced jcxz with cmp+je instead of test+je. Normally, you would prefer test for a comparison against zero, but in this case, cmp is likely a better choice, because it sets the carry flag (CF) according to its result, while test always clears CF.
One thing that likely makes a big difference in the relative performance of method 1 and method 2 is method 1's rcr instruction versus method 2's shr instruction—or at least, it may, depending on which processor you're using to do the timing. rcr reg, 1 is extremely slow on the 386—about three times slower than shr reg, 1. The difference isn't there on the 808x, the 286, the Pentium I/II/III, or most AMD processors. However, on more recent Intel processors (starting with the Pentium M, extending through the Core 2, and up to current microarchitectures, including Atom), rcr is once again about three times slower than shr (higher latency and lower throughput).
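And if the data guarantees that the sum of the two points fits in 16 bits (for instance, the array never exceeds 32767 elements), the carry can never be set, so method 1 could use the cheaper shr as well—a sketch under that assumption:

; Midpoint when SI + DI is known not to carry out of 16 bits:
        add si, di
        shr si, 1           ; plain halve; there is no 17th bit to recover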
Finally, you aren't seeing it here because you're doing an apples-to-apples comparison, but 16-bit instructions are one byte longer (because of an operand-size prefix) and noticeably slower when running in 32-bit protected mode. If you're not running in real mode, you could speed up the code "for free" by switching to 32-bit instructions, using movzx (or xor r32, r32 followed by mov r16, r/m16) to initialize 32-bit registers with unsigned 16-bit integer values.
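Concretely, the initialization would look something like this (a sketch; 'count' is a placeholder memory operand):

; Zero-extend a 16-bit element count into a 32-bit register:
        movzx ecx, word ptr [count]
; or, equivalently:
        xor   ecx, ecx      ; also recognized as a dependency-breaking zero
        mov   cx,  word ptr [count]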
A brief detour to consider an especially confounding optimization case—at least, historically speaking…
The following instruction: cmp al, [bx+si] has a register as its destination operand and a memory address as its source operand. On the 286 and the 386, this instruction executes in 6 clocks. Had we swapped the order of the operands, so that the memory address was the destination and the register was the source, it would execute slightly slower on the 286 (7 clocks) but slightly faster on the 386 (5 clocks).
Either form could be used, since cmp doesn't modify its destination operand. You'd just have to adjust the direction of the subsequent branch, since the flags will be set for the opposite comparison. The struggle is: which one do you use? If you're optimizing for the 286, you'd use cmp mem, reg, but if you're optimizing for the 386, you'd use cmp reg, mem. Fortunately, on the 486 and all later processors, the choice is irrelevant because the two forms execute at identical speeds.
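Spelled out with the instruction at hand (a sketch; 'GoLeft' is a placeholder branch target):

; Form 1: register destination -- 6 clocks on the 286, 6 on the 386:
        cmp al, [bx+si]
        jb  GoLeft          ; taken when AL < element
; Form 2: memory destination -- 7 clocks on the 286, 5 on the 386:
        cmp [bx+si], al
        ja  GoLeft          ; same condition, with the comparison reversed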
Would my conclusion still hold using a high level language?
Critically missing here is the word interpreted. The language's level of abstraction does not matter, as long as it is compiled to machine code, preferably by an optimizing compiler. Theoretically, you could write this in any high-level language of your choosing and it would be translated into assembly code that is equivalent to, if not better than, what you've written by hand.
In fact, to demonstrate that, here is a C++ translation of your QBasic code for Find2:
#include <cstdint>

uint32_t Find2(uint32_t* pArr, uint32_t cArr, uint32_t num)
{
    uint32_t result = static_cast<uint32_t>(-1);
    uint32_t sizL   = cArr;
    uint32_t posM   = sizL;
    while (sizL)
    {
        uint32_t sizR = sizL;
        sizL /= 2U;
        sizR -= sizL;
        posM -= sizR;
        if (num == pArr[posM])
        {
            return posM;
        }
        else if (num > pArr[posM])
        {
            sizL = (sizR - 1U);
            posM += sizR;
        }
    }
    return result;
}
And here is the object code generated by MSVC 2010, targeting a modern 32-bit x86 processor:
Find2 PROC
mov eax, DWORD PTR [cArr]
push ebx
push esi
push edi
mov ecx, eax
test eax, eax
je SHORT L3
mov edi, DWORD PTR [num]
mov ebx, DWORD PTR [pArr]
L1:
mov edx, ecx ; the following inner-loop sequence looks *very* familiar!
shr ecx, 1
sub edx, ecx
sub eax, edx
mov esi, DWORD PTR [ebx+eax*4] ; your CMP was split up into two separate load-compare
cmp edi, esi ; instructions b/c this is faster on PPro and later
je SHORT L4
jbe SHORT L2 ; not sure why JBE is used instead of JB; doesn't matter though
lea ecx, DWORD PTR [edx-1]
add eax, edx
L2:
test ecx, ecx ; this seems a poor optimization choice, to branch to a separate
jne SHORT L1 ; block here instead of sharing the instructions preceding the
; loop (possibly forced into this by poor register allocation?)
;
L3:
or eax, -1 ; fast way to set register to -1 on older CPUs, but
; may be slower on modern CPUs due to false dependency
L4:
pop edi
pop esi
pop ebx
ret
Find2 ENDP
Okay, compilers aren't [yet] quite as good as we might hope, but still, there are a lot of similarities between what the compiler generates and your hand-written assembly code, especially in the inner loop. I've annotated these similarities, a few interesting decisions made by the optimizer, and one questionable optimization choice. Another big difference is that the compiler isn't returning a result in the carry flag, and is therefore unable to utilize it as cleverly as your hand-written assembly does. You can see this manifest most visibly in the need to explicitly initialize a register with −1 in order to return it.
Note, however, that the code we're analyzing is just a literal translation of your QBasic code into C++ (which is itself a pretty literal translation of the assembly code into QBasic). I suspect that there is a more clever way to write the C++ code that will generate even better output from the compiler. You obviously spent quite a bit of time and gave quite a bit of thought to tweaking that assembly code. Imagine if you spent the same amount of time thinking, tweaking, and refining the C++ code. I think you could get very close performance-wise. I suspect the same is probably true of QBasic.
I translated both methods to QBasic functions and saw that method 2 is now longer in lines, 10% slower, and a bit less elegant. This clearly demonstrates that using compound expressions is very beneficial to an interpreted language.
I don't think that's a valid conclusion. That's not to say it isn't true that compound expressions execute more quickly in an interpreted language, but I don't think you have "clearly demonstrated" that here. Nor do I think you have proven that method 1 is superior to method 2 in interpreted languages. All you've proven is that you managed to translate method 1 into code that runs more efficiently in QBasic, whereas you weren't able to do the same for method 2.
I can trivially write method 1 using the same number of lines and the same looping construct as method 2—do you think it'll run more slowly now?
FUNCTION Find1% (Arr%(), Num%)
  Find1% = -1
  posL% = 1
  posR% = UBOUND(Arr%) + 1
  DO WHILE posR% >= posL%
    posM% = posL% + posR%
    posM% = posM% \ 2
    posM% = posM% - 1
    IF Num% < Arr%(posM%) THEN
      posR% = posM%
    ELSEIF Num% > Arr%(posM%) THEN
      posL% = posM% + 2
    ELSE
      Find1% = posM%
      EXIT FUNCTION
    END IF
  LOOP
END FUNCTION
This could suggest a more interesting and more fair strategy for performance comparison.
Also noteworthy is the use of the integer divide operator. It revealed a more than 50% speed increase over the usual divide operator.
Noteworthy, perhaps, but not surprising. If you don't use the integer divide operator, you get a floating-point division. Semantically, then, this is really just a bug: you're dividing two integers using the floating-point division operator.
Why is it so much slower to do the division in floating-point? Well, there are a couple of reasons. First, you have to pay for a conversion from the integer value into a floating-point value, and then a conversion from the floating-point result back into an integer value. Aside from the fact that this pointlessly injects a bunch of extra instructions, it also vastly increases the latency. Second, there's the fact that the QBasic system is so old that it can only emit instructions that use the x87 FPU, which is less efficient than SSE/SSE2 instructions and slower to move values into and out of, because values have to round-trip through memory. In fact, it is probably even worse than that. Given its age, QBasic probably doesn't even use FPU instructions. Back in the day, not all systems had FPUs installed, and floating-point operations had to be simulated using routines provided by a runtime library, where you not only had to pay the cost of a function call, but the simulated code was also much slower than a simple FDIV.
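To put that in instruction terms, here is roughly what the two operators come down to (a sketch assuming an x87 FPU is present; 'sum', 'posM', and 'two' are placeholder operands, and on an FPU-less machine each x87 instruction below becomes an even slower emulation-library call):

; posM% = sum% \ 2  -- integer division:
        mov  ax, [sum]
        cwd                       ; sign-extend AX into DX:AX
        mov  cx, 2
        idiv cx                   ; AX = quotient
        mov  [posM], ax
; posM% = sum% / 2  -- float divide plus two conversions:
        fild  word ptr [sum]      ; integer -> x87 float
        fdiv  dword ptr [two]     ; the slow divide ('two' holds 2.0)
        fistp word ptr [posM]     ; x87 float -> integer (also rounds)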
- Huh. I'm using VS2010, but that's not the assembly it's generating for me at all. We must be using different optimization settings? That's not how /FA is formatting my output either. Honestly, that L1 L2 stuff looks more like gcc? I was going to point out an amusing optimization quirk that produced better code in VS2010, but there's not much point since our output is so very different. Still, an excellent analysis on this rather old question. Hopefully OP will come back and accept. — David Wohlferd, Jan 7, 2017 at 4:02
- @david I reformatted the output for readability, including changing the labels, but didn't change the code itself. This is at /O2 optimization, targeting 32-bit. I just checked again to make sure I didn't accidentally use the wrong output. It does get shorter if you use /O1, but in my experience, MSVC does a number of dumb things at that optimization level, so I never use it globally. In this case, /O1 causes it to re-load values from memory inside of the loop, which doesn't make a lot of sense for an inner loop. What is interesting is that VS 2010 seems to produce better code than VS 2015. — Cody Gray, Jan 7, 2017 at 12:43
- Please forgive me my neglect of your answer these past years. It truly is a very interesting review! One observation about replacing stc + jcxz with cmp + je: this could never produce the wanted CF=1 to indicate failure. What will work is cmp cx, 1 + jb. Again, sorry for the long wait. — Sep Roland, Aug 2, 2020 at 21:00
- […] END IF. In general, it seems like any difference between the assembly version and the QBasic version is simply a reflection of the efficiency of the QBasic compiler. Although I'm not familiar with QBasic. Is it an interpreter?