6

I am reading book on programming pearls.

Question: Given a sequential file that contains at most four billion 32 bit integers in random order, find a 32-bit integer that isn't in the file (and there must be at least one missing). This problem has to be solved if we have a few hundred bytes of main memory and several sequential files.

Solution: To set this up as a binary search we have to define a range, a representation for the elements within the range, and a probing method to determine which half of a range holds the missing integer. How do we do this?

We'll use as the range a sequence of integers known to contain atleast one missing element, and we'll represent the range by a file containing all the integers in it. The insight is that we can probe a range by counting the elements above and below its midpoint: either the upper or the lower range has atmost half elements in the total range. Because the total range has a missing element, the smaller half must also have a mising element. These are most ingredients of a binary search algorithm for above problem.

Above text is copy right of Jon Bently from programming pearls book.

Some info is provided at following link

"Programming Pearls" binary search help

How do we search by passes using binary search and also not followed with the example given in above link? Please help me understand logic with just 5 integers rather than million integers to understand logic.

asked Sep 7, 2012 at 8:55

5 Answers 5

3

Why don't you re-read the answer in the post "Programming Pearls" binary search help. It explains the process on 5 integers as you ask.
The idea is that you parse each list and break it into 2 (this is where binary part comes from) separate lists based on the value in the first bit.

I.e. showing binary representation of actual numbers Original List "": 001, 010, 110, 000, 100, 011, 101 => (broken into)
(we remove the first bit and append it to the "name" of the new list)
To form each of the bellow lists we took values starting with [0 or 1] from the list above
List "0": 01, 10, 00, 11 (is formed from subset 001, 010, 000, 011 of List "" by removing the first bit and appending it to the "name" of the new list)
List "1": 10, 00, 01 (is formed from subset 110, 100, 101 of List "" by removing the first bit and appending it to the "name" of the new list)

Now take one of the resulting lists in turn and repeat the process:
List "0" becomes your original list and you break it into
List "0***0**" and
List "0***1**" (the bold numbers are again the 1 [remaining] bit of the numbers in the list being broken)

Carry on until you end up with the empty list(s).

EDIT
Process step by step:
List "": 001, 010, 110, 000, 100, 011, 101 =>
List "0": 01, 10, 00, 11 (from subset 001, 010, 000, 011 of the List "") =>
List "00": 1, 0 (from subset 01, 00 of the List "0") =>
List "000": 0 [final result] (from subset 0 of the List "00")
List "001": 1 [final result] (from subset 1 of the List "00")
List "01": 0, 1 (from subset 10, 11 of the List "0") =>
List "010": 0 [final result] (from subset 0 of the List "01")
List "011": 1 [final result] (from subset 1 of the List "01")
List "1": 10, 00, 01 (from subset 110, 100, 101 of the List "") =>
List "10": 0, 1 (from subset 00, 01 of the List "1") =>
List "100": 0 [final result] (from subset 0 of the List "10")
List "101": 1 [final result] (from subset 1 of the List "10")
List "11": 0 (from subset 10 of the List "1") =>
List "110": 0 [final result] (from subset 0 of the List "11")
List "111": absent [final result] (from subset EMPTY of the List "11")

The positive of this method is that it will allow you to find ANY number of missing numbers in the set - i.e. if more than one is missing.

P.S. AFAIR for 1 single missing number out of the complete range there is even more elegant solution of XOR all numbers.

answered Sep 7, 2012 at 9:08

10 Comments

How you please eloborate ho we do this with XOR all numbers?
The XOR all numbers method relies on the fact that the result of 1 XOR 0 = 0. Numbers that complement each other will result in 0 and only the number that does NOT have a complement will remain. Again based on 3 bit numbers (the principal works for ANY number of bits): Given a list [as above] 001, 010, 110, 000, 100, 011, 101 => 001 complements 110 (in 3 bits), so the result of 001 ^ 110 => 0...
Thanks for explanation. As mentioned above why did you take list "0"? and how did you break it into List"0*0*" and List "0*1**"? Sorry I am not following logic above. Can you please eloborate?
@GermannArlington 1 XOR 0 = 0 maybe you meant 1 XOR 0 = 1 ?
@GermannArlington which is false. XOR stands for "exclusive or" which means a XOR b is true only if a is true or b is true, not both.
|
1

The idea is to solve easier problem:

Is the missing value in range [minVal, X] or (X, maxVal). If you know this, you can move X and check again.

For example, you have 3, 4, 1, 5 (2 is missing). You know that minVal = 1, maxVal = 5.

  1. Range = [1, 5], X = 3, there should be 3 integers in range [1, 3] and 2 in range [4, 5]. There are only 2 in range [1, 3], so you are looking in range [1, 3]
  2. Range = [1, 3], X = 2. There are only 1 value in range [1, 2], so you are looking in range [1, 2]
  3. Range = [1, 2], X = 1. There are no values in range [2, 2] so it is your answer.

EDIT: Some pseudo-C++ code:

minVal = 1, maxVal = 5; //choose correct values
while(minVal < maxVal){
 int X = (minVal + maxVal) / 2
 int leftNumber = how much in range [minVal, X]
 int rightNumber = how much in range [X + 1, maxVal]
 if(leftNumber < (X - minVal + 1))maxVal = X
 else minVal = X + 1
}
answered Sep 7, 2012 at 9:08

7 Comments

It is floor((begin + end) / 2), in C++ you can write simply (begin + end) / 2.
Thanks for explanation. In stackoverflow.com/questions/5011589/… . Here we have 000, 001, 110, 100,111. Mentioned that after first pass we have 000, 001, 110, 100,111. Then we see second bit. I am understanding how after first pass we got same values again.
After first pass we know that missing number's first bit is 0. So Second iteration is checking if second bit is 0 or 1. There is no number with first bit 0 and second 1. All numbers with first bit 0 have second bit 0, so we got same values.
How Do we select that "leftNumber" and "rightNumber" fast ? guess each time it needs O(n) operations if you're counting them with a for loop :-?
@SpiXel Yes. It's because we need to solve problem in few hundred bytes of memory. So we can't load all data to memory.
|
1

Here's a simple C solution which should illustrate the technique. To abstract away any tedious file I/O details, I'm assuming the existence of the following three functions:

  • unsigned long next_number (void) reads a number from the file and returns it. When called again, the next number in the file is returned, and so on. Behavior when the end of file is encountered is undefined.

  • int numbers_left (void) returns a true value if there are more numbers available to be read using next_number(), false if the end of the file has been reached.

  • void return_to_start (void) rewinds the reading position to the start of the file, so that the next call to next_number() returns the first number in the file.

I'm also assuming that unsigned long is at least 32 bits wide, as required for conforming ANSI C implementations; modern C programmers may prefer to use uint32_t from stdint.h instead.

Given these assumptions, here's the solution:

unsigned long count_numbers_in_range (unsigned long min, unsigned long max) {
 unsigned long count = 0;
 return_to_start();
 while ( numbers_left() ) {
 unsigned long num = next_number();
 if ( num >= min && num <= max ) {
 count++;
 }
 }
 return count;
}
unsigned long find_missing_number (void) {
 unsigned long min = 0, max = 0xFFFFFFFF;
 while ( min < max ) {
 unsigned long midpoint = min + (max - min) / 2;
 unsigned long count = count_numbers_in_range( min, midpoint );
 if ( count < midpoint - min + 1 ) {
 max = midpoint; // at least one missing number below midpoint
 } else {
 min = midpoint; // no missing numbers below midpoint, must be above
 }
 }
 return min;
}

One detail to note is that min + (max - min) / 2 is the safe way to calculate the average of min and max; it won't produce bogus results due to overflowing intermediate values like the seemingly simpler (min + max) / 2 might.

Also, even though it would be tempting to solve this problem using recursion, I chose an iterative solution instead for two reasons: first, because it (arguably) shows more clearly what's actually being done, and second, because the task was to minimize memory use, which presumably includes the stack too.

Finally, it would be easy to optimize this code, e.g. by returning as soon as count equals zero, by counting the numbers in both halves of the range in one pass and choosing the one with more missing numbers, or even by extending the binary search to n-ary search for some n> 2 to reduce the number of passes. However, to keep the example code as simple as possible, I've left such optimizations unmade. If you like, you may want to, say, try modifying the code so that it requires at most eight passes over the file instead of the current 32. (Hint: use a 16-element array.)

answered Sep 7, 2012 at 11:11

3 Comments

Is this really a form of binary search? The input isn't sorted, so you have to go through all N elements to count the numbers in each range.
@BobTempl: It's a binary search in the space of 32-bit integers. We're not looking for the missing number in the file (which would be hopeless, since, by definition, it's not there); we're looking for it in the range of numbers from 0 to 0xFFFFFFFF. By counting the numbers in the intersection of a given subrange with the contents of the file, we can tell whether the missing number (or at least one of them, if several) is in that subrange.
Ah I see, that makes a lot more sense.
0

Actually, if we have range of integers from a to b. Sample: [a..b]. And in this range we have b-a integers. It means, that only one is missing. And if only one is missing, we can calculate result using only single cycle. First we can calculate sum of all integers in range [a..b], which equals:

sum = (a + b) * (b - a + 1) / 2

Then we calcualate summ of all integers in our sequence:

long sum1 = 0;
for (int i = 0; i < b - a; i++)
sum1 += arr[i];

Then we can find missing element as difference of those two sums:

long result = sum1 - sum;

answered Sep 7, 2012 at 11:35

Comments

0

when you've seen 2^31 zeros or ones in the ith digit place then your answer has a one or zero in the ith place. (Ex: 2^31 ones in 5th binary position means the answer has a zero in the 5th binary position.

First draft of c code:

uint32_t binaryHistogram[32], *list4BILLION, answer, placesChecked[32];
uint64_t limit = 4294967296;
uint32_t halfLimit = 4294967296/2;
int i, j, done
//General method to point to list since this detail is not important to the question.
list4BILLION = 0000000000h;
//Initialize array to zero. This array represents the number of 1s seen as you parse through the list
for(i=0;i<limit;i++)
{ 
 binaryHistogram[i] = 0;
}
//Only sum up for first half of the 4 billion numbers
for(i=0;i<halfLimit;i++)
{
 for(j=0;j<32;j++)
 {
 binaryHistogram[j] += ((*list4BILLION) >> j);
 }
}
//Check each ith digit to see if all halfLimit values have been parsed
for(i=halfLimit;i<limit;i++)
{
 for(j=0;j<32;j++)
 {
 done = 1; //Dont need to continue to the end if placesChecked are all 
 if(placesChecked[j] != 0) //Dont need to pass through the whole list
 {
 done = 0; //
 binaryHistogram[j] += ((*list4BILLION) >> j);
 if((binaryHistogram[j] > halfLimit)||(i - binaryHistogram[j] == halfLimit))
 {
 answer += (1 << j);
 placesChecked[j] = 1;
 }
 }
 }
}
answered May 31, 2014 at 23:26

Comments

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.