Find missing number in unsorted input with memory constraints

Question 1

I was reading about the problem of finding the missing number from a series of 4 billion 32 bit integers in Programming Pearls, but could not quite get the solution.

Given a sequential file that contains at most 4 Billion 32-bit integers in random order, find a 32-bit integer not on the file.
Constraint: Few hundrends of bytes of main memory but ample use of extrernal scratch files on disk

The solution described is a process where we separate the numbers in ranges using the the upper bits (I.e. in the first pass we write those with leading 0 bit to one file and those with leading 1 bit to another.We keep going using the second bit etc.) and divide in half using as new search range the half containing less than half of the numbers in the range.

I googled and found a similar post in SO which does not quite answer my question which is how the exact number is found (I undestand how the binary search fits in separate the ranges).
The answer of @Damien_The_Unbeliever seems the most relevant but from my point of view I thought that following the process we would end up with 2 files: A file with 2 numbers in range and a file with 1 number.
By subtracting the (one) number in one file with the largest of the others we can get a missing number (no need of bitmask and I am not quite sure if we could actually apply a bitmask since the range is unknown at any point).

Am I wrong on this? Could someone help figure this out?

Question 2

possible duplicate of Find a missing 32bit integer among a unsorted array containing at most 4 billion ints

Question 3

@MitchWheat:I have already linked to that post and it does not answer my question.That post is about understanding the process to separate range.My question is how to find the actual number.This question is never addressed or answered in that post except a hint by Damien_The_Unbeliever that I already mention

Question 4

You do not need any external storage. Instead of separating numbers by bit k, just count them separately.

Question 5

@n.m. : I'm not quite sure what you mean. Could you elaborate in a full answer?

Question 6

@Li-aung Yip: please ignore my comment, this method is not efficient.

Question 7

You needn't copy the data itself or write anything to disk; just count the members of some partition of the data to identify openings. The tradeoff is between number of passes and memory (more memory allows for more counts, smaller partitions).

Here's a solution in 8 passes. We'll partition the data using 4 bits at a time. 2^4 = 16 possible values, so we'll need 64 bytes to store counts for each of the 16 partitions. So each 4 bit nibble value has an associated count.

Make a pass through the data, incrementing the associated count matching the nibble in the first four bits of the number. If a partition is full, its count will be 2^28. Pick one of the nibbles that isn't full --- this will be the first nibble of your result.

Zero your counts and make another pass, ignoring numbers that don't match the first nibble and incrementing the count associated with the second nibble in the number. A full second nibble will have a value of 2^24. Pick one that isn't full.

Proceed in this manner until you have all 8 nibbles, and there's your answer in O(N).

If you only check one bit at a time, ~~(削除) this would be a binary search (削除ここまで)~~ requiring 32 passes. (EDIT: Not really a binary search, since you still have to read the values that you're skipping. That's why it's O(N). See edit below.) If you have a KB of memory for counts, you can do it in 4 passes; with 256 KB you can do it in 2 --- actually 128 KB since you could then use short ints for your counts. Here we're constrained to a few hundred bytes --- maybe 6 bits/6 passes/256 bytes?

EDIT: Li-aung Yip's solution scales better, and clearly it can be modified to partition by more than one bit in a pass. If the writing is slower than reading, then maybe the best solution would be a hybrid between this read-only answer and Li-aung Yip's disk based one.

Do a pass counting nibbles as above, then as you count the second set of nibbles, write only the numbers (or possibly just the last 28 bits of them) that match the first nibble, into 16 files according to the second nibble.

Pick your second nibble and read it to get counts for the third nibble, writing only the numbers matching the second nibble, etc. If the file is close to capacity, if the numbers are fairly uniformly distributed, or if you pick the least-full nibbles each time, you'll have to write no more than about 6.66% (1/16+1/16^2...) of the file size.

Question 8

I can not follow your solution.Especially how you end up needing 64 bytes for counts etc.Could you elaborate on this?

Question 9

Sure. For convenience, lets talk about our integers in hexadecimal, so each character is 4 bits. On our first pass, we count how many integers start with 0, with 1, with 2, ... e, f. That's sixteen different starting characters, so we'll need 16 integers, or 64 bytes, to count them all in one pass. In the 28 bits that remain, there could be up to 2^28 different combinations for each of the starting characters; if one has less than that, we know that there are numbers beginning with that character which are not in our data, so we can restrict our search to that subset, and so on.

Question 10

After repeated binary partitioning of your numbers into smaller and smaller files, you'll end up with:

a bunch of files that contain two numbers, which only differ in their last significant bit
one file with only one number in it.

Get the missing number by flipping the last bit of the number that's in a file by itself.

Take the example of the numbers from 0x00 to 0x07, missing 0x04:

000
001
010
011
... (missing)
101
110
111

Take 101, flip the least significant bit, and get 100, which is the missing 0x04.

Question 11

Note: I assume there's only one missing integer.

Question 12

+1.This is a better description of how I also suspected the solution.It seems that a bitmap is useless here

Question 13

I don't see how this would change the original problem.There is at least 1 missing and this way we would still get one.Or I am missing your point?

Question 14

It's a bit more complicated if there are two or more numbers missing. What happens if 010 and 011 are both missing in the example above? (It's a trivial problem, it just means you can't do it exactly the way I've described.)

Question 15

Well the problem specifies that there are more numbers than 1 missing and if exactly 1 number was missing from the range, I wouldn't bother with that solution.I would just add all the numbers and subtract from N(N+1)/2 to get the number directly

Question 16

4 Billion integers are representable using a 32 bit integer. XOR ing a number with itself is a standard trick to zero out a register in assembly code. If you are guaranteed that only one number is missing, bitwise XOR on integers comes to the rescue, an O(N) solution, requiring only one additional 32 bit integer of space. Consider a simple example, a 3 bit number, thus numbers 0-7 representable and one of them is missing. Assume 6 (110) is missing missing = n1 XOR n2 XOR n3 XOR .. XOR n7. = 000 XOR 001 XOR 010 XOR 011 XOR 100 XOR 101 XOR 111

Had the problem been find the missing number between 1 and 100, you would need to start of my xoring out the elements that must be excluded. AND could be used drop from a 32 bit integer range to smaller ranges by masking out bits in the number.

score 3 · Accepted Answer · 2012-04-14 23:07:57Z