I was reading about the problem of finding the missing number from a series of 4 billion 32 bit integers in Programming Pearls
, but could not quite get the solution.
Given a sequential file that contains at most 4 Billion 32-bit integers in random order, find a 32-bit integer not on the file.
Constraint: Few hundrends of bytes of main memory but ample use of extrernal scratch files on disk
The solution described is a process where we separate the numbers in ranges using the the upper bits (I.e. in the first pass we write those with leading 0
bit to one file and those with leading 1
bit to another.We keep going using the second bit etc.) and divide in half using as new search range the half containing less than half of the numbers in the range.
I googled and found a similar post in SO which does not quite answer my question which is how the exact number is found (I undestand how the binary search fits in separate the ranges).
The answer of @Damien_The_Unbeliever seems the most relevant but from my point of view I thought that following the process we would end up with 2 files: A file with 2 numbers in range and a file with 1 number.
By subtracting the (one) number in one file with the largest of the others we can get a missing number (no need of bitmask and I am not quite sure if we could actually apply a bitmask since the range is unknown at any point).
Am I wrong on this? Could someone help figure this out?
3 Answers 3
You needn't copy the data itself or write anything to disk; just count the members of some partition of the data to identify openings. The tradeoff is between number of passes and memory (more memory allows for more counts, smaller partitions).
Here's a solution in 8 passes. We'll partition the data using 4 bits at a time. 2^4 = 16 possible values, so we'll need 64 bytes to store counts for each of the 16 partitions. So each 4 bit nibble value has an associated count.
Make a pass through the data, incrementing the associated count matching the nibble in the first four bits of the number. If a partition is full, its count will be 2^28. Pick one of the nibbles that isn't full --- this will be the first nibble of your result.
Zero your counts and make another pass, ignoring numbers that don't match the first nibble and incrementing the count associated with the second nibble in the number. A full second nibble will have a value of 2^24. Pick one that isn't full.
Proceed in this manner until you have all 8 nibbles, and there's your answer in O(N).
If you only check one bit at a time, (削除) this would be a binary search (削除ここまで) requiring 32 passes. (EDIT: Not really a binary search, since you still have to read the values that you're skipping. That's why it's O(N). See edit below.) If you have a KB of memory for counts, you can do it in 4 passes; with 256 KB you can do it in 2 --- actually 128 KB since you could then use short ints for your counts. Here we're constrained to a few hundred bytes --- maybe 6 bits/6 passes/256 bytes?
EDIT: Li-aung Yip's solution scales better, and clearly it can be modified to partition by more than one bit in a pass. If the writing is slower than reading, then maybe the best solution would be a hybrid between this read-only answer and Li-aung Yip's disk based one.
Do a pass counting nibbles as above, then as you count the second set of nibbles, write only the numbers (or possibly just the last 28 bits of them) that match the first nibble, into 16 files according to the second nibble.
Pick your second nibble and read it to get counts for the third nibble, writing only the numbers matching the second nibble, etc. If the file is close to capacity, if the numbers are fairly uniformly distributed, or if you pick the least-full nibbles each time, you'll have to write no more than about 6.66% (1/16+1/16^2...) of the file size.
2 Comments
64
bytes for counts etc.Could you elaborate on this?After repeated binary partitioning of your numbers into smaller and smaller files, you'll end up with:
- a bunch of files that contain two numbers, which only differ in their last significant bit
- one file with only one number in it.
Get the missing number by flipping the last bit of the number that's in a file by itself.
Take the example of the numbers from 0x00 to 0x07, missing 0x04:
000
001
010
011
... (missing)
101
110
111
Take 101
, flip the least significant bit, and get 100
, which is the missing 0x04
.
9 Comments
010
and 011
are both missing in the example above? (It's a trivial problem, it just means you can't do it exactly the way I've described.)N(N+1)/2
to get the number directly4 Billion integers are representable using a 32 bit integer. XOR ing a number with itself is a standard trick to zero out a register in assembly code. If you are guaranteed that only one number is missing, bitwise XOR on integers comes to the rescue, an O(N) solution, requiring only one additional 32 bit integer of space. Consider a simple example, a 3 bit number, thus numbers 0-7 representable and one of them is missing. Assume 6 (110) is missing missing = n1 XOR n2 XOR n3 XOR .. XOR n7. = 000 XOR 001 XOR 010 XOR 011 XOR 100 XOR 101 XOR 111
Had the problem been find the missing number between 1 and 100, you would need to start of my xoring out the elements that must be excluded. AND could be used drop from a 32 bit integer range to smaller ranges by masking out bits in the number.
Comments
Explore related questions
See similar questions with these tags.
Damien_The_Unbeliever
that I already mention