Say I'm constantly given integers from the range $[1,2^{32}]$ in a random order and have to store these so that when a duplicate arrives I can deal with it. By the end of this algorithm all 2ドル^{32}$ values will have been treated.
What's the best way to store which integers have been treated so far so that we can search if the current integer is a duplicate?
-
1$\begingroup$ While I generally agree with the questions D.W. is suggesting, in this case, since you say you know all numbers in the range will show up eventually, a simple bitmap is almost certainly going to dominate on all axes. $\endgroup$Derek Elkins left SE– Derek Elkins left SE2019年03月16日 04:33:30 +00:00Commented Mar 16, 2019 at 4:33
-
$\begingroup$ @D.W. The reason I didn't specify time vs. space being my main criteria is because I assume if there were two options to those ends we could consider both. As for how many integers, that's mentioned already; how much memory, that's included in time vs. space; what architecture, I would have thought the assumed architecture would be your run of the mill desktop Intel/AMD x86_64 processor unless you meant what microarchitecture they were using but I don't think that's relevant. $\endgroup$Ozaner Hansha– Ozaner Hansha2019年03月16日 05:04:39 +00:00Commented Mar 16, 2019 at 5:04
2 Answers 2
Since you know you're going to have to deal with all 2ドル^{32}$ values eventually, you're going to need at least 2ドル^{32}$ bits of memory, one for each value. The pigeonhole principle means that there's no possible way to store all the information you need with fewer bits than this.
So I recommend a straightforward bitmap. In other words, a simple array of bits. When a new value comes in, see if the corresponding bit is on or not; if it's off, flip it on. This takes the minimum possible amount of space, and is extremely fast (since all you have to do is index into an array).
P.S. You need 2ドル^{32}$ bits specifically because the numbers could appear in any order. This means that there are 2ドル^{2^{32}}$ possible states the program could be in, one for every possible combination of seen and not-seen values. Representing this many states will always take at least 2ドル^{32}$ bits. If you knew the numbers would always come in increasing order, on the other hand, there'd only be 2ドル^{32}$ states to distinguish between (since all you care about is the highest value seen so far), so you'd need a minimum of 32ドル$ bits.
-
$\begingroup$ If the values appeared in order, it would also be the case that we have to deal with all 2^32 values eventually, but somehow we only need 32 bits this time, so that argument doesn't work. It's passing through the halfway point with 2^32 uniformly random bits that is interesting $\endgroup$user555045– user5550452019年03月16日 06:37:41 +00:00Commented Mar 16, 2019 at 6:37
-
$\begingroup$ @harold It has to do with how much information we need to save. In this case there are 2ドル^{2^{32}}$ possible states we need to distinguish between. $\endgroup$Draconis– Draconis2019年03月16日 14:58:21 +00:00Commented Mar 16, 2019 at 14:58
-
$\begingroup$ @Draconis There are 2ドル^{2^{32}}$ possible states because the values can appear in any order. There are only 2ドル^{32}$ states if the values appear in order. I was going to say almost exactly the same thing as harold but decided not to. Focusing on the pigeonhole principle without saying why or even that there are 2ドル^{2^{32}}$ possible states is odd. It's like saying there is an infinitude of primes because of the pigeonhole principle. Your first paragraph is just weirdly and misleadingly written. $\endgroup$Derek Elkins left SE– Derek Elkins left SE2019年03月16日 19:54:55 +00:00Commented Mar 16, 2019 at 19:54
-
$\begingroup$ @DerekElkins Added an explanation of that $\endgroup$Draconis– Draconis2019年03月16日 19:58:18 +00:00Commented Mar 16, 2019 at 19:58
-
$\begingroup$ That last point about how knowledge of the order of the integers would decrease the required bits is a nice addition. $\endgroup$Ozaner Hansha– Ozaner Hansha2019年03月17日 00:39:21 +00:00Commented Mar 17, 2019 at 0:39
If the integers truly arrive in random order, then a bit array is a good (likely ideal) solution, and would require 512MB.
In the real world, information may be unpredictable, but not random, so a number of data structures optimized for sparse information (including sparse missing information) could be more appropriate if memory usage is a concern. For example, a binary tree of run-length encodings.
Assuming that the integers do not arrive in a truly random order, then the more you know about the pattern, the more potential optimizations will be available to you.
Explore related questions
See similar questions with these tags.