$\begingroup$

Say I'm constantly given integers from the range $[1,2^{32}]$ in random order and have to record them so that when a duplicate arrives I can deal with it. By the end of the algorithm, all 2ドル^{32}$ values will have been treated.

What's the best way to store which integers have been treated so far, so that we can check whether the current integer is a duplicate?

asked Mar 16, 2019 at 4:20
$\endgroup$
  • $\begingroup$ While I generally agree with the questions D.W. is suggesting, in this case, since you say you know all numbers in the range will show up eventually, a simple bitmap is almost certainly going to dominate on all axes. $\endgroup$ Commented Mar 16, 2019 at 4:33
  • $\begingroup$ @D.W. The reason I didn't specify time vs. space being my main criterion is because I assume if there were two options to those ends we could consider both. As for how many integers, that's mentioned already; how much memory, that's included in time vs. space; what architecture, I would have thought the assumed architecture would be your run-of-the-mill desktop Intel/AMD x86_64 processor, unless you meant what microarchitecture they were using, but I don't think that's relevant. $\endgroup$ Commented Mar 16, 2019 at 5:04

2 Answers

$\begingroup$

Since you know you're going to have to deal with all 2ドル^{32}$ values eventually, you're going to need at least 2ドル^{32}$ bits of memory, one for each value. The pigeonhole principle means that there's no possible way to store all the information you need with fewer bits than this.

So I recommend a straightforward bitmap: a simple array of bits, one per possible value. When a new value comes in, check the corresponding bit: if it's already set, you have a duplicate; if not, set it. This takes the minimum possible amount of space and is extremely fast, since all you have to do is index into an array.
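A minimal sketch of the bitmap idea (illustrative code, not from the answer; the class and method names are my own). At full scale the `bytearray` would hold 2ドル^{32}$ bits, i.e. 2ドル^{29}$ bytes (512 MiB); the logic is identical for the small size used in the demo.

```python
class BitSet:
    """Fixed-size bit array supporting O(1) test-and-set."""

    def __init__(self, n_bits):
        # One bit per possible value, rounded up to whole bytes.
        self.bits = bytearray((n_bits + 7) // 8)

    def seen_before(self, value):
        """Return True if `value` was already marked; mark it either way."""
        byte, mask = value >> 3, 1 << (value & 7)
        duplicate = bool(self.bits[byte] & mask)
        self.bits[byte] |= mask
        return duplicate


marks = BitSet(2**20)            # small demo; the real problem needs BitSet(2**32)
print(marks.seen_before(12345))  # first time: False
print(marks.seen_before(12345))  # second time: True, a duplicate
```

Both operations are a single array index plus a bitwise test, which is why this dominates hash sets or trees here.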

P.S. You need 2ドル^{32}$ bits specifically because the numbers could appear in any order. This means that there are 2ドル^{2^{32}}$ possible states the program could be in, one for every possible combination of seen and not-seen values. Representing this many states will always take at least 2ドル^{32}$ bits. If you knew the numbers would always come in increasing order, on the other hand, there'd only be 2ドル^{32}$ states to distinguish between (since all you care about is the highest value seen so far), so you'd need a minimum of 32ドル$ bits.
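Written out, the counting argument above is just a base-2 logarithm of the number of distinguishable states:

$$\text{bits required} \;\ge\; \log_2\!\left(2^{2^{32}}\right) = 2^{32},$$

whereas in the increasing-order case only $\log_2\!\left(2^{32}\right) = 32$ bits are needed.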

answered Mar 16, 2019 at 5:04
$\endgroup$
  • $\begingroup$ If the values appeared in order, it would also be the case that we have to deal with all 2^32 values eventually, but somehow we only need 32 bits this time, so that argument doesn't work. It's passing through the halfway point with 2^32 uniformly random bits that is interesting $\endgroup$ Commented Mar 16, 2019 at 6:37
  • $\begingroup$ @harold It has to do with how much information we need to save. In this case there are 2ドル^{2^{32}}$ possible states we need to distinguish between. $\endgroup$ Commented Mar 16, 2019 at 14:58
  • $\begingroup$ @Draconis There are 2ドル^{2^{32}}$ possible states because the values can appear in any order. There are only 2ドル^{32}$ states if the values appear in order. I was going to say almost exactly the same thing as harold but decided not to. Focusing on the pigeonhole principle without saying why or even that there are 2ドル^{2^{32}}$ possible states is odd. It's like saying there is an infinitude of primes because of the pigeonhole principle. Your first paragraph is just weirdly and misleadingly written. $\endgroup$ Commented Mar 16, 2019 at 19:54
  • $\begingroup$ @DerekElkins Added an explanation of that $\endgroup$ Commented Mar 16, 2019 at 19:58
  • $\begingroup$ That last point about how knowledge of the order of the integers would decrease the required bits is a nice addition. $\endgroup$ Commented Mar 17, 2019 at 0:39
$\begingroup$

If the integers truly arrive in random order, then a bit array is a good (likely ideal) solution, and would require 512 MiB ($2^{32}$ bits $= 2^{29}$ bytes).

In the real world, information may be unpredictable, but not random, so a number of data structures optimized for sparse information (including sparse missing information) could be more appropriate if memory usage is a concern. For example, a binary tree of run-length encodings.
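One simple way to realize the run-length idea (using a sorted list of disjoint runs rather than a binary tree, for brevity; the class and method names are my own). Memory is proportional to the number of runs, not to the size of the range, so long stretches of consecutive seen values collapse into single entries.

```python
import bisect


class RunSet:
    """Set of integers stored as sorted, disjoint, inclusive (start, end) runs."""

    def __init__(self):
        self.runs = []

    def seen_before(self, v):
        """Return True if v is already present; insert it either way."""
        # Find the last run whose start is <= v.
        i = bisect.bisect_right(self.runs, (v, float("inf"))) - 1
        if i >= 0 and self.runs[i][0] <= v <= self.runs[i][1]:
            return True  # duplicate: v falls inside an existing run
        # Insert v as a fresh one-element run, then merge adjacent neighbours.
        bisect.insort(self.runs, (v, v))
        j = self.runs.index((v, v))
        if j + 1 < len(self.runs) and self.runs[j + 1][0] == v + 1:
            self.runs[j] = (v, self.runs[j + 1][1])   # absorb right neighbour
            del self.runs[j + 1]
        if j > 0 and self.runs[j - 1][1] == v - 1:
            self.runs[j - 1] = (self.runs[j - 1][0], self.runs[j][1])  # absorb left
            del self.runs[j]
        return False


r = RunSet()
r.seen_before(5)
r.seen_before(6)
r.seen_before(4)
print(r.runs)  # the three values merged into the single run (4, 6)
```

A balanced tree (or the production-grade equivalent, a Roaring bitmap) replaces the linear-time list insertion with logarithmic updates, but the run-merging logic is the same.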

Assuming that the integers do not arrive in a truly random order, then the more you know about the pattern, the more potential optimizations will be available to you.

answered Mar 18, 2019 at 12:53
$\endgroup$
