$\begingroup$

Say I'm constantly given integers from the range $[1,2^{32}]$ in random order and have to record them so that when a duplicate arrives I can deal with it. By the end of the algorithm, all 2ドル^{32}$ values will have been treated.

What's the best way to store which integers have been treated so far, so that we can check whether the current integer is a duplicate?

asked Mar 16, 2019 at 4:20
$\endgroup$
  • $\begingroup$ While I generally agree with the questions D.W. is suggesting, in this case, since you say you know all numbers in the range will show up eventually, a simple bitmap is almost certainly going to dominate on all axes. $\endgroup$ Commented Mar 16, 2019 at 4:33
  • $\begingroup$ @D.W. The reason I didn't specify time vs. space being my main criterion is because I assume if there were two options to those ends we could consider both. As for how many integers, that's mentioned already; how much memory, that's included in time vs. space; what architecture, I would have thought the assumed architecture would be your run-of-the-mill desktop Intel/AMD x86_64 processor, unless you meant what microarchitecture they were using, but I don't think that's relevant. $\endgroup$ Commented Mar 16, 2019 at 5:04

2 Answers

$\begingroup$

Since you know you're going to have to deal with all 2ドル^{32}$ values eventually, you're going to need at least 2ドル^{32}$ bits of memory, one for each value. The pigeonhole principle means that there's no possible way to store all the information you need with fewer bits than this.

So I recommend a straightforward bitmap: a simple array of bits, one per possible value. When a new value comes in, check the corresponding bit: if it's already set, you have a duplicate; if not, set it. This takes the minimum possible amount of space and is extremely fast, since all you have to do is index into an array.
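A minimal sketch of the bitmap idea (illustrative code, not from the answer; the class and method names are my own). At full scale the `bytearray` would hold 2ドル^{32}$ bits, i.e. 2ドル^{29}$ bytes (512 MiB); the logic is identical for the small size used in the demo.

```python
class BitSet:
    """Fixed-size bit array supporting O(1) test-and-set."""

    def __init__(self, n_bits):
        # One bit per possible value, rounded up to whole bytes.
        self.bits = bytearray((n_bits + 7) // 8)

    def seen_before(self, value):
        """Return True if `value` was already marked; mark it either way."""
        byte, mask = value >> 3, 1 << (value & 7)
        duplicate = bool(self.bits[byte] & mask)
        self.bits[byte] |= mask
        return duplicate


marks = BitSet(2**20)            # small demo; the real problem needs BitSet(2**32)
print(marks.seen_before(12345))  # first time: False
print(marks.seen_before(12345))  # second time: True, a duplicate
```

Both operations are a single array index plus a bitwise test, which is why this dominates hash sets or trees here.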

P.S. You need 2ドル^{32}$ bits specifically because the numbers could appear in any order. This means that there are 2ドル^{2^{32}}$ possible states the program could be in, one for every possible combination of seen and not-seen values. Representing this many states will always take at least 2ドル^{32}$ bits. If you knew the numbers would always come in increasing order, on the other hand, there'd only be 2ドル^{32}$ states to distinguish between (since all you care about is the highest value seen so far), so you'd need a minimum of 32ドル$ bits.
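Written out, the counting argument above is just a base-2 logarithm of the number of distinguishable states:

$$\text{bits required} \;\ge\; \log_2\!\left(2^{2^{32}}\right) = 2^{32},$$

whereas in the increasing-order case only $\log_2\!\left(2^{32}\right) = 32$ bits are needed.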

answered Mar 16, 2019 at 5:04
$\endgroup$
  • $\begingroup$ If the values appeared in order, it would also be the case that we have to deal with all 2^32 values eventually, but somehow we only need 32 bits this time, so that argument doesn't work. It's passing through the halfway point with 2^32 uniformly random bits that is interesting $\endgroup$ Commented Mar 16, 2019 at 6:37
  • $\begingroup$ @harold It has to do with how much information we need to save. In this case there are 2ドル^{2^{32}}$ possible states we need to distinguish between. $\endgroup$ Commented Mar 16, 2019 at 14:58
  • $\begingroup$ @Draconis There are 2ドル^{2^{32}}$ possible states because the values can appear in any order. There are only 2ドル^{32}$ states if the values appear in order. I was going to say almost exactly the same thing as harold but decided not to. Focusing on the pigeonhole principle without saying why or even that there are 2ドル^{2^{32}}$ possible states is odd. It's like saying there is an infinitude of primes because of the pigeonhole principle. Your first paragraph is just weirdly and misleadingly written. $\endgroup$ Commented Mar 16, 2019 at 19:54
  • $\begingroup$ @DerekElkins Added an explanation of that $\endgroup$ Commented Mar 16, 2019 at 19:58
  • $\begingroup$ That last point about how knowledge of the order of the integers would decrease the required bits is a nice addition. $\endgroup$ Commented Mar 17, 2019 at 0:39
$\begingroup$

If the integers truly arrive in random order, then a bit array is a good (likely ideal) solution, and would require 512 MiB ($2^{32}$ bits $= 2^{29}$ bytes).

In the real world, information may be unpredictable, but not random, so a number of data structures optimized for sparse information (including sparse missing information) could be more appropriate if memory usage is a concern. For example, a binary tree of run-length encodings.
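One simple way to realize the run-length idea (using a sorted list of disjoint runs rather than a binary tree, for brevity; the class and method names are my own). Memory is proportional to the number of runs, not to the size of the range, so long stretches of consecutive seen values collapse into single entries.

```python
import bisect


class RunSet:
    """Set of integers stored as sorted, disjoint, inclusive (start, end) runs."""

    def __init__(self):
        self.runs = []

    def seen_before(self, v):
        """Return True if v is already present; insert it either way."""
        # Find the last run whose start is <= v.
        i = bisect.bisect_right(self.runs, (v, float("inf"))) - 1
        if i >= 0 and self.runs[i][0] <= v <= self.runs[i][1]:
            return True  # duplicate: v falls inside an existing run
        # Insert v as a fresh one-element run, then merge adjacent neighbours.
        bisect.insort(self.runs, (v, v))
        j = self.runs.index((v, v))
        if j + 1 < len(self.runs) and self.runs[j + 1][0] == v + 1:
            self.runs[j] = (v, self.runs[j + 1][1])   # absorb right neighbour
            del self.runs[j + 1]
        if j > 0 and self.runs[j - 1][1] == v - 1:
            self.runs[j - 1] = (self.runs[j - 1][0], self.runs[j][1])  # absorb left
            del self.runs[j]
        return False


r = RunSet()
r.seen_before(5)
r.seen_before(6)
r.seen_before(4)
print(r.runs)  # the three values merged into the single run (4, 6)
```

A balanced tree (or the production-grade equivalent, a Roaring bitmap) replaces the linear-time list insertion with logarithmic updates, but the run-merging logic is the same.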

Assuming that the integers do not arrive in a truly random order, then the more you know about the pattern, the more potential optimizations will be available to you.

answered Mar 18, 2019 at 12:53
$\endgroup$
