Let's suppose we have this expression:
(i % 1000000) * 1000 + ms
where i
is an always increasing number, and ms
is the millisecond part of the current time ( ranging 0..999). So each time we are calling the expression above, at random intervals, while i
is increasing always obtaining all unique values, the result of the expression above will potentially returns duplicated, intuitively. How to show this in an acceptable form? Is there a way to show the probability next iteration will generate a duplicate?
3 Answers 3
OK so in practice you will get a repeat at 1 million and a few. Not at 1 billion and 1.
You have 1000 random values 0-999. each number after the first million has a 1 in a thousand chance of hitting the same random number as its previous twin number.
So the chance of getting a duplicate on each number after 1m is 1/1000. Each time you take another number, the cumulative chance of getting a duplicate is higher. After 1,000,000 + n numbers the probability of a duplicate having occurred, p, is:
p = 1-(1/1000)^n
ie. 693 rolls of that dice later and you have a 50% chance of having had a duplicate. 5000 rolls and that's risen to 99.3%
(Obviously if you get to 2m then you have a 2/1000 chance for each number and so on, so you would need a step function to model the probability for all n)
-
Pretty sure your maths is wrong here. There is zero chance that the 1,000,002nd invocation collides with anything other than the 2nd invocation, because it has the form 000002xxx and the only previous invocation with that pattern was the 2nd invocation. It's 1,000,000 separate buckets each of size 1,000, so starting to get a reasonable chance of a collision when there are sqrt(1000) = 33 or so in each bucket, or 33M invocations. Still less than 1B but much more than 1,000,693.Philip Kendall– Philip Kendall2022年07月12日 22:20:17 +00:00Commented Jul 12, 2022 at 22:20
-
so 1,000,002 has a 1/1000 chance, 1,000,003 has a 1/1000 chance... etc by 1,000,693 you have had to be lucky for none of those chances to come upEwan– Ewan2022年07月12日 22:32:08 +00:00Commented Jul 12, 2022 at 22:32
-
you would only hit 1b without a repeat if all the first 1m were xxxxxx000, the second 1m were xxxxxx001 etcEwan– Ewan2022年07月12日 22:40:25 +00:00Commented Jul 12, 2022 at 22:40
-
1+1 It's quite plausible to treat the
ms
value as a random value, independent ofi
. And then the math is correct.Ralf Kleberhoff– Ralf Kleberhoff2022年07月13日 08:56:34 +00:00Commented Jul 13, 2022 at 8:56
- The minimum value of your expression is 0, when
i % 1000000 = 0
andms = 0
. - The maximum value of your expression is 999999999, when
i % 1000000 = 999999
andms = 999
.
Therefore via the pigeonhole principle after 1000000001 evaluations it must generate a duplicate.
-
let' suppose I have a call when i is one million and ms is 1, then a call with i = 20000001 and ms is zero, I have a duplicate, am I wrong?Felice Pollano– Felice Pollano2022年07月12日 07:03:10 +00:00Commented Jul 12, 2022 at 7:03
-
I realised I mis-parsed your expression so I've re-written my answer. This is now pretty much the classic solution to this problem.Philip Kendall– Philip Kendall2022年07月12日 08:02:40 +00:00Commented Jul 12, 2022 at 8:02
-
Although true I think this is misleading. Say you are in the meeting where someone is saying this is a fine method of generating non-repeating numbers. You say, but it will definitely repeat after 1 billion! They reply "PAH! 1 billion!! we will never get to that many!!" When in actual fact you have a very high chance of getting a duplicate after 1 million.Ewan– Ewan2022年07月12日 21:40:57 +00:00Commented Jul 12, 2022 at 21:40
-
@Ewan True. With this specification I can implement it so that it must repeat after 1000. Ever increasing doesn't have to mean +1.candied_orange– candied_orange2022年07月14日 18:07:06 +00:00Commented Jul 14, 2022 at 18:07
With size-limited data types, there must eventually be duplicates, as they can only have a limited number of different states (e.g. 2^32 for the typical 32-bit integer).
In your case, it's even more limited:
(i % 1000000)
can have one million different values (if we allow negativei
values and the typical interpretation of the%
operator, it's nearly two millions, 1999999 values, but I assumei
will always be positive).- Then (i % 1000000) * 1000 can also have exactly one million different values.
ms
can have one thousand different values.- The sum
(i % 1000000) * 1000 + ms
combines one million cases with one thousand cases, giving one billion different cases.
So, at least after one billion invocations, there must be a duplicate.
But there's the peculiar combination of a counter i
with a real-time milliseconds value.
If you systematically increment i
by one with every invocation, the earliest possible duplicate comes after one million invocations.
Depending on the time span for a million invocations, it's highly unlikely that you get one billion different values before the first duplicate. I'd expect something between one and two million invocations, unless you reach multi-million invocations per second.
If you want to make the best duplicates-free use of the range 0...999999999, simply use:
i % 1000000000
Then you're sure that you have the first duplicate exactly after one billion invocations.
-
1Very minor point: the first duplicate will occur after 1 billion plus one invocations, not one billion.Philip Kendall– Philip Kendall2022年07月12日 09:07:55 +00:00Commented Jul 12, 2022 at 9:07
f(i=x, ms) = f(i=x+n*1000000, ms)
i<1000000
, does not happen. Ifi>=1000000
, 100% guaranteed.i
is an always increasing number". Increasing by how much? Without specifying that the first 6 digits can behave wildly different. Everything from some never changing 6 digit number to completely random 6 digit numbers. Why? Because "always increasing" can be satisfied by digits we never see. Doesn't change 1000000001 evaluations forcing a duplicate but it impacts the probability of previous evaluations.