Log collisions in generating unique random strings

Question 1

I've created a small class to generate unique random strings. It works fine. I also want to log the collisions of the generated strings. I check if the generated string is in the array for uniquness:

public static function add(&$array = [], $length = 1) {
 $hash = self::rand_str($length);
 while (in_array($hash, $array, true)) {
 $hash = self::rand_str($length);
 }
 $array[] = $hash;
}

To log the collisions, I need to modify the while loop. I can push the hash value into an array in the while loop to log the collisions, but I want something more sophisticated.

public static function add(&$array = [], $length = 1) {
 $hash = self::rand_str($length);
 while (in_array($hash, $array, true)) {
 self::$repeats[] = $hash;
 $hash = self::rand_str($length);
 }
 $array[] = $hash;
 return self::$repeats;
}

For example, it might generate a few non-unique strings in a row, then while will run a few times. I want to log how many times while ran and what was the new string. The code above would output something like this:

$repeats = ["Usm", "rl2", "mJN", "XBU"];

This doesn't tell much. Have the collisions took place one right after another (in the same add())? What was the new unique random string that substituted the collision?

I want an output like this:

$repeats = [
 // "while" ran count($repeats[i][0]) times
 // repeats new value
 [ ["Usm", "rl2", "mJN"], "XBU"], // add()
 [ ["6cH" ], "GDd"], // add()
 [ ["Kid", "Yjs" ], "no7"], // add()
];

I've created a method to log the collisions but it needs further work. I couldn't handle the forming of $repeats array.

public static function log_repeat($hash, $new = false) {
 if ($new) {
 self::$repeats[count(self::$repeats)-1][] = $hash;
 } else {
 if (count(self::$repeats) == 0) {
 self::$repeats[] = [[$hash]];
 } else {
 self::$repeats[count(self::$repeats)-1][0][] = $hash;
 }
 }
}
public static function add(&$array = [], $length = 1) {
 $hash = self::rand_str($length);
 $repeat_occured = false;
 while (in_array($hash, $array, true)) {
 $repeat_occured = true;
 self::log_repeat($hash);
 $hash = self::rand_str($length);
 }
 if ($repeat_occured) {
 self::log_repeat($hash, true);
 }
 $array[] = $hash;
 return self::$repeats;
}

This just pushes all the collisions into a single array, and unique strings are added next to each other:

$repeats = [
 [ ["URF", "9IV", "2Z3", "STj", "z1T"], "v4Q", "f9S", "6T3", "6jy" ]
];

You can't tell which new string substituted which repeat. Each add() should have its own row.

I know I need to modify add() and log(), but I'm stuck atm.

Full code:

class Hash {
 public static $char_set = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
 public static $perm_limit = 0;
 public static $repeats = [];
 public static function rand_str($length = 1) {
 $string = "";
 for ($i = 0; $i < $length; $i++) {
 $rand = rand(0, strlen(self::$char_set)-1);
 $string .= self::$char_set[$rand];
 }
 return $string;
 }
 public static function log_repeat($hash, $new = false) {
 if ($new) {
 self::$repeats[count(self::$repeats)-1][] = $hash;
 } else {
 if (count(self::$repeats) == 0) {
 self::$repeats[] = [[$hash]];
 } else {
 self::$repeats[count(self::$repeats)-1][0][] = $hash;
 }
 }
 }
 public static function add(&$array = [], $length = 1) {
 $hash = self::rand_str($length);
 $repeat_occured = false;
 while (in_array($hash, $array, true)) {
 $repeat_occured = true;
 self::log_repeat($hash);
 $hash = self::rand_str($length);
 }
 if ($repeat_occured) {
 self::log_repeat($hash, true);
 }
 $array[] = $hash;
 return self::$repeats;
 }
 public static function fill(&$array = [], $count = 1, $length = 1) {
 self::$perm_limit = pow(strlen(self::$char_set), $length);
 for ($i = 0; $i < $count; $i++) {
 if (count($array) >= self::$perm_limit) {
 break;
 }
 self::add($array, $length);
 }
 return self::$repeats;
 }
}

Test case:

$hash_list = []
$repeats = Hash::fill($hash_list, 1000, 3);
var_dump($repeats);
var_dump($hash_list);

Question 2

"Were they generated sequentially? ". Well, I certainly hope so. Why would you generate hashes if they didn't collide?

Question 3

@Sumurai8 I was talking about collisions. Let's say I have [1,2,3]. I generate a new number, it's 1. It collides. I generate another one, 2. Collides again. I generate 4. Doesn't collide. Added to the array: [1,2,3,4]. I generate 5. Doesn't collide. Added to the array: [1,2,3,4,5]. Now,1 and 2 were generated sequentially (right after each other, in the same add()). So the output row would be [[1,2], 4]. 5 was added in another add().

Question 4

You use in_array. This is bad, because in_array is O(n) - it goes through all array. In similar cases you should store hashes as keys and use either isset($arr[$hash]) or array_key_exists. Both are O(1) generally but devolve into O(n) as collisions pile up. However, even then they are faster because they only mean iterating over collisions, not all elements. More info here.
You use process that allows collisions (which you don't need as I presume) and then try to solve that by storing generated strings (which would hog up memory if executed lots of times, like in your fill). Let's look deeper into that:

generate unique random strings

How random those strings need to be?

If it's something about cryptography then use of rand is a bad idea and implementing stuff of that sort manually is a bad idea too.

If it's not related to cryptography and you're OK with PRNG (rand uses one of those too) then it could be easier to simply make your own PRNG that generates random-looking unique strings.

Well, it's impossible to keep generating unique strings of some limited length indefinitely but it is possible to generate all the strings before producing first collision.

Take a look at something like this. important part is:

LFSR output streams are deterministic. If the present state and the positions of the XOR gates in the LFSR are known, the next state can be predicted.

Which means amount of information one needs to store does not grow with time. We shall use something like that too.

Actually, if we squint while looking at your alphabet we can see that it is 26 +たす 26 +たす 10 =わ 62 characters long. Which means, if we add couple more characters - say, '@' and '#' then we'll have an easy way to translate 6 bits into a letter and vice versa. This is awesome because we can generate strings of 3 * 6 = 18 bits instead of 3 letters of custom alphabet and tools for generating unique randomly-looking sequences of N bits were researched extensively.

Of course, you'll have to drop strings that contain '#' and '@' but that would only make around 10% of strings so that's not much of a big deal: you generate the string, check if it contains one of filler characters and generate one more instead if it does.

Minor optimisation - you can make both characters the same, since you're dropping those strings anyway - or you can even not put them into alphabet at all, so that [] on nonexistent index would return null which, concatenated with other characters, would translate to ''. Then you simply check string's length - if it's OK, we return the string, otherwise we generate one more.

You don't really need a class for this task if your php version supports generators.

That being said, if you like your solution and simply want to know what collisions were occured for each string generated you can make your $repeats look something like:

$repeats = [
 'XBU' => ["Usm", "rl2", "mJN"],
 'GDd' => ["6cH"],
 'no7' => ["Kid", "Yjs"], 
];

Since your strings are given to be unique, keys will be unique.

To get that list, you can do this:

public static function add(&$array = [], $length = 1) {
 $hash = self::rand_str($length);
 $current_repeats = [];
 while (in_array($hash, $array, true)) {
 $current_repeats []= $hash;
 $hash = self::rand_str($length);
 }
 if ($current_repeats) {
 self::$repeats[$hash] = $current_repeats;
 }
 $array[] = $hash;
 return self::$repeats;
}

There are still many things wrong with that - do ... while loop should be used instead, underscore naming is kinda against PSR and function returns what is not really expected of it, for example.

Question 5

This was very helpful. I've tested isset() on 20k hashes it was 2x~ faster. Will definitely use that. And your form of $repeats seems better. Already implemented. About your other suggestions, I'm kind of new to both PHP and hashes. I'd really appreciate if you could elaborate on the use of generators and do...while. A code example, perhaps rewriting add(), would be great.

Daerdemandt Daerdemandt 1,0316 silver badges8 bronze badges · Accepted Answer · 2016-07-11 18:48:45Z

You use in_array. This is bad, because in_array is O(n) - it goes through all array. In similar cases you should store hashes as keys and use either isset($arr[$hash]) or array_key_exists. Both are O(1) generally but devolve into O(n) as collisions pile up. However, even then they are faster because they only mean iterating over collisions, not all elements. More info here.
You use process that allows collisions (which you don't need as I presume) and then try to solve that by storing generated strings (which would hog up memory if executed lots of times, like in your fill). Let's look deeper into that:

generate unique random strings

How random those strings need to be?

If it's something about cryptography then use of rand is a bad idea and implementing stuff of that sort manually is a bad idea too.

If it's not related to cryptography and you're OK with PRNG (rand uses one of those too) then it could be easier to simply make your own PRNG that generates random-looking unique strings.

Well, it's impossible to keep generating unique strings of some limited length indefinitely but it is possible to generate all the strings before producing first collision.

Take a look at something like this. important part is:

LFSR output streams are deterministic. If the present state and the positions of the XOR gates in the LFSR are known, the next state can be predicted.

Which means amount of information one needs to store does not grow with time. We shall use something like that too.

Actually, if we squint while looking at your alphabet we can see that it is 26 +たす 26 +たす 10 =わ 62 characters long. Which means, if we add couple more characters - say, '@' and '#' then we'll have an easy way to translate 6 bits into a letter and vice versa. This is awesome because we can generate strings of 3 * 6 = 18 bits instead of 3 letters of custom alphabet and tools for generating unique randomly-looking sequences of N bits were researched extensively.

Of course, you'll have to drop strings that contain '#' and '@' but that would only make around 10% of strings so that's not much of a big deal: you generate the string, check if it contains one of filler characters and generate one more instead if it does.

Minor optimisation - you can make both characters the same, since you're dropping those strings anyway - or you can even not put them into alphabet at all, so that [] on nonexistent index would return null which, concatenated with other characters, would translate to ''. Then you simply check string's length - if it's OK, we return the string, otherwise we generate one more.

You don't really need a class for this task if your php version supports generators.

That being said, if you like your solution and simply want to know what collisions were occured for each string generated you can make your $repeats look something like:

$repeats = [
 'XBU' => ["Usm", "rl2", "mJN"],
 'GDd' => ["6cH"],
 'no7' => ["Kid", "Yjs"], 
];

Since your strings are given to be unique, keys will be unique.

To get that list, you can do this:

public static function add(&$array = [], $length = 1) {
 $hash = self::rand_str($length);
 $current_repeats = [];
 while (in_array($hash, $array, true)) {
 $current_repeats []= $hash;
 $hash = self::rand_str($length);
 }
 if ($current_repeats) {
 self::$repeats[$hash] = $current_repeats;
 }
 $array[] = $hash;
 return self::$repeats;
}

There are still many things wrong with that - do ... while loop should be used instead, underscore naming is kinda against PSR and function returns what is not really expected of it, for example.

This was very helpful. I've tested isset() on 20k hashes it was 2x~ faster. Will definitely use that. And your form of $repeats seems better. Already implemented. About your other suggestions, I'm kind of new to both PHP and hashes. I'd really appreciate if you could elaborate on the use of generators and do...while. A code example, perhaps rewriting add(), would be great.

Stack Exchange Network

Log collisions in generating unique random strings

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Log collisions in generating unique random strings

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions