I've created a small class to generate unique random strings. It works fine. I also want to log the collisions of the generated strings. I check if the generated string is in the array for uniquness:
public static function add(&$array = [], $length = 1) {
$hash = self::rand_str($length);
while (in_array($hash, $array, true)) {
$hash = self::rand_str($length);
}
$array[] = $hash;
}
To log the collisions, I need to modify the while
loop. I can push the hash
value into an array in the while
loop to log the collisions, but I want something more sophisticated.
public static function add(&$array = [], $length = 1) {
$hash = self::rand_str($length);
while (in_array($hash, $array, true)) {
self::$repeats[] = $hash;
$hash = self::rand_str($length);
}
$array[] = $hash;
return self::$repeats;
}
For example, it might generate a few non-unique strings in a row, then while
will run a few times. I want to log how many times while ran and what was the new string. The code above would output something like this:
$repeats = ["Usm", "rl2", "mJN", "XBU"];
This doesn't tell much. Have the collisions took place one right after another (in the same add()
)? What was the new unique random string that substituted the collision?
I want an output like this:
$repeats = [
// "while" ran count($repeats[i][0]) times
// repeats new value
[ ["Usm", "rl2", "mJN"], "XBU"], // add()
[ ["6cH" ], "GDd"], // add()
[ ["Kid", "Yjs" ], "no7"], // add()
];
I've created a method to log the collisions but it needs further work. I couldn't handle the forming of $repeats
array.
public static function log_repeat($hash, $new = false) {
if ($new) {
self::$repeats[count(self::$repeats)-1][] = $hash;
} else {
if (count(self::$repeats) == 0) {
self::$repeats[] = [[$hash]];
} else {
self::$repeats[count(self::$repeats)-1][0][] = $hash;
}
}
}
public static function add(&$array = [], $length = 1) {
$hash = self::rand_str($length);
$repeat_occured = false;
while (in_array($hash, $array, true)) {
$repeat_occured = true;
self::log_repeat($hash);
$hash = self::rand_str($length);
}
if ($repeat_occured) {
self::log_repeat($hash, true);
}
$array[] = $hash;
return self::$repeats;
}
This just pushes all the collisions into a single array, and unique strings are added next to each other:
$repeats = [
[ ["URF", "9IV", "2Z3", "STj", "z1T"], "v4Q", "f9S", "6T3", "6jy" ]
];
You can't tell which new string substituted which repeat. Each add()
should have its own row.
I know I need to modify add()
and log()
, but I'm stuck atm.
Full code:
class Hash {
public static $char_set = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";
public static $perm_limit = 0;
public static $repeats = [];
public static function rand_str($length = 1) {
$string = "";
for ($i = 0; $i < $length; $i++) {
$rand = rand(0, strlen(self::$char_set)-1);
$string .= self::$char_set[$rand];
}
return $string;
}
public static function log_repeat($hash, $new = false) {
if ($new) {
self::$repeats[count(self::$repeats)-1][] = $hash;
} else {
if (count(self::$repeats) == 0) {
self::$repeats[] = [[$hash]];
} else {
self::$repeats[count(self::$repeats)-1][0][] = $hash;
}
}
}
public static function add(&$array = [], $length = 1) {
$hash = self::rand_str($length);
$repeat_occured = false;
while (in_array($hash, $array, true)) {
$repeat_occured = true;
self::log_repeat($hash);
$hash = self::rand_str($length);
}
if ($repeat_occured) {
self::log_repeat($hash, true);
}
$array[] = $hash;
return self::$repeats;
}
public static function fill(&$array = [], $count = 1, $length = 1) {
self::$perm_limit = pow(strlen(self::$char_set), $length);
for ($i = 0; $i < $count; $i++) {
if (count($array) >= self::$perm_limit) {
break;
}
self::add($array, $length);
}
return self::$repeats;
}
}
Test case:
$hash_list = []
$repeats = Hash::fill($hash_list, 1000, 3);
var_dump($repeats);
var_dump($hash_list);
1 Answer 1
You use
in_array
. This is bad, becausein_array
is O(n) - it goes through all array. In similar cases you should store hashes as keys and use eitherisset($arr[$hash])
orarray_key_exists
. Both are O(1) generally but devolve into O(n) as collisions pile up. However, even then they are faster because they only mean iterating over collisions, not all elements. More info here.You use process that allows collisions (which you don't need as I presume) and then try to solve that by storing generated strings (which would hog up memory if executed lots of times, like in your
fill
). Let's look deeper into that:
generate unique random strings
How random those strings need to be?
If it's something about cryptography then use of rand
is a bad idea and implementing stuff of that sort manually is a bad idea too.
If it's not related to cryptography and you're OK with PRNG (rand
uses one of those too) then it could be easier to simply make your own PRNG that generates random-looking unique strings.
Well, it's impossible to keep generating unique strings of some limited length indefinitely but it is possible to generate all the strings before producing first collision.
Take a look at something like this. important part is:
LFSR output streams are deterministic. If the present state and the positions of the XOR gates in the LFSR are known, the next state can be predicted.
Which means amount of information one needs to store does not grow with time. We shall use something like that too.
Actually, if we squint while looking at your alphabet we can see that it is 26 +たす 26 +たす 10 =わ 62 characters long. Which means, if we add couple more characters - say, '@' and '#' then we'll have an easy way to translate 6 bits into a letter and vice versa. This is awesome because we can generate strings of 3 * 6 = 18 bits instead of 3 letters of custom alphabet and tools for generating unique randomly-looking sequences of N bits were researched extensively.
Of course, you'll have to drop strings that contain '#' and '@' but that would only make around 10% of strings so that's not much of a big deal: you generate the string, check if it contains one of filler characters and generate one more instead if it does.
Minor optimisation - you can make both characters the same, since you're dropping those strings anyway - or you can even not put them into alphabet at all, so that []
on nonexistent index would return null
which, concatenated with other characters, would translate to ''
. Then you simply check string's length - if it's OK, we return the string, otherwise we generate one more.
You don't really need a class for this task if your php version supports generators.
That being said, if you like your solution and simply want to know what collisions were occured for each string generated you can make your $repeats
look something like:
$repeats = [
'XBU' => ["Usm", "rl2", "mJN"],
'GDd' => ["6cH"],
'no7' => ["Kid", "Yjs"],
];
Since your strings are given to be unique, keys will be unique.
To get that list, you can do this:
public static function add(&$array = [], $length = 1) {
$hash = self::rand_str($length);
$current_repeats = [];
while (in_array($hash, $array, true)) {
$current_repeats []= $hash;
$hash = self::rand_str($length);
}
if ($current_repeats) {
self::$repeats[$hash] = $current_repeats;
}
$array[] = $hash;
return self::$repeats;
}
There are still many things wrong with that - do ... while
loop should be used instead, underscore naming is kinda against PSR and function returns what is not really expected of it, for example.
-
\$\begingroup\$ This was very helpful. I've tested
isset()
on 20k hashes it was 2x~ faster. Will definitely use that. And your form of$repeats
seems better. Already implemented. About your other suggestions, I'm kind of new to both PHP and hashes. I'd really appreciate if you could elaborate on the use of generators anddo...while
. A code example, perhaps rewritingadd()
, would be great. \$\endgroup\$akinuri– akinuri2016年07月11日 20:26:01 +00:00Commented Jul 11, 2016 at 20:26
[1,2,3]
. I generate a new number, it's1
. It collides. I generate another one,2
. Collides again. I generate4
. Doesn't collide. Added to the array:[1,2,3,4]
. I generate5
. Doesn't collide. Added to the array:[1,2,3,4,5]
. Now,1
and2
were generated sequentially (right after each other, in the sameadd()
). So the output row would be[[1,2], 4]
.5
was added in anotheradd()
. \$\endgroup\$