I've created a "simple" function that receives a MySQLi resource and, using a custom where query, will fetch random IDs from a provided table.
Some caveats on this:
- No object-style! This needs to be understood by people who only know the old
mysql_*procedural API; - No
order by rand()! It is slow, but very tempting; - It has to allow me to fetch a fixed number of IDs, or all available IDs;
- It must always return something, even if it is a simple array;
- It is required to allow for a table name to be passed;
- A custom
whereis required, since I need to provide complex queries; - It doesn't really matter if it is a truly random function: pseudo-random is perfectly fine;
- No prepared statements: this function will be used alongside some prepared statements. Those will give errors like "Commands out of sync; you can't run this command now";
- The code needs to run on PHP 5.3.29.
And here's the code:
<?PHP
function get_random_ids_from_table(&$link, $table, $where = null, $length = -1)
{
$ids_r = mysqli_query($link, 'select id from `' . $table . '`' . ($where ? ' where ' . $where : ''));
if(!$ids_r)
{
return array();
}
$count = mysqli_num_rows($ids_r);
if(!$count)
{
return array();
}
$ids = array();
if($length < 1 || $length > $count)
{
while($id = mysqli_fetch_assoc($ids_r))
{
$ids[] = $id['id'];
}
mysqli_free_result($ids_r);
shuffle($ids);
}
else
{
while(count($ids) < $length)
{
$offset = mt_rand(0, $count - 1);
if(!isset($ids[$offset]))
{
mysqli_data_seek($ids_r, $offset);
$id = mysqli_fetch_assoc($ids_r);
$ids[$offset] = $id['id'];
}
}
$ids = array_values($ids);
}
mysqli_free_result($ids_r);
return $ids;
}
This function works perfectly fine as-is, but it has a problem: it's still slow! It takes 500-1400ms when executed 4 times! I've already reduced its execution time from 3s. (Old version, for reference: http://pastebin.com/ASfBfWmg). And this is to select IDs from 500 rows. I'm expecting that the code runs on 2000+ rows.
How can I improve its horrendous speed? And which other changes you think that are necessary to make it more readable?
1 Answer 1
In the comments section, we talked about options around the DB queries. It sounds like ORDER BY rand() may not be suitable for your use case in dealing with those execution times (which still seem odd to me given size of dataset).
So let's assume rather than look at optimizing that randomized query (again there are many good examples of this out there), or caching query results locally in some manner that you find appropriate, we are just going to look at simplifying your code, but sticking with the approach of randomizing the id's in the application.
I would suggest that you test through both sides of your if-else conditional to understand the relative completion speed of the "download all ids and shuffle" vs. the "random row retrieval" side of the conditional. Ask yourself things like:
- Do you really need both sides of this conditional?
- What are the actual use cases of your application?
- Are you primarily going to be grabbing one or a few rows at a time vs. grabbing entire set of id's?
This may influence your way of thinking about this section of code. It might even, for example, prompt you to split this function with one function being called to get full result set and another function being called to get partial result set.
You may want to look at randomization approach in the data seek side of the conditional. If you were trying to retrieve a substantial portion of (but not full) result set, you could waste a lot of time in the while loop trying to find an offset that has not already been set in the result array.
A better approach to randomization might be:
// generate array of offsets to search by getting
// $length number of random offset keys
// assume $length has already been verified to be <= $count
// not shown - you might check whether sorting $target_offsets helps with access time at all
$target_offsets = array_rand(range(0, $count - 1), $length);
for ($i = 0; $i < $length; i++) {
// seek record and read into $ids array
mysqli_data_seek($ids_r, $target_offsets[$i]);
$id = mysqli_fetch_assoc($ids_r);
$ids[] = $id['id'];
}
Regardless of approach here, you need to understand if at some percentage of full result set, it might be faster to retrieve full array, shuffle and slice to get the desired result vs. random row retrieval.
A few other thoughts in looking at your code:
- Always strive to use meaningful variables names.
Consider these examples:
$ids_r => $id_results
$link => $mysqli_link
$length => $record_limit
These immediately give code reader exact understanding of what is stored in the variable.
- You are doing nothing to validate the input to your function. You shoudd consider validating, for example that a valid mysqli resource, non-zero length string for table name and where clause, and integer value for record limit are passed. Throw exception of fail out quickly if these conditions are not met.
- Your code has no comments. You should always consider appropriate comments to make your code more readable.
- Speaking of more readable code. You should strive to keep your code < 80 chraracters per line. You only have one questionable example in your code:
Here:
$ids_r = mysqli_query($link, 'select id from `' . $table . '`' . ($where ? ' where ' . $where : ''));
which could be made more readable like:
$where = $where ? ' WHERE ' . $where : '';
$sql = 'SELECT id FROM `' . $table . '`' . $where;
$id_results = mysqli_query($link, $sql);
-
\$\begingroup\$ Won't your alternative method retrieve repeated IDs? Also,
$id_resultsis not a good name.$ids_resultwould be a better name, since you're retreaving multiple IDs with a single result. I do agree about$mysqli_linkand$record_limit. I don't validate the input because the input is all done by me. It is guaranteed that the input is withing spec. Always. But a basic canity check might not be a bad idea. \$\endgroup\$Ismael Miguel– Ismael Miguel2016年10月19日 20:43:47 +00:00Commented Oct 19, 2016 at 20:43 -
\$\begingroup\$ @IsmaelMiguel The means by which
$target_offsetsis defined will not retrieve duplicate offset keys.array_rand()will generate an array of$lengthelements. Each of those elements will be hold a unique key value from the array created fromrange()(where the keys actually match the values). There are other ways to approach this problem as well (like shuffling the array built fromrange()and simply grabbing first$lengthitems, so if you are not happy with "randomness" generated byarray_rand()you can certainly tweak these couple lines of code to an alternate implementation. \$\endgroup\$Mike Brant– Mike Brant2016年10月19日 21:27:11 +00:00Commented Oct 19, 2016 at 21:27 -
\$\begingroup\$ Isn't it better to use a
foreach, like on here: pastebin.com/xZPE8DuE ? \$\endgroup\$Ismael Miguel– Ismael Miguel2016年10月19日 21:53:58 +00:00Commented Oct 19, 2016 at 21:53 -
\$\begingroup\$ @IsmaelMiguel You could just as easily use
forEach(). Whether it is "better" or not might depend on who you ask, as there are some who tend to get hung on on some known use cases whereforEach()does not perform as well as simple loop. A lot of time the decision may also come down to what sort of scope context you need and whether you need to break from the loop at some point (neither of which should be concern here). \$\endgroup\$Mike Brant– Mike Brant2016年10月19日 22:20:06 +00:00Commented Oct 19, 2016 at 22:20 -
\$\begingroup\$ I can see what you mean. I don't know why but the
foreachlooks cleaner. A lot cleaner. I will have to try and see which one is "the best" on my system. \$\endgroup\$Ismael Miguel– Ismael Miguel2016年10月20日 07:55:02 +00:00Commented Oct 20, 2016 at 7:55
mysqli_data_seek(), which may require that all the rows are fetched and kept somewhere. \$\endgroup\$SELECT ... ORDER BY RAND() ... LIMIT ? ...query would greatly simplify your code. Try it out to see what query times you can expect from the knowingly sub-optimal randomized query. You might find it is reasonable compared to your current approach. Your number of rows should not be a problem at all for a randomized query to give "reasonable" performance. \$\endgroup\$