How can I use SQL only on a MySQL DB to replace the first and last names of real users in the users table with random names from two related tables, random_first_names and random_last_names?
Our users table contains over 250K records, and each of the random tables contains over 5000 names that should be picked at random for each record in the users table.
Is it possible to achieve this using SQL only?
3 Answers
If you decide that the Update is too slow, I suggest the following will be about 1000 times as fast.
Loop (can be done in a Stored Proc)...
- [Re]Create a table with a randomly ordered set of first_names, with a PRIMARY KEY of 1..5000. Ditto for last_names (a second table).
- Multi-table UPDATE the 'next' 5000 rows joined to the two random tables. Use ON Users.id % 5000 = RandomFirstNames.id (etc)
End Loop
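One pass of that loop might look like the sketch below. It is an illustration rather than the answerer's exact code: the column names, the @batch_start variable, and the RandomLastNames table name are assumptions. Note also that Users.id % 5000 yields 0..4999 while AUTO_INCREMENT ids run 1..5000, so the sketch adds 1 to the remainder so that every user finds a match.

```sql
-- Assumed (hypothetical) schema:
--   Users(id, first_name, last_name)
--   RandomFirstNames(id, first_name), RandomLastNames(id, last_name),
--   each freshly shuffled and numbered 1..5000.
SET @batch_start := 1;  -- hypothetical: first Users.id of the current batch

UPDATE Users u
JOIN RandomFirstNames f ON f.id = (u.id % 5000) + 1  -- maps 0..4999 to 1..5000
JOIN RandomLastNames  l ON l.id = (u.id % 5000) + 1
SET u.first_name = f.first_name,
    u.last_name  = l.last_name
WHERE u.id >= @batch_start
  AND u.id <  @batch_start + 5000;  -- only the 'next' 5000 rows
```

Because both random tables are re-shuffled independently before each pass, the first/last name pairing differs from batch to batch even though both joins use the same remainder.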
Shuffling the table (step 1 of the loop) is something like
CREATE TABLE RandomFirstNames (
    id SMALLINT UNSIGNED AUTO_INCREMENT,
    first_name VARCHAR(...),
    PRIMARY KEY(id) )
SELECT first_name FROM FirstNames ORDER BY RAND();
After OP's UPDATE
Don't do
SELECT count(id) INTO count_names FROM _RandomFirstNames;
Instead, do this once:
SELECT @mask_ct := COUNT(*) FROM _masked_names;
and use @mask_ct instead of count_names.
As for the skipped ids, CREATE TABLE _RandomFirstNames without an id, then ALTER TABLE _RandomFirstNames ADD id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY; to get the ids. This should give you ids without gaps (unless you are on a multi-Master cluster of any kind).
- Thank you Rick. Adding the ID after the operation did the job! Could you please clarify the benefit of using SELECT @mask_ct := COUNT(*) FROM _masked_names; instead of my current solution? In my solution I count the number of records in the names table and go through the loop until it exceeds the users' count. I'm not sure why I'll need to get the count from _masked_names, which never changes. – user1525248, Jul 6, 2018 at 15:12
- @user1525248 - Your solution involves counting the resulting table every time through the loop. My version involves a single count before starting the loop. (The difference between @variables and Declared variables is not significant.) – Rick James, Jul 6, 2018 at 16:27
You can use ORDER BY rand() in combination with LIMIT 1 to select a random row from your random names tables.
UPDATE users
   SET first_name = (SELECT name
                       FROM random_first_names
                      ORDER BY rand()
                      LIMIT 1),
       last_name = (SELECT name
                      FROM random_last_names
                     ORDER BY rand()
                     LIMIT 1);
- Thank you Sticky Bit. While your solution does the job from a functional perspective, it is not practical because it runs extremely slowly. I terminated the script after it had run for 10 minutes on a test set of 250,000 users. – user1525248, Jul 6, 2018 at 3:13
Thank you Rick and Sticky Bit for your input. Sticky Bit's solution would take too long to run. Rick's answer was the closest one, and his comments helped me create the full solution, which I'm sharing below.
First, create temporary tables to store random names
DROP TABLE IF EXISTS _RandomFirstNames;
CREATE TABLE _RandomFirstNames (first_name VARCHAR(255));
DROP TABLE IF EXISTS _RandomLastNames;
CREATE TABLE _RandomLastNames (last_name VARCHAR(255));
Then create a procedure that fills those tables with random names, making sure that we have one first name and one last name for each possible user ID.
DELIMITER $$
DROP PROCEDURE IF EXISTS prepare_randon_names$$
CREATE PROCEDURE prepare_randon_names()
BEGIN
    -- Highest user id: each random table needs at least this many rows.
    SELECT MAX(id) INTO @users FROM users;

    -- First names: count the source list once, then append shuffled
    -- copies of it until the table covers every user id.
    SELECT COUNT(*) INTO @mask_ct FROM _masked_names._firstnames;
    SET @loops := @users / @mask_ct;
    SET @count := 0;
    WHILE @count < @loops DO
        INSERT INTO _RandomFirstNames (first_name)
            SELECT firstname FROM _masked_names._firstnames ORDER BY RAND();
        SET @count := @count + 1;
    END WHILE;

    -- Last names: same approach.
    SELECT COUNT(*) INTO @mask_ct FROM _masked_names._lastnames;
    SET @loops := @users / @mask_ct;
    SET @count := 0;
    WHILE @count < @loops DO
        INSERT INTO _RandomLastNames (last_name)
            SELECT lastname FROM _masked_names._lastnames ORDER BY RAND();
        SET @count := @count + 1;
    END WHILE;
END$$
DELIMITER ;
We can now execute it and add incremental IDs to the populated tables
CALL prepare_randon_names();
ALTER TABLE _RandomFirstNames ADD id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY;
ALTER TABLE _RandomLastNames ADD id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY;
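Before running the update, it may be worth confirming that the generated ids are gapless and cover every user id. The query below is a minimal sanity check, not part of the original solution; the column aliases are only illustrative.

```sql
-- If first_name_rows = max_first_id (and likewise for last names),
-- the AUTO_INCREMENT ids have no gaps. If both counts are
-- >= max_user_id, every users.id has a matching random name.
SELECT
    (SELECT MAX(id)   FROM users)             AS max_user_id,
    (SELECT COUNT(*)  FROM _RandomFirstNames) AS first_name_rows,
    (SELECT MAX(id)   FROM _RandomFirstNames) AS max_first_id,
    (SELECT COUNT(*)  FROM _RandomLastNames)  AS last_name_rows,
    (SELECT MAX(id)   FROM _RandomLastNames)  AS max_last_id;
```

This matters because a LEFT JOIN on id would set first_name or last_name to NULL for any user whose id has no matching row in the random tables.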
Now we can update the users table by joining it with the two tables of random names that we created above
UPDATE users u
LEFT JOIN _RandomFirstNames f ON f.id = u.id
LEFT JOIN _RandomLastNames l ON l.id = u.id
SET u.first_name = f.first_name,
    u.last_name = l.last_name;
As the last step, drop the tables with random names, since we no longer need them
DROP TABLE IF EXISTS _RandomFirstNames;
DROP TABLE IF EXISTS _RandomLastNames;
Notes Summary
- Adding the primary key after the tables were populated with random names solved an issue with the auto-increment ID skipping values. For example, in _RandomFirstNames the ID increased sequentially until 5163 and then skipped to 8192 (a jump of 3,029), then increased sequentially until 13354 and skipped again by 3,029 to 16383. _RandomFirstNames was generated from _masked_names._firstnames, which contained 5163 names.
- Avoiding count(...) within the while loop reduced the run time by one second when run against the users table with 250,000 records.