Replace users' names with random names

Question 1

How can I use SQL only on a MySQL DB to replace first and last names of real users in the users table using a random name from two related tables: random_first_names and random_last_names?

Our users table contains over 250K records and each of the random tables contains over 5000 names that should be picked at random for each record in the users table.

Is it possible to achieve this using SQL only?

Answer 1

If you decide that the UPDATE is too slow, I suggest the following, which will be about 1000 times as fast.

Loop (can be done in a stored procedure)...

1. [Re]create a table holding the first names in random order, with a PRIMARY KEY of 1..5000. Ditto for the last names (a second table).
2. Multi-table UPDATE the 'next' 5000 rows, joined to the two random tables. Use ON (Users.id % 5000) + 1 = RandomFirstNames.id (etc.); the + 1 is needed because the modulo yields 0..4999 while the keys run from 1 to 5000.

End Loop

Shuffling the table (step 1 of the loop) is something like

CREATE TABLE RandomFirstNames (
 id SMALLINT UNSIGNED AUTO_INCREMENT,
 first_name VARCHAR(...),
 PRIMARY KEY(id) )
SELECT first_name FROM FirstNames ORDER BY RAND();
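Step 2 of the loop could look roughly like the sketch below. It is only an illustration under stated assumptions: the user variable @lo (the first id of the current block), the table RandomLastNames with a last_name column, and a fixed list size of exactly 5000 are all hypothetical, and the + 1 maps the remainder 0..4999 onto the keys 1..5000.

```sql
-- One pass of the loop: rename the users whose ids fall in [@lo, @lo + 5000).
-- @lo is a hypothetical variable that the surrounding loop would advance by 5000.
SET @lo := 1;

UPDATE users AS u
JOIN RandomFirstNames AS f ON f.id = (u.id % 5000) + 1
JOIN RandomLastNames  AS l ON l.id = (u.id % 5000) + 1
SET u.first_name = f.first_name,
    u.last_name  = l.last_name
WHERE u.id >= @lo
  AND u.id <  @lo + 5000;
```

Within one block of 5000 consecutive ids every remainder occurs at most once, so each shuffled name is used at most once per pass. Because the two tables are shuffled independently, joining both on the same key still pairs first and last names at random.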

After OP's UPDATE

Don't do

SELECT count(id) INTO count_names FROM _RandomFirstNames;

Instead, do this once:

SELECT @mask_ct := COUNT(*) FROM _masked_names;

and use @mask_ct instead of count_names.

As for the skipped ids, CREATE TABLE _RandomFirstNames without an id, then ALTER TABLE _RandomFirstNames ADD id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY; to get the ids. This should give you ids without gaps (unless you are on a multi-Master cluster of any kind).
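One way to confirm that the ids really came out without gaps is a quick aggregate check. This is a sketch that assumes the table and column names used in this answer:

```sql
-- If the ids run 1..N without gaps, min_id is 1 and max_id equals row_count.
SELECT COUNT(*) AS row_count,
       MIN(id)  AS min_id,
       MAX(id)  AS max_id
FROM _RandomFirstNames;
```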

Comment

Thank you, Rick. Adding the ID after the operation did the job! Could you please clarify the benefit of using SELECT @mask_ct := COUNT(*) FROM _masked_names; instead of my current solution? In my solution I count the number of records in the names table and go through the loop until it exceeds the users' count. I'm not sure why I need to get the count from _masked_names, which never changes.

Comment

@user1525248 - Your solution involves counting the resulting table every time through the loop. My version involved a single count before starting the loop. (The difference between @variables and Declared variables is not significant.)

Answer 2

You can use ORDER BY rand() in combination with LIMIT 1 to select a random row from each of your random-name tables.

UPDATE users
   SET first_name = (SELECT name
                       FROM random_first_names
                      ORDER BY rand()
                      LIMIT 1),
       last_name  = (SELECT name
                       FROM random_last_names
                      ORDER BY rand()
                      LIMIT 1);

Comment

Thank you, Sticky Bit. While your solution does the job from a functional perspective, it is not practical because it runs extremely slowly. I terminated the script after waiting 10 minutes on a test set of 250,000 users.

Answer 3

Thank you, Rick and Sticky Bit, for your input. Sticky Bit's solution would take too long to run. Rick's answer was the closest, and his comments helped me create the full solution, which I'm sharing below.

First, create temporary tables to store random names

DROP TABLE IF EXISTS _RandomFirstNames;
CREATE TABLE _RandomFirstNames (first_name VARCHAR(255));
DROP TABLE IF EXISTS _RandomLastNames;
CREATE TABLE _RandomLastNames (last_name VARCHAR(255));

Then create a procedure that fills those tables with random names, so that there is one first name and one last name for every possible user ID.

DELIMITER $$
DROP PROCEDURE IF EXISTS prepare_randon_names$$
CREATE PROCEDURE prepare_randon_names()
BEGIN
  -- Highest user ID: each random-name table needs at least this many rows.
  SELECT MAX(id) INTO @users FROM users;

  -- First names: count the source list once, then append full shuffled
  -- copies of it until the row count reaches the highest user ID.
  SELECT COUNT(*) INTO @mask_ct FROM _masked_names._firstnames;
  SET @loops := @users / @mask_ct;
  SET @count := 0;
  WHILE @count < @loops DO
    INSERT INTO _RandomFirstNames (first_name)
      SELECT firstname FROM _masked_names._firstnames ORDER BY RAND();
    SET @count := @count + 1;
  END WHILE;

  -- Last names: same approach with the last-name source list.
  SELECT COUNT(*) INTO @mask_ct FROM _masked_names._lastnames;
  SET @loops := @users / @mask_ct;
  SET @count := 0;
  WHILE @count < @loops DO
    INSERT INTO _RandomLastNames (last_name)
      SELECT lastname FROM _masked_names._lastnames ORDER BY RAND();
    SET @count := @count + 1;
  END WHILE;
END$$
DELIMITER ;
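As a rough worked example of how many passes the WHILE loop makes: the figures below assume a highest users.id of 250,000 (a hypothetical round number; the question only says "over 250K records") and the 5163 first names mentioned in the notes at the end of this answer.

```sql
-- The loop runs while @count < @users / @mask_ct, i.e. CEILING(@users / @mask_ct) times.
SELECT 250000 / 5163                 AS loops,           -- about 48.42
       CEILING(250000 / 5163)        AS passes,          -- 49
       CEILING(250000 / 5163) * 5163 AS rows_generated;  -- 252987
```

Each pass appends one full shuffled copy of the source list, so the table ends up with slightly more rows than the highest user ID, which is what the later join on id relies on.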

We can now execute the procedure and add incremental IDs to the populated tables:

CALL prepare_randon_names();
ALTER TABLE _RandomFirstNames ADD id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY;
ALTER TABLE _RandomLastNames ADD id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY;
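Before touching the users table, it may be worth checking that every user ID has a matching row in both random-name tables, because a LEFT JOIN in an UPDATE would set the names of unmatched users to NULL. A minimal sketch, using only the tables created above:

```sql
-- Expect 0: every users.id should find a row in both random-name tables.
SELECT COUNT(*) AS users_without_match
FROM users AS u
LEFT JOIN _RandomFirstNames AS f ON f.id = u.id
LEFT JOIN _RandomLastNames  AS l ON l.id = u.id
WHERE f.id IS NULL
   OR l.id IS NULL;
```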

Now we can update the users table with the random names by joining it to the two tables that we created above:

UPDATE users u
  LEFT JOIN _RandomFirstNames f ON f.id = u.id
  LEFT JOIN _RandomLastNames l ON l.id = u.id
   SET u.first_name = f.first_name,
       u.last_name = l.last_name;

As the last step, drop the random-name tables, since we no longer need them:

DROP TABLE IF EXISTS _RandomFirstNames;
DROP TABLE IF EXISTS _RandomLastNames;

Notes Summary

Adding the primary key after the tables were populated with random names solved a problem with gaps in the AUTO_INCREMENT IDs. For example, in _RandomFirstNames the ID increased sequentially up to 5163 and then skipped to 8192 (a jump of 3,029); it then increased sequentially up to 13354 and skipped again by 3,029 to 16383. _RandomFirstNames was generated from _masked_names._firstnames, which contained 5163 names, so each gap followed one INSERT ... SELECT of the full list. (This pattern is consistent with InnoDB reserving AUTO_INCREMENT values in blocks during bulk inserts.)
Avoiding COUNT(...) inside the WHILE loop made the script one second faster when run against a users table with 250,000 records.

Accepted answer: Rick James, 2018-06-28 16:51:26Z (the answer beginning "If you decide that the Update is too slow").
