Random value duplicated, cannot select N random pairings, subquery not refreshing - sqlite

Question 1

The Problem

Using SQLite v3.35.4 and v3.36.0, I have a first_name table and a surname table, each holding a list of common names. I want to generate N random pairings and insert them into a new table.

I wrote this recursive query:

WITH RECURSIVE
cte(first_name, surname) AS (
 SELECT first_name, surname from ( -- always returns the same value
 select first_name, surname from (select first_name from first_name order by random() limit 1)
 join (select surname from surname order by random() limit 1)
 )
 UNION ALL
 SELECT first_name, surname
 FROM cte
 LIMIT 2000
)
SELECT first_name, surname FROM cte;

Unfortunately the output looks like this:

+------------+---------+
| first_name | surname |
+------------+---------+
| james      | smith   |
| james      | smith   |
| james      | smith   |
| ...        | ...     |
+------------+---------+

What I've tried

After reviewing the SQLite documentation, I tried NOT MATERIALIZED on the recursive CTE and several of the conditions outlined in the Subquery Flattening section. I also put the random name selection in a view. None of this changed the results.

Is there a way to perform what I'm trying to do?

Edit

I also tried a window function, picking the names at random through a WHERE clause, with no success (1998 is the number of rows in the table):

with recursive
r_first_name as (
 select first_name, ROW_NUMBER() over(order by random()) as rn from first_name
),
r_surname as (
 select surname, ROW_NUMBER() over(order by random()) as rn from surname
),
rcte(first_name, surname) as (
 select first_name, surname from r_first_name rf
 join r_surname rs on rs.rn = (select abs(random() % 1998))
 where rf.rn = (select abs(random() % 1998))
 union all
 select first_name, surname from rcte
 limit 3000
)
select * from rcte

!!! Solution !!!

After reviewing this answer to a similar problem, I discovered that a random() call in the recursive member of the CTE is re-evaluated for every new row. Unfortunately, it is not re-evaluated when it is nested in a subquery, but when it sits at the "root" of the recursive SELECT I can use it to draw a fresh random number on each iteration.
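A minimal sketch of that observation, run through Python's built-in sqlite3 module so it is self-contained. The table contents and the row count of 50 are made up for illustration; only the shape of the two queries comes from the post. In the first query the recursive member merely re-selects the previous row, so the random pick happens once. In the second, random() sits directly in the recursive member's select list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE first_name (first_name TEXT);
CREATE TABLE surname (surname TEXT);
INSERT INTO first_name VALUES ('james'), ('mary'), ('john'), ('linda');
INSERT INTO surname VALUES ('smith'), ('jones'), ('brown'), ('kincaid');
""")

# The recursive member copies the previous row, so the random pick is made once.
copied = conn.execute("""
WITH RECURSIVE cte(first_name, surname) AS (
  SELECT first_name, surname
  FROM (SELECT first_name FROM first_name ORDER BY random() LIMIT 1)
  JOIN (SELECT surname FROM surname ORDER BY random() LIMIT 1)
  UNION ALL
  SELECT first_name, surname FROM cte
  LIMIT 50
)
SELECT first_name, surname FROM cte
""").fetchall()

# random() in the recursive member's select list is evaluated for every new row.
fresh = conn.execute("""
WITH RECURSIVE cte(n) AS (
  SELECT random()
  UNION ALL
  SELECT random() FROM cte
  LIMIT 50
)
SELECT n FROM cte
""").fetchall()

print(len(set(copied)), "distinct pair(s) in the first query")
print(len(set(fresh)), "distinct value(s) in the second query")
```

The second query only produces numbers; they still have to be mapped onto rows of the name tables.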

Below is the solution I developed. It meets my specific use case and is relatively performant compared to a cross join:

WITH RECURSIVE
cte AS (
 -- anchor row; "+ 1" maps the draw onto 1..count(*), the range ROW_NUMBER() produces
 SELECT abs(random()) % (SELECT count(*) FROM first_name) + 1 AS first_name_num,
        abs(random()) % (SELECT count(*) FROM surname) + 1 AS surname_num
 UNION ALL
 -- random() at the top level of the recursive member is re-evaluated for every row
 SELECT abs(random()) % (SELECT count(*) FROM first_name) + 1,
        abs(random()) % (SELECT count(*) FROM surname) + 1
 FROM cte
 LIMIT 6000
),
result AS (
 SELECT fn.first_name, sn.surname
 FROM cte
 -- observed: the random() inside these numbering subqueries does not refresh per row,
 -- so each table keeps one fixed numbering for the whole query
 JOIN (SELECT first_name, ROW_NUMBER() OVER (ORDER BY random()) AS rn FROM first_name) fn
   ON cte.first_name_num = fn.rn
 JOIN (SELECT surname, ROW_NUMBER() OVER (ORDER BY random()) AS rn FROM surname) sn
   ON cte.surname_num = sn.rn
)
SELECT first_name, surname FROM result;
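For anyone who wants to try this end to end, here is a hedged sketch using Python's built-in sqlite3 module. The sample names, the target table `pairing`, and the value of `N` are assumptions for illustration. It numbers rows by rowid instead of by random() (the draw itself is already random), and it adds 1 to each draw because ROW_NUMBER() starts at 1 while `abs(random()) % count` starts at 0; without that shift, draws of 0 would match no row and be dropped by the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE first_name (first_name TEXT);
CREATE TABLE surname (surname TEXT);
CREATE TABLE pairing (first_name TEXT, surname TEXT);  -- hypothetical target table
INSERT INTO first_name VALUES ('james'), ('mary'), ('john'), ('linda');
INSERT INTO surname VALUES ('smith'), ('jones'), ('brown');
""")

N = 200  # number of pairings to generate (made up)

conn.execute("""
WITH RECURSIVE
draws(first_name_num, surname_num) AS (
  -- "+ 1" shifts each draw from 0..count-1 to 1..count
  SELECT abs(random()) % (SELECT count(*) FROM first_name) + 1,
         abs(random()) % (SELECT count(*) FROM surname) + 1
  UNION ALL
  SELECT abs(random()) % (SELECT count(*) FROM first_name) + 1,
         abs(random()) % (SELECT count(*) FROM surname) + 1
  FROM draws
  LIMIT :n
)
INSERT INTO pairing (first_name, surname)
SELECT fn.first_name, sn.surname
FROM draws
JOIN (SELECT first_name, ROW_NUMBER() OVER (ORDER BY rowid) AS rn FROM first_name) fn
  ON draws.first_name_num = fn.rn
JOIN (SELECT surname, ROW_NUMBER() OVER (ORDER BY rowid) AS rn FROM surname) sn
  ON draws.surname_num = sn.rn
""", {"n": N})

count = conn.execute("SELECT count(*) FROM pairing").fetchone()[0]
rows = conn.execute("SELECT first_name, surname FROM pairing").fetchall()
print(count, "rows inserted")
```

Because each pairing is drawn independently, the same pair can appear more than once, unlike the cross join approach.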

Answer

"I want to produce N number of pairings into a new table." Why not simply use a CROSS JOIN with a LIMIT clause, like so?

SELECT fn.first_name, sn.surname
FROM first_name fn
CROSS JOIN surname sn
LIMIT 2000;

This should produce 2,000 distinct first_name and surname combinations (assuming neither table contains duplicates). It isn't really random, though: with no ORDER BY clause the row order is simply unspecified, and in practice you will tend to get a few names from one table paired with a long run of names from the other. But it's a much simpler query, especially if you only need to load a table once.

You could also add an ORDER BY clause with the RANDOM() function to improve the randomness of the query above, like so:

SELECT fn.first_name, sn.surname
FROM first_name fn
CROSS JOIN surname sn
ORDER BY RANDOM()
LIMIT 2000;

My intuition tells me this isn't perfectly random either but should be a lot better than my first query, if you do need some randomness.
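To compare the two queries, here is a small sketch using Python's built-in sqlite3 module. The generated names (`first0`, `last0`, ...) and the sizes are made up. It counts how many distinct first names each query returns, which gives a rough feel for how well mixed the result is. Note that the ORDER BY RANDOM() version has to visit every row of the cross product before it can apply the LIMIT, which for two tables of about 1998 rows each is roughly four million rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE first_name (first_name TEXT)")
conn.execute("CREATE TABLE surname (surname TEXT)")
# Made-up data: 40 distinct names per table.
conn.executemany("INSERT INTO first_name VALUES (?)", [(f"first{i}",) for i in range(40)])
conn.executemany("INSERT INTO surname VALUES (?)", [(f"last{i}",) for i in range(40)])

N = 30

# No ORDER BY: rows come back in whatever order the join happens to produce them.
plain = conn.execute("""
SELECT fn.first_name, sn.surname
FROM first_name fn CROSS JOIN surname sn
LIMIT ?
""", (N,)).fetchall()

# ORDER BY random(): a random sample of N distinct pairs from the cross product.
shuffled = conn.execute("""
SELECT fn.first_name, sn.surname
FROM first_name fn CROSS JOIN surname sn
ORDER BY random()
LIMIT ?
""", (N,)).fetchall()

def distinct_first_names(rows):
    """Count how many different first names appear in a list of (first, last) rows."""
    return len({first for first, _ in rows})

print("plain LIMIT:      ", distinct_first_names(plain), "distinct first names")
print("ORDER BY random():", distinct_first_names(shuffled), "distinct first names")
```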

Note: I'd also recommend not giving columns the exact same name as the tables they belong to. This can be syntactically confusing and hurts readability. Personally I'd have a single names table with a field to denote the type of name it is (first name vs. surname, etc.). But if you want two explicit tables, then I'd just name the column generically, such as name, since it's self-explanatory enough when you select first_name.name.
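A rough sketch of the single-table idea, again via Python's built-in sqlite3 module. The table name `names`, the column `kind`, and its two allowed values are hypothetical choices, not something taken from the question's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- hypothetical schema: one table, with a column saying what kind of name each row holds
CREATE TABLE names (
  name TEXT NOT NULL,
  kind TEXT NOT NULL CHECK (kind IN ('first', 'surname')),
  UNIQUE (name, kind)
);
INSERT INTO names VALUES
  ('james', 'first'), ('mary', 'first'), ('john', 'first'),
  ('smith', 'surname'), ('jones', 'surname');
""")

# Join the table to itself, filtering each side by the kind of name it should supply.
pairs = conn.execute("""
SELECT f.name AS first_name, s.name AS surname
FROM names f
CROSS JOIN names s
WHERE f.kind = 'first' AND s.kind = 'surname'
ORDER BY random()
LIMIT 4
""").fetchall()

for first, last in pairs:
    print(first, last)
```

The aliases `f` and `s` do the job that the two separate table names did before.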

Comment 1

I guess the expectation is that I'd eventually want to weight the random selection with a normal distribution, so that I'd have thousands more smiths than kincaids. While your solution produces a homogeneous mix, it will work for testing my use cases.
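One possible way to get that kind of weighting, sketched with Python's built-in sqlite3 module. Everything here is an assumption for illustration: the integer `weight` column, the sample weights, and the CTE names `bands` and `draws`. Each surname is given a range of integers as wide as its weight, and each random draw picks whichever surname owns the number drawn. It reuses the random()-in-the-recursive-member trick and samples with replacement.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- hypothetical: an integer weight per surname (made-up values)
CREATE TABLE surname (surname TEXT PRIMARY KEY, weight INTEGER NOT NULL);
INSERT INTO surname VALUES ('smith', 1000), ('jones', 500), ('kincaid', 1);
""")

N = 3000  # number of weighted draws (made up)

rows = conn.execute("""
WITH RECURSIVE
bands AS (
  -- each surname owns the half-open range [lo, hi), whose width equals its weight
  SELECT surname,
         SUM(weight) OVER (ORDER BY surname) - weight AS lo,
         SUM(weight) OVER (ORDER BY surname) AS hi
  FROM surname
),
draws(n) AS (
  -- one uniform draw in 0..total_weight-1 per row
  SELECT abs(random()) % (SELECT SUM(weight) FROM surname)
  UNION ALL
  SELECT abs(random()) % (SELECT SUM(weight) FROM surname) FROM draws
  LIMIT :n
)
SELECT b.surname
FROM draws d
JOIN bands b ON d.n >= b.lo AND d.n < b.hi
""", {"n": N}).fetchall()

counts = Counter(name for (name,) in rows)
for name, total in counts.most_common():
    print(name, total)
```

The same approach would apply to first names; the pair of draws could then be joined as in the solution in the question.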

Comment 2

@NathanGoings Sounds good. I did just update my answer with a way to improve the randomness if that's of any benefit too.

Comment 3

@JD, I discovered the solution and updated my post. I've gone ahead and accepted your answer as it's another valid solution.

J.D. · Accepted Answer · 2021-11-15 00:00:26Z
