The Problem
Using SQLite v3.35.4 and v3.36.0, I have a first_name table and a surname table that each hold a list of common names. I want to produce N pairings into a new table.
I wrote this recursive query:
WITH RECURSIVE
cte(first_name, surname) AS (
SELECT first_name, surname from ( -- always returns the same value
select first_name, surname from (select first_name from first_name order by random() limit 1)
join (select surname from surname order by random() limit 1)
)
UNION ALL
SELECT first_name, surname
FROM cte
LIMIT 2000
)
SELECT first_name, surname FROM cte;
Unfortunately the output looks like this:
+------------+---------+
| first_name | surname |
+------------+---------+
| james      | smith   |
| james      | smith   |
| james      | smith   |
| ...        | ...     |
+------------+---------+
What I've tried
After reviewing the SQLite documentation, I tried NOT MATERIALIZED on the recursive CTE, as well as several of the conditions outlined in the Subquery Flattening section. I also put the random name selection in a view. However, none of it has positively affected the results.
Is there a way to perform what I'm trying to do?
*Edit
I tried a window function and selecting the names randomly in a WHERE clause, with no success (1998 is the size of the table):
with recursive
r_first_name as (
select first_name, ROW_NUMBER() over(order by random()) as rn from first_name
),
r_surname as (
select surname, ROW_NUMBER() over(order by random()) as rn from surname
),
rcte(first_name, surname) as (
select first_name, surname from r_first_name rf
join r_surname rs on rs.rn = (select abs(random() % 1998))
where rf.rn = (select abs(random() % 1998))
union all
select first_name, surname from rcte
limit 3000
)
select * from rcte
!!! Solution !!!
After reviewing this answer on a similar problem, I discovered that a random() call on the recursive side of the CTE is re-evaluated for every generated row. Unfortunately, it is not re-evaluated when nested in a subquery, but if it sits at the "root" of the recursive SELECT, I can use it to grab a fresh random number each time.
Below is the solution I developed. It meets my specific use case and is relatively performant compared to a cross join:
WITH RECURSIVE
cte AS (
-- + 1 because ROW_NUMBER() starts at 1, while % yields 0 .. count - 1
select abs(random()) % (select count(*) from first_name) + 1 as first_name_num, abs(random()) % (select count(*) from surname) + 1 as surname_num
union all
select abs(random()) % (select count(*) from first_name) + 1 as first_name_num, abs(random()) % (select count(*) from surname) + 1 as surname_num from cte
LIMIT 6000
),
result as (
select * from cte
join (select first_name, ROW_NUMBER() over (order by random()) as rn from first_name) fn -- this is always the same result
on cte.first_name_num = fn.rn
join (select surname, ROW_NUMBER() over (order by random()) as rn from surname) sn -- this updates every loop around except subqueries are compiled/cached or something so they are unusable here if you want updated values
on cte.surname_num = sn.rn
)
select first_name, surname from result
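The behaviour described above can be checked with a minimal sketch run through Python's built-in sqlite3 module. It is not the exact query from this post: the table contents are placeholder data, the CTE name picks is made up for illustration, and the join uses rowid instead of ROW_NUMBER(), which assumes the rowids are contiguous from 1 (true for freshly loaded tables with no deletes).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE first_name (first_name TEXT);
    CREATE TABLE surname (surname TEXT);
    """
)
# Placeholder data. Rowids come out as 1..50 because nothing is ever deleted
# (assumption: contiguous rowids, which the join below relies on).
conn.executemany(
    "INSERT INTO first_name VALUES (?)", [(f"first{i}",) for i in range(1, 51)]
)
conn.executemany(
    "INSERT INTO surname VALUES (?)", [(f"last{i}",) for i in range(1, 51)]
)

rows = conn.execute(
    """
    WITH RECURSIVE picks(first_id, surname_id) AS (
        SELECT abs(random() % (SELECT count(*) FROM first_name)) + 1,
               abs(random() % (SELECT count(*) FROM surname)) + 1
        UNION ALL
        -- random() is at the top level of the recursive SELECT,
        -- so it is evaluated again for every generated row.
        SELECT abs(random() % (SELECT count(*) FROM first_name)) + 1,
               abs(random() % (SELECT count(*) FROM surname)) + 1
        FROM picks
        LIMIT 500
    )
    SELECT f.first_name, s.surname
    FROM picks
    JOIN first_name AS f ON f.rowid = picks.first_id
    JOIN surname AS s ON s.rowid = picks.surname_id
    """
).fetchall()

print(len(rows), "pairs,", len(set(rows)), "distinct")
```

Because each pick is independent, the same pairing can appear more than once, which is the intended difference from a cross join. Taking the modulo before abs() also keeps the value small, so abs() never sees the most negative 64-bit integer.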
1 Answer
"I want to produce N number of pairings into a new table." Why not simply do a CROSS JOIN with a LIMIT clause, like so?
SELECT fn.first_name, sn.surname
FROM first_name fn
CROSS JOIN surname sn
LIMIT 2000;
This should produce a list of 2,000 distinct first_name and surname combinations (assuming neither table contains duplicates). It isn't truly random, though: without an ORDER BY clause the row order is simply unspecified, so the result is arbitrary rather than shuffled. But it's a much simpler query, especially if you only need to run it once to load a table.
You could also add an ORDER BY clause with the RANDOM() function to improve the randomness of the query above, like so:
SELECT fn.first_name, sn.surname
FROM first_name fn
CROSS JOIN surname sn
ORDER BY RANDOM()
LIMIT 2000;
My intuition tells me this isn't perfectly random either but should be a lot better than my first query, if you do need some randomness.
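To compare the two queries, here is a small sketch using Python's built-in sqlite3 module. The 20-row tables are placeholder data, and the helper distinct_first_names is made up for this illustration. Note that ORDER BY RANDOM() has to sort the whole cross product (rows in first_name times rows in surname) before the LIMIT applies, so it gets slower as the tables grow.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE first_name (first_name TEXT);
    CREATE TABLE surname (surname TEXT);
    """
)
# Placeholder data: 20 x 20 gives a cross product of 400 rows.
conn.executemany(
    "INSERT INTO first_name VALUES (?)", [(f"first{i}",) for i in range(20)]
)
conn.executemany(
    "INSERT INTO surname VALUES (?)", [(f"last{i}",) for i in range(20)]
)

# Plain LIMIT: whatever rows the engine happens to produce first.
plain = conn.execute(
    """
    SELECT fn.first_name, sn.surname
    FROM first_name fn
    CROSS JOIN surname sn
    LIMIT 100
    """
).fetchall()

# ORDER BY RANDOM(): shuffle the full cross product, then keep 100 rows.
shuffled = conn.execute(
    """
    SELECT fn.first_name, sn.surname
    FROM first_name fn
    CROSS JOIN surname sn
    ORDER BY RANDOM()
    LIMIT 100
    """
).fetchall()


def distinct_first_names(rows):
    """Count how many different first names appear in a result set."""
    return len({first for first, _ in rows})


print("plain LIMIT:", distinct_first_names(plain), "distinct first names")
print("ORDER BY RANDOM():", distinct_first_names(shuffled), "distinct first names")
print("duplicate pairs in shuffled:", len(shuffled) - len(set(shuffled)))
```

Either way, each pairing can appear at most once, since the rows are drawn from the cross product without replacement.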
Note: I'd also recommend not giving columns the exact same name as the tables they belong to. This can be syntactically confusing and hurt readability. Personally, I'd have a single names table with a field to denote the type of name (first name vs. surname, etc.). But if you want two explicit tables, I'd just name the column something generic such as name, since it's self-explanatory enough when you select first_name.name.
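As a rough sketch of the single-table layout, again via Python's sqlite3 module: the column name kind, its two allowed values, and most of the sample rows are assumptions made up for illustration, not something from the original schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical layout: one table, with a column saying what kind of name it is.
    CREATE TABLE names (
        name TEXT NOT NULL,
        kind TEXT NOT NULL CHECK (kind IN ('first', 'surname'))
    );
    """
)
conn.executemany(
    "INSERT INTO names (name, kind) VALUES (?, ?)",
    [
        ("james", "first"),
        ("mary", "first"),
        ("john", "first"),
        ("smith", "surname"),
        ("jones", "surname"),
        ("kincaid", "surname"),
    ],
)

# Self-join the table: one alias filtered to first names, the other to surnames.
pairs = conn.execute(
    """
    SELECT f.name AS first_name, s.name AS surname
    FROM names AS f
    CROSS JOIN names AS s
    WHERE f.kind = 'first' AND s.kind = 'surname'
    ORDER BY RANDOM()
    LIMIT 5
    """
).fetchall()

for first_name, surname in pairs:
    print(first_name, surname)
```

The aliases f and s do the job the two table names did before, so the pairing query stays about as short as the two-table version.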
Comments:
I guess the expectation is that I'd eventually want to weight the random with a normal distribution. So I'd have 1000's more smiths than kincaids. While your solution produces a homogenous mixing, it will work for testing my use cases. - Nathan Goings, Nov 15, 2021 at 0:16
@NathanGoings Sounds good. I did just update my answer with a way to improve the randomness, if that's of any benefit too. - J.D., Nov 15, 2021 at 1:09
@JD, I discovered the solution and updated my post. I've gone ahead and accepted your answer as it's another valid solution. - Nathan Goings, Nov 15, 2021 at 2:07