Perform group by with random group expression (randomly partitioning rows) MYSQL

Question 1

Let's say I have a table (with millions of rows)

cat1|cat2|value
A | B| 10 
...

and I want to group the data by cat1 and also split the grouping into 10 (at least roughly) equal partitions of the data.

The purpose of the random grouping is for the machine learning technique of cross-validation. (It's similar to splitting subjects into a control group and experimental group.)

What I would imagine to use is the following:

select cat1, r, sum(value)
from t 
group by cat1, floor(rand()*10) as r

which doesn't work.

One way which I believe would work would be simply to insert a new column with rand into my table and then group appropriately on that.

Is there any way of doing this grouping otherwise?

Question 2

Don't you need to store the grouping assignment for each row so that the groupings can be replicated at a later date? Is it a business requirement that the groupings be random, or is it simply necessary that they roughly equal and non-contiguous?

Question 3

@pieter - yes it should be random and non contiguous. I don't really need to store grouping assignment (but I should be able to recreate if pass same random number? - rows are not updated)

Question 4

This would work: select cat1, r, sum(value) from (select cat1, value, floor(rand()*10) as r from t) as x group by cat1, r ;

Question 5

Your post piqued my curiosity and I look at the RAND() function - which you can't perform GROUP BYs with. I found this and if your table has a PRIMARY KEY, there is a MySQL pseudo-column (_rowid), much like is found in Oracle, Firebird (and maybe others) then you can issue queries such as this one (see structure and data for comments table (sample) below).

select MOD(_rowid, 3) AS my_field, COUNT(ticket_id) 
FROM comments 
GROUP BY my_field;
+----------+------------------+
| my_field | count(ticket_id) |
+----------+------------------+
| 0 | 6 |
| 1 | 7 |
| 2 | 7 |
+----------+------------------+

Now, if your PK is some sort of AUTO INCREMENT and "random" with respect to your data, you could perhaps use the _rowid pseudo-column (with MOD) to generate a sufficient degree of randomness for your requirements, and perform your aggregate queries?

---- structure and data in comments table---

CREATE TABLE `comments` 
(
 `comment_id` int(11) NOT NULL,
 `ticket_id` int(11) NOT NULL,
 PRIMARY KEY (`comment_id`)
);
mysql> SELECT * FROM comments;
+------------+-----------+
| comment_id | ticket_id |
+------------+-----------+
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 5 | 2 |
| 6 | 2 |
| 7 | 2 |
| 8 | 2 |
| 9 | 3 |
| 10 | 3 |
| 11 | 3 |
| 12 | 3 |
| 13 | 4 |
| 14 | 4 |
| 15 | 4 |
| 16 | 4 |
| 17 | 5 |
| 18 | 5 |
| 19 | 5 |
| 20 | 5 |
+------------+-----------+

Question 6

My understanding is that functions, such as rand(), are executed once per column reference to in SQL. Thus any query which scans the whole table will produce the same "random" value for every row. I think you're going to have to iterate through your data setting random numbers one at a time, RBAR-stylie.

Question 7

Why the down vote?

Question 8

I guess because what you say about RAND() is not true in MySQL: SQL-Fiddle (I guess you confused this issue with SQL-Server)

Vérace Vérace 31k9 gold badges73 silver badges86 bronze badges · Answer 1 · 2014-07-19 11:04:55Z

Your post piqued my curiosity and I look at the RAND() function - which you can't perform GROUP BYs with. I found this and if your table has a PRIMARY KEY, there is a MySQL pseudo-column (_rowid), much like is found in Oracle, Firebird (and maybe others) then you can issue queries such as this one (see structure and data for comments table (sample) below).

select MOD(_rowid, 3) AS my_field, COUNT(ticket_id) 
FROM comments 
GROUP BY my_field;
+----------+------------------+
| my_field | count(ticket_id) |
+----------+------------------+
| 0 | 6 |
| 1 | 7 |
| 2 | 7 |
+----------+------------------+

Now, if your PK is some sort of AUTO INCREMENT and "random" with respect to your data, you could perhaps use the _rowid pseudo-column (with MOD) to generate a sufficient degree of randomness for your requirements, and perform your aggregate queries?

---- structure and data in comments table---

CREATE TABLE `comments` 
(
 `comment_id` int(11) NOT NULL,
 `ticket_id` int(11) NOT NULL,
 PRIMARY KEY (`comment_id`)
);
mysql> SELECT * FROM comments;
+------------+-----------+
| comment_id | ticket_id |
+------------+-----------+
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 5 | 2 |
| 6 | 2 |
| 7 | 2 |
| 8 | 2 |
| 9 | 3 |
| 10 | 3 |
| 11 | 3 |
| 12 | 3 |
| 13 | 4 |
| 14 | 4 |
| 15 | 4 |
| 16 | 4 |
| 17 | 5 |
| 18 | 5 |
| 19 | 5 |
| 20 | 5 |
+------------+-----------+

score -1 · Answer 2 · 2014-07-18 23:16:21Z

-1

My understanding is that functions, such as rand(), are executed once per column reference to in SQL. Thus any query which scans the whole table will produce the same "random" value for every row. I think you're going to have to iterate through your data setting random numbers one at a time, RBAR-stylie.

Share

Improve this answer

answered Jul 18, 2014 at 23:16

Michael Green's user avatar

Michael Green Michael Green

25.3k13 gold badges54 silver badges100 bronze badges

2

Why the down vote?

Michael Green
– Michael Green

2014年07月19日 11:02:36 +00:00
Commented Jul 19, 2014 at 11:02
I guess because what you say about RAND() is not true in MySQL: SQL-Fiddle (I guess you confused this issue with SQL-Server)

ypercubeᵀᴹ
– ypercubeᵀᴹ

2014年07月19日 15:20:13 +00:00
Commented Jul 19, 2014 at 15:20

Add a comment |

Stack Exchange Network

Perform group by with random group expression (randomly partitioning rows) MYSQL

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Perform group by with random group expression (randomly partitioning rows) MYSQL

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions