Let's say I have a table (with millions of rows)
cat1|cat2|value
A | B| 10
...
and I want to group the data by cat1
and also split the grouping into 10 (at least roughly) equal partitions of the data.
The purpose of the random grouping is for the machine learning technique of cross-validation. (It's similar to splitting subjects into a control group and experimental group.)
What I would imagine to use is the following:
select cat1, r, sum(value)
from t
group by cat1, floor(rand()*10) as r
which doesn't work.
One way which I believe would work would be simply to insert a new column with rand into my table and then group appropriately on that.
Is there any way of doing this grouping otherwise?
2 Answers 2
Your post piqued my curiosity and I look at the RAND() function - which you can't perform GROUP BYs with. I found this and if your table has a PRIMARY KEY, there is a MySQL pseudo-column (_rowid), much like is found in Oracle, Firebird (and maybe others) then you can issue queries such as this one (see structure and data for comments table (sample) below).
select MOD(_rowid, 3) AS my_field, COUNT(ticket_id)
FROM comments
GROUP BY my_field;
+----------+------------------+
| my_field | count(ticket_id) |
+----------+------------------+
| 0 | 6 |
| 1 | 7 |
| 2 | 7 |
+----------+------------------+
Now, if your PK is some sort of AUTO INCREMENT and "random" with respect to your data, you could perhaps use the _rowid pseudo-column (with MOD) to generate a sufficient degree of randomness for your requirements, and perform your aggregate queries?
---- structure and data in comments table---
CREATE TABLE `comments`
(
`comment_id` int(11) NOT NULL,
`ticket_id` int(11) NOT NULL,
PRIMARY KEY (`comment_id`)
);
mysql> SELECT * FROM comments;
+------------+-----------+
| comment_id | ticket_id |
+------------+-----------+
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 5 | 2 |
| 6 | 2 |
| 7 | 2 |
| 8 | 2 |
| 9 | 3 |
| 10 | 3 |
| 11 | 3 |
| 12 | 3 |
| 13 | 4 |
| 14 | 4 |
| 15 | 4 |
| 16 | 4 |
| 17 | 5 |
| 18 | 5 |
| 19 | 5 |
| 20 | 5 |
+------------+-----------+
My understanding is that functions, such as rand()
, are executed once per column reference to in SQL. Thus any query which scans the whole table will produce the same "random" value for every row. I think you're going to have to iterate through your data setting random numbers one at a time, RBAR-stylie.
-
Why the down vote?Michael Green– Michael Green2014年07月19日 11:02:36 +00:00Commented Jul 19, 2014 at 11:02
-
I guess because what you say about
RAND()
is not true in MySQL: SQL-Fiddle (I guess you confused this issue with SQL-Server)ypercubeᵀᴹ– ypercubeᵀᴹ2014年07月19日 15:20:13 +00:00Commented Jul 19, 2014 at 15:20
select cat1, r, sum(value) from (select cat1, value, floor(rand()*10) as r from t) as x group by cat1, r ;