1

Let's say I have a table (with millions of rows)

cat1|cat2|value
A | B| 10 
...

and I want to group the data by cat1 and also split the grouping into 10 (at least roughly) equal partitions of the data.

The purpose of the random grouping is for the machine learning technique of cross-validation. (It's similar to splitting subjects into a control group and experimental group.)

What I would imagine to use is the following:

select cat1, r, sum(value)
from t 
group by cat1, floor(rand()*10) as r

which doesn't work.

One way which I believe would work would be simply to insert a new column with rand into my table and then group appropriately on that.

Is there any way of doing this grouping otherwise?

Michael Green
25.3k13 gold badges54 silver badges100 bronze badges
asked Jul 18, 2014 at 22:52
3
  • Don't you need to store the grouping assignment for each row so that the groupings can be replicated at a later date? Is it a business requirement that the groupings be random, or is it simply necessary that they roughly equal and non-contiguous? Commented Jul 19, 2014 at 3:17
  • @pieter - yes it should be random and non contiguous. I don't really need to store grouping assignment (but I should be able to recreate if pass same random number? - rows are not updated) Commented Jul 19, 2014 at 7:30
  • This would work: select cat1, r, sum(value) from (select cat1, value, floor(rand()*10) as r from t) as x group by cat1, r ; Commented Jul 19, 2014 at 15:27

2 Answers 2

1

Your post piqued my curiosity and I look at the RAND() function - which you can't perform GROUP BYs with. I found this and if your table has a PRIMARY KEY, there is a MySQL pseudo-column (_rowid), much like is found in Oracle, Firebird (and maybe others) then you can issue queries such as this one (see structure and data for comments table (sample) below).

select MOD(_rowid, 3) AS my_field, COUNT(ticket_id) 
FROM comments 
GROUP BY my_field;
+----------+------------------+
| my_field | count(ticket_id) |
+----------+------------------+
| 0 | 6 |
| 1 | 7 |
| 2 | 7 |
+----------+------------------+

Now, if your PK is some sort of AUTO INCREMENT and "random" with respect to your data, you could perhaps use the _rowid pseudo-column (with MOD) to generate a sufficient degree of randomness for your requirements, and perform your aggregate queries?

---- structure and data in comments table---

CREATE TABLE `comments` 
(
 `comment_id` int(11) NOT NULL,
 `ticket_id` int(11) NOT NULL,
 PRIMARY KEY (`comment_id`)
);
mysql> SELECT * FROM comments;
+------------+-----------+
| comment_id | ticket_id |
+------------+-----------+
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
| 5 | 2 |
| 6 | 2 |
| 7 | 2 |
| 8 | 2 |
| 9 | 3 |
| 10 | 3 |
| 11 | 3 |
| 12 | 3 |
| 13 | 4 |
| 14 | 4 |
| 15 | 4 |
| 16 | 4 |
| 17 | 5 |
| 18 | 5 |
| 19 | 5 |
| 20 | 5 |
+------------+-----------+
answered Jul 19, 2014 at 11:04
-1

My understanding is that functions, such as rand(), are executed once per column reference to in SQL. Thus any query which scans the whole table will produce the same "random" value for every row. I think you're going to have to iterate through your data setting random numbers one at a time, RBAR-stylie.

answered Jul 18, 2014 at 23:16
2
  • Why the down vote? Commented Jul 19, 2014 at 11:02
  • I guess because what you say about RAND() is not true in MySQL: SQL-Fiddle (I guess you confused this issue with SQL-Server) Commented Jul 19, 2014 at 15:20

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.