I have this simple query:
SELECT `value`, count(*) as `noHits`
FROM `topic`
WHERE identifier IN ('" . implode("','", array_keys($processedIdentifier)) ."')
GROUP BY `value`
ORDER BY count(*) DESC LIMIT 150
It generates a list of records such as:
keyword1, 100
keyword2, 90
keyword3, 40
keyword4, 40
keyword5, 40
keyword6, 40
...
keyword1500, 40
My challenge is to select a list of top 150 records ordered by count(*) DESC
however when there are more than 1 value having the same count(*) as in the list above keyword3 to keyword1500 select them randomly from the database.
It seems as if the selection is not random from the tests I made.
The result is used to generate a wordcloud and I do not want the same keywords to always be selected (out of thousands) when they have the same count(*)
2 Answers 2
To solve this I did the following (see the fiddle)
Created a table:
CREATE TABLE test -- 22 records
(
the_word TEXT,
no_of_times INTEGER
);
Populated it with the data at the end of this answer.
Then ran the following query:
SELECT the_word, no_of_times FROM test
ORDER BY no_of_times DESC, RAND() LIMIT 15;
I tested this several times and obtained results like the following. The first 6 records were always the same - keywords1-6. The remaining 9 records were always randomly from keyword7-22.
Result (sample - first 10 shown):
the_word no_of_times
keyword1 100
keyword2 90
keyword3 80
keyword4 70
keyword5 60
keyword6 50
keyword17 40
keyword19 40
keyword22 40
keyword18 40
This is considerably more elegant than my first answer - guess I overcomplicated things!:-)
=========================== Data (22 records) ================
INSERT INTO test VALUES ('keyword1', 100);
INSERT INTO test VALUES ('keyword2', 90);
INSERT INTO test VALUES ('keyword3', 80);
INSERT INTO test VALUES ('keyword4', 70);
INSERT INTO test VALUES ('keyword5', 60);
INSERT INTO test VALUES ('keyword6', 50);
INSERT INTO test VALUES ('keyword7', 40);
INSERT INTO test VALUES ('keyword8', 40);
INSERT INTO test VALUES ('keyword9', 40);
INSERT INTO test VALUES ('keyword10', 40);
INSERT INTO test VALUES ('keyword11', 40);
INSERT INTO test VALUES ('keyword12', 40);
INSERT INTO test VALUES ('keyword13', 40);
INSERT INTO test VALUES ('keyword14', 40);
INSERT INTO test VALUES ('keyword15', 40);
INSERT INTO test VALUES ('keyword16', 40);
INSERT INTO test VALUES ('keyword17', 40);
INSERT INTO test VALUES ('keyword18', 40);
INSERT INTO test VALUES ('keyword19', 40);
INSERT INTO test VALUES ('keyword20', 40);
INSERT INTO test VALUES ('keyword21', 40);
INSERT INTO test VALUES ('keyword22', 40);
-
Final comment in this thread! :-) I've done up another fiddle with two ways that COUNT(*) and SUM(no_of_times) can be used to perform the calculations - it is predicated on a slightly different interpretation of your original data which I believe is what Akina was thinking? At bottom of fiddle!Vérace– Vérace2019年09月21日 14:21:18 +00:00Commented Sep 21, 2019 at 14:21
This answer is far too complex - might be good for learning about CTEs, but check my much more elegant answer here.
This proved to be quite tricky - it involves chaining CTE
s (Common Table Expressions - aka the WITH clause
- an excelent site). It was made all the more difficult because of the fact that one isn't allowed to put SQL expressions into LIMIT
and OFFSET
clauses - the solution would have been considerably easier and more elegant.
I'll explain the logic that I used as I go along - there's a fiddle available here - you'll need MySQL 8
for Window (or Analytic) function. As I mentioned, I chained a series of CTE
s so I'll go through the process step by step.
First, I created and populated a table - see the bottom of this answer for that data or the fiddle.
Then I ran this:
WITH cte1 AS
(
SELECT 15 AS tot_num_recs_wanted -- the 15 could come from a table
),
cte2 AS
(
SELECT
*,
ROW_NUMBER() OVER (ORDER BY no_of_times DESC) AS rn
FROM test
)
SELECT * FROM cte2;
Result:
the_word no_of_times rn
keyword1 100 1
keyword2 90 2
keyword3 80 3
keyword4 70 4
keyword5 60 5
keyword6 50 6
keyword7 40 7
keyword8 40 8
keyword9 40 9
keyword10 40 10
-- further data snipped for brevity - see fiddle
-- this could have been made much easier if
-- could have used SQL in the LIMIT and OFFSET clauses
-- There would have been no need to use (another) CTE
-- or the ROW_NUMBER() Window function.
So, we can see that we require 15 records. So, we need to select the first 6 by default (no_of_times 50 - 100) and then 9 records randomly from the remaining 22 (no_of_times = 40).
-- above snipped, again for brevity
cte3 AS
(
SELECT rn, no_of_times nt
FROM cte2
WHERE rn = (SELECT tot_num_recs_wanted FROM cte1)
)
SELECT * FROM cte3;
Result:
rn nt
15 40
So, we know that we require 15 records and the cutoff is 40.
Then I ran:
...snipped...
cte4 AS
(
SELECT
*,
ROW_NUMBER() OVER (ORDER BY RAND()) AS rand_rn
FROM test
WHERE no_of_times = (SELECT nt FROM cte3) -- i.e. 40
)
SELECT * FROM cte4;
Result:
the_word no_of_times rand_rn
keyword22 40 1
keyword11 40 2
keyword9 40 3
keyword8 40 4
keyword10 40 5
keyword16 40 6
keyword7 40 7
keyword15 40 8
keyword13 40 9
keyword21 40 10
-- snipped...
So, now I have the 16 records that have no_of_times that are equal to 40 and, more importantly ordered using the RAND()
function.
Then, I select from that table using this SQL:
, cte5 AS
(
SELECT * FROM cte4
WHERE rand_rn <= (SELECT tot_num_recs_wanted FROM cte1)
- (SELECT COUNT(*) FROM test WHERE no_of_times > (SELECT nt FROM cte3))
)
SELECT * FROM cte5;
-- again, this could have been made much easier if
-- could have used SQL in the LIMIT and OFFSET clauses
-- There would have been no need to use (another) CTE
-- or the ROW_NUMBER() Window function.
Result:
the_word no_of_times rand_rn
keyword9 40 1
keyword18 40 2
keyword12 40 3
keyword11 40 4
keyword13 40 5
keyword22 40 6
keyword19 40 7
keyword15 40 8
keyword10 40 9
-- **__9 rows__** --
9 rows + the 6 that have no_of_points
greater than 40 gives the 15 required rows.
The key part of the above SQL is:
WHERE rand_rn <= (SELECT tot_num_recs_wanted FROM cte1)
- (SELECT COUNT(*) FROM test WHERE no_of_times > (SELECT nt FROM cte3))
This is 15 - 6 which is 9 - exactly what we needed for the random component of our records.
And finally, I ran a UNION
between my random records and the ones with no_of_points
above 40 like this:
SELECT the_word, no_of_times FROM cte5
UNION
SELECT the_word, no_of_times FROM test
WHERE no_of_times > (SELECT nt FROM cte3)
ORDER BY no_of_times DESC;
Result:
the_word no_of_times
keyword1 100
keyword2 90
keyword3 80
keyword4 70
keyword5 60
keyword6 50
keyword11 40
keyword14 40
keyword21 40
-- snipped --
The first 6 keywords are invariant, but if you run the fiddle several times, you will see that the keywords> 6 (i.e. no_of_points
= 40) change on each run.
I'm not sure how efficient this solution is, but suitable indexes might help. I hope this answers your question - if not let me know. p.s. welcome to the forum! :-)
============================== Table and data ======================
CREATE TABLE test -- 22 records
(
the_word TEXT,
no_of_times INTEGER
);
INSERT INTO test VALUES ('keyword1', 100);
INSERT INTO test VALUES ('keyword2', 90);
INSERT INTO test VALUES ('keyword3', 80);
INSERT INTO test VALUES ('keyword4', 70);
INSERT INTO test VALUES ('keyword5', 60);
INSERT INTO test VALUES ('keyword6', 50);
INSERT INTO test VALUES ('keyword7', 40);
INSERT INTO test VALUES ('keyword8', 40);
INSERT INTO test VALUES ('keyword9', 40);
INSERT INTO test VALUES ('keyword10', 40);
INSERT INTO test VALUES ('keyword11', 40);
INSERT INTO test VALUES ('keyword12', 40);
INSERT INTO test VALUES ('keyword13', 40);
INSERT INTO test VALUES ('keyword14', 40);
INSERT INTO test VALUES ('keyword15', 40);
INSERT INTO test VALUES ('keyword16', 40);
INSERT INTO test VALUES ('keyword17', 40);
INSERT INTO test VALUES ('keyword18', 40);
INSERT INTO test VALUES ('keyword19', 40);
INSERT INTO test VALUES ('keyword20', 40);
INSERT INTO test VALUES ('keyword21', 40);
INSERT INTO test VALUES ('keyword22', 40);
-
Thanks @Vérace for a really cool solution and for the welcome. I too thought it might be trickier than what first appears. We will test your implementation later. Unfortunately in our current production environment we still run MySQL 5.5.62-0+deb8u1 - (Debian),,, so we would need to wait before we can implement it..Ola– Ola2019年09月21日 09:09:30 +00:00Commented Sep 21, 2019 at 9:09
-
It appears that I spoke too soon. It appears that a modified version of Akina's solution works nicely on 5.6 - see the fiddle [here]( dbfiddle.uk/…). Yhould still get version 8 though :-)Vérace– Vérace2019年09月21日 11:10:35 +00:00Commented Sep 21, 2019 at 11:10
-
Simply adding RAND() after COUNT() in order does not work in our enivronment.. please see screen shots here: imgur.com/a/Dv668NU? ping @VèraceOla– Ola2019年09月21日 12:08:31 +00:00Commented Sep 21, 2019 at 12:08
-
-
It is fine - no need - I tried again in our environment and now it seems fine when adding the RAND() after COUNT(*) DESC.. strange however we will later validate with our solution that the keywords are in order and inline with your fiddle and Akinas solution.Ola– Ola2019年09月21日 12:32:27 +00:00Commented Sep 21, 2019 at 12:32
ORDER BY COUNT(*), RAND() LIMIT 150
SELECT
, but instead assumes a temp table. @Akina should present is 'trivial' modification as an Answer, not just a Comment.