Skip to content

code.openark.org

Blog by Shlomi Noach

SQL: selecting top N records per group

A while back I presented(*) an SQL trick to present with non-aggregated column on a GROUP BY query, without use of subquery or derived tables.

Based on a similar concept, combined with string walking, I now present a query which selects top-n records for each group, ordered by some condition. It will require no subqueries. It executes faster than its more conventional alternatives.

[UPDATE: this is MySQL only. Others can use Window Functions where available]

Using the simple world database, we answer the following question:

What are the top 5 largest (by area) countries for each continent? What are their names, surface area and population?

Similar questions would be:

What were the latest 5 films rented by each customer?

What were the most presented advertisements for each user?

etc.

Step 1: getting the top

We already know how to get a single column’s value for the top country, as presented in the aforementioned post:

SELECT
 Continent,
 SUBSTRING_INDEX(
 GROUP_CONCAT(Name ORDER BY SurfaceArea DESC),
 ',', 1) AS Name
FROM
 Country
GROUP BY
 Continent
;
+---------------+--------------------+
| Continent   | Name        |
+---------------+--------------------+
| Asia     | China       |
| Europe    | Russian Federation |
| North America | Canada       |
| Africa    | Sudan       |
| Oceania    | Australia     |
| Antarctica  | Antarctica     |
| South America | Brazil       |
+---------------+--------------------+

Step 2: adding columns

This part is easy: just throw in the rest of the columns (again, only indicating the top country in each continent)

SELECT
 Continent,
 SUBSTRING_INDEX(
 GROUP_CONCAT(Name ORDER BY SurfaceArea DESC),
 ',', 1) AS Name,
 SUBSTRING_INDEX(
 GROUP_CONCAT(SurfaceArea ORDER BY SurfaceArea DESC),
 ',', 1) AS SurfaceArea,
 SUBSTRING_INDEX(
 GROUP_CONCAT(Population ORDER BY SurfaceArea DESC),
 ',', 1) AS Population
FROM
 Country
GROUP BY
 Continent
;
+---------------+--------------------+-------------+------------+
| Continent   | Name        | SurfaceArea | Population |
+---------------+--------------------+-------------+------------+
| Asia     | China       | 9572900.00 | 1277558000 |
| Europe    | Russian Federation | 17075400.00 | 146934000 |
| North America | Canada       | 9970610.00 | 31147000  |
| Africa    | Sudan       | 2505813.00 | 29490000  |
| Oceania    | Australia     | 7741220.00 | 18886000  |
| Antarctica  | Antarctica     | 13120000.00 | 0     |
| South America | Brazil       | 8547403.00 | 170115000 |
+---------------+--------------------+-------------+------------+

Step 3: casting

You’ll notice that the Population column from this last execution is aligned to the left. This is because it is believed to be a string. The GROUP_CONCAT clause concatenates values in one string, and SUBSTRING_INDEX parses a substring. The same applies to the SurfaceArea column. We’ll cast Population as UNSIGNED and SurfaceArea as DECIMAL:

SELECT
 Continent,
 SUBSTRING_INDEX(
 GROUP_CONCAT(Name ORDER BY SurfaceArea DESC),
 ',', 1) AS Name,
 CAST(
 SUBSTRING_INDEX(
 GROUP_CONCAT(SurfaceArea ORDER BY SurfaceArea DESC),
 ',', 1)
 AS DECIMAL(20,2)
 ) AS SurfaceArea,
 CAST(
 SUBSTRING_INDEX(
 GROUP_CONCAT(Population ORDER BY SurfaceArea DESC),
 ',', 1)
 AS UNSIGNED
 ) AS Population
FROM
 Country
GROUP BY
 Continent
;
+---------------+--------------------+-------------+------------+
| Continent   | Name        | SurfaceArea | Population |
+---------------+--------------------+-------------+------------+
| Asia     | China       | 9572900.00 | 1277558000 |
| Europe    | Russian Federation | 17075400.00 | 146934000 |
| North America | Canada       | 9970610.00 |  31147000 |
| Africa    | Sudan       | 2505813.00 |  29490000 |
| Oceania    | Australia     | 7741220.00 |  18886000 |
| Antarctica  | Antarctica     | 13120000.00 |     0 |
| South America | Brazil       | 8547403.00 | 170115000 |
+---------------+--------------------+-------------+------------+

Step 4: top n records

It’s time to use string walking. Examples for string walking (described in the excellent SQL Cookbook) can be found here, here and here. We’ll be using a numbers table: a simple table which lists ascending integer numbers. For example, you can use the following:

DROP TABLE IF EXISTS `tinyint_asc`;
CREATE TABLE `tinyint_asc` (
 `value` tinyint(3) unsigned NOT NULL default '0',
 PRIMARY KEY (value)
) ;
INSERT INTO `tinyint_asc` VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41),(42),(43),(44),(45),(46),(47),(48),(49),(50),(51),(52),(53),(54),(55),(56),(57),(58),(59),(60),(61),(62),(63),(64),(65),(66),(67),(68),(69),(70),(71),(72),(73),(74),(75),(76),(77),(78),(79),(80),(81),(82),(83),(84),(85),(86),(87),(88),(89),(90),(91),(92),(93),(94),(95),(96),(97),(98),(99),(100),(101),(102),(103),(104),(105),(106),(107),(108),(109),(110),(111),(112),(113),(114),(115),(116),(117),(118),(119),(120),(121),(122),(123),(124),(125),(126),(127),(128),(129),(130),(131),(132),(133),(134),(135),(136),(137),(138),(139),(140),(141),(142),(143),(144),(145),(146),(147),(148),(149),(150),(151),(152),(153),(154),(155),(156),(157),(158),(159),(160),(161),(162),(163),(164),(165),(166),(167),(168),(169),(170),(171),(172),(173),(174),(175),(176),(177),(178),(179),(180),(181),(182),(183),(184),(185),(186),(187),(188),(189),(190),(191),(192),(193),(194),(195),(196),(197),(198),(199),(200),(201),(202),(203),(204),(205),(206),(207),(208),(209),(210),(211),(212),(213),(214),(215),(216),(217),(218),(219),(220),(221),(222),(223),(224),(225),(226),(227),(228),(229),(230),(231),(232),(233),(234),(235),(236),(237),(238),(239),(240),(241),(242),(243),(244),(245),(246),(247),(248),(249),(250),(251),(252),(253),(254),(255);

The trick is to apply the same technique as used above, not for a single row, but for several rows. Here’s how to present the top 5 countries:

SELECT
 Continent,
 SUBSTRING_INDEX(
 SUBSTRING_INDEX(
 GROUP_CONCAT(Name ORDER BY SurfaceArea DESC),
 ',', value),
 ',', -1)
 AS Name,
 CAST(
 SUBSTRING_INDEX(
 SUBSTRING_INDEX(
 GROUP_CONCAT(SurfaceArea ORDER BY SurfaceArea DESC),
 ',', value),
 ',', -1)
 AS DECIMAL(20,2)
 ) AS SurfaceArea,
 CAST(
 SUBSTRING_INDEX(
 SUBSTRING_INDEX(
 GROUP_CONCAT(Population ORDER BY SurfaceArea DESC),
 ',', value),
 ',', -1)
 AS UNSIGNED
 ) AS Population
FROM
 Country, tinyint_asc
WHERE
 tinyint_asc.value >= 1 AND tinyint_asc.value <= 5
GROUP BY
 Continent, value
;
+---------------+----------------------------------------------+-------------+------------+
| Continent   | Name                     | SurfaceArea | Population |
+---------------+----------------------------------------------+-------------+------------+
| Asia     | China                    | 9572900.00 | 1277558000 |
| Asia     | India                    | 3287263.00 | 1013662000 |
| Asia     | Kazakstan                  | 2724900.00 |  16223000 |
| Asia     | Saudi Arabia                 | 2149690.00 |  21607000 |
| Asia     | Indonesia                  | 1904569.00 | 212107000 |
| Europe    | Russian Federation              | 17075400.00 | 146934000 |
| Europe    | Ukraine                   |  603700.00 |  50456000 |
| Europe    | France                    |  551500.00 |  59225700 |
| Europe    | Spain                    |  505992.00 |  39441700 |
| Europe    | Sweden                    |  449964.00 |  8861400 |
| North America | Canada                    | 9970610.00 |  31147000 |
| North America | United States                | 9363520.00 | 278357000 |
| North America | Greenland                  | 2166090.00 |   56000 |
| North America | Mexico                    | 1958201.00 |  98881000 |
| North America | Nicaragua                  |  130000.00 |  5074000 |
| Africa    | Sudan                    | 2505813.00 |  29490000 |
| Africa    | Algeria                   | 2381741.00 |  31471000 |
| Africa    | Congo                    | 2344858.00 |  51654000 |
| Africa    | The Democratic Republic of the       | 1759540.00 |  5605000 |
| Africa    | Libyan Arab Jamahiriya            | 1284000.00 |  7651000 |
| Oceania    | Australia                  | 7741220.00 |  18886000 |
| Oceania    | Papua New Guinea               |  462840.00 |  4807000 |
| Oceania    | New Zealand                 |  270534.00 |  3862000 |
| Oceania    | Solomon Islands               |  28896.00 |   444000 |
| Oceania    | New Caledonia                |  18575.00 |   214000 |
| Antarctica  | Antarctica                  | 13120000.00 |     0 |
| Antarctica  | French Southern territories         |   7780.00 |     0 |
| Antarctica  | South Georgia and the South Sandwich Islands |   3903.00 |     0 |
| Antarctica  | Heard Island and McDonald Islands      |   359.00 |     0 |
| Antarctica  | Bouvet Island                |    59.00 |     0 |
| South America | Brazil                    | 8547403.00 | 170115000 |
| South America | Argentina                  | 2780400.00 |  37032000 |
| South America | Peru                     | 1285216.00 |  25662000 |
| South America | Colombia                   | 1138914.00 |  42321000 |
| South America | Bolivia                   | 1098581.00 |  8329000 |
+---------------+----------------------------------------------+-------------+------------+

Limitations

You should have:

  • Enough numbers in the numbers table (I’ve used 5 out of 255)
  • Reasonable setting for group_concat_max_len (see this post). Actually it would be better to have a smaller value here, while you make sure it’s large enough; this way you do not waste memory for large groups.

(*) This was two years ago! I’m getting old

Update: see also

Another hack at same problem: SQL: selecting top N records per group, another solution

27 thoughts on “SQL: selecting top N records per group

  1. @Erich,

    Yes, the use of commas assume no commas in values; I also sometimes use char(0) or char(9) which nobody uses within texts

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.

AltStyle によって変換されたページ (->オリジナル) /