How to limit an inner query by outer query attribute

Question 1

I have a requirement to sort a table by multiple criteria, select the first and associate with another table. The DB is Postgres.

Let's say we have players and teams where player has played from several teams. This is not the real composition of the model but should suffice for our example.

The requirement is to get the most recent record of the team the player played for, ordered by teams.end_date if it's present, otherwise by teams.start_date.

The algorithm would seem to be group by players.id, and order by the criteria mentioned above. However all of my attempts to limit the results to 1 gives me the first team record that is ordered by that criteria.

Here's what I think is the closest I got:

SELECT 
 latest_teams.* 
FROM players
 JOIN (
 SELECT 
 player_id,
 team_name,
 start_date,
 end_date
 FROM teams
 ORDER BY .... --this part is fine
 limit 1
 ) AS latest_teams
 ON players.id = teams.player_id

How can I get the scope of the inner query to be that of the player id?

Question 2

You could try breaking the query into multiple parts. You could use a temp table to store the initial set of data, then select the next set of data by joining. The temp table could be replaced by a common table expression as well. I find that complex joins (based on results of subqueries, ones that contain OR's, etc) tend to get overly complex and difficult to debug later on. This is my opinion formed after years of rewriting other peoples code.

Question 3

First, if you SELECT only from the teams table and there is a FOREIGN KEY constraint that REFERENCES players, you don't need the join to players at all.

Now, there are various ways to write this type of queries, there is even a tag at SO and DBA.SE, greatest-n-per-group. We need the greatest 1 in this case. The most simple code is to use the DISTINCT ON construct (which is a Postgres, non-standard addition to SQL).
If you need columns from players, you can simply join and add the columns in the select list.:

SELECT DISTINCT ON (t.player_id)
 t.player_id,
 t.team_name,
 t.start_date,
 t.end_date
FROM teams AS t
 -- JOIN players AS p 
 -- ON p.id = t.player_id
ORDER BY 
 t.player_id, -- this needs to match the DISTINCT ON ()
 .... -- pick what order you want, so the 1st is chosen for each player
 ; -- no LIMIT

Another way that works in most other DBMS is to use window functions. The criteria for which row to be chosen goes inside the OVER clause:

SELECT 
 g.player_id,
 g.team_name,
 g.start_date,
 g.end_date
FROM
 ( SELECT 
 t.player_id,
 t.team_name,
 t.start_date,
 t.end_date,
 ROW_NUMBER() OVER (PARTITION BY t.player_id 
 ORDER BY ...) AS rn
 FROM teams AS t
 -- JOIN players AS p 
 -- ON p.id = t.player_id
 ) AS g
WHERE
 g.rn = 1 ;

If you are not in some ancient version of Postgres, there is also the LATERAL syntax, that is often the most efficient. Especially when there is a "driving" table with the distinct values we want to base the grouping (the players here) and if these values are relatively small in number and have many possible options (i.e. in this case, the biggest the teams table is, in comparison to the players, the better for this query).
Also notice how this query resembles very much your initial idea. It does, literally, exactly what you want: allow the players columns to be referenced inside the lt subquery:

SELECT
 lt.player_id,
 lt.team_name,
 lt.start_date,
 lt.end_date
FROM players AS p
 JOIN LATERAL
 ( SELECT t.*
 FROM teams AS t 
 WHERE p.id = t.player_id
 ORDER BY .... -- pick the order
 LIMIT 1 
 ) AS lt ON TRUE ;

Question 4

I used the lateral join and it worked, but it is causing the performance to suffer. It's a huge query, and it went from about .5-1 sec to execute to now being like 5+ seconds. Is that what you'd expect? It's important to note that the order by is 4 columns, plus a derived value from a case statement.

Question 5

Indexes would help but which index to add depends on the ordering condition. For example, if you want to find the team with the latest start_date, an index on (player_id, start_date) would be fine.

Question 6

Which query to choose will depend on the distribution of your data. How many rows do the 2 tables have? Of course you can try and test performance of all the different ways to write this (there are more!)

ypercubeTM ypercubeTM 99.7k13 gold badges217 silver badges306 bronze badges · Accepted Answer · 2016-10-04 16:42:31Z

First, if you SELECT only from the teams table and there is a FOREIGN KEY constraint that REFERENCES players, you don't need the join to players at all.

Now, there are various ways to write this type of queries, there is even a tag at SO and DBA.SE, greatest-n-per-group. We need the greatest 1 in this case. The most simple code is to use the DISTINCT ON construct (which is a Postgres, non-standard addition to SQL).
If you need columns from players, you can simply join and add the columns in the select list.:

SELECT DISTINCT ON (t.player_id)
 t.player_id,
 t.team_name,
 t.start_date,
 t.end_date
FROM teams AS t
 -- JOIN players AS p 
 -- ON p.id = t.player_id
ORDER BY 
 t.player_id, -- this needs to match the DISTINCT ON ()
 .... -- pick what order you want, so the 1st is chosen for each player
 ; -- no LIMIT

Another way that works in most other DBMS is to use window functions. The criteria for which row to be chosen goes inside the OVER clause:

SELECT 
 g.player_id,
 g.team_name,
 g.start_date,
 g.end_date
FROM
 ( SELECT 
 t.player_id,
 t.team_name,
 t.start_date,
 t.end_date,
 ROW_NUMBER() OVER (PARTITION BY t.player_id 
 ORDER BY ...) AS rn
 FROM teams AS t
 -- JOIN players AS p 
 -- ON p.id = t.player_id
 ) AS g
WHERE
 g.rn = 1 ;

If you are not in some ancient version of Postgres, there is also the LATERAL syntax, that is often the most efficient. Especially when there is a "driving" table with the distinct values we want to base the grouping (the players here) and if these values are relatively small in number and have many possible options (i.e. in this case, the biggest the teams table is, in comparison to the players, the better for this query).
Also notice how this query resembles very much your initial idea. It does, literally, exactly what you want: allow the players columns to be referenced inside the lt subquery:

SELECT
 lt.player_id,
 lt.team_name,
 lt.start_date,
 lt.end_date
FROM players AS p
 JOIN LATERAL
 ( SELECT t.*
 FROM teams AS t 
 WHERE p.id = t.player_id
 ORDER BY .... -- pick the order
 LIMIT 1 
 ) AS lt ON TRUE ;

I used the lateral join and it worked, but it is causing the performance to suffer. It's a huge query, and it went from about .5-1 sec to execute to now being like 5+ seconds. Is that what you'd expect? It's important to note that the order by is 4 columns, plus a derived value from a case statement.
Indexes would help but which index to add depends on the ordering condition. For example, if you want to find the team with the latest start_date, an index on (player_id, start_date) would be fine.
Which query to choose will depend on the distribution of your data. How many rows do the 2 tables have? Of course you can try and test performance of all the different ways to write this (there are more!)

Stack Exchange Network

How to limit an inner query by outer query attribute

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

How to limit an inner query by outer query attribute

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions