I have a requirement to sort a table by multiple criteria, select the first and associate with another table. The DB is Postgres.
Let's say we have players
and teams
where player has played from several teams. This is not the real composition of the model but should suffice for our example.
The requirement is to get the most recent record of the team
the player
played for, ordered by teams.end_date
if it's present, otherwise by teams.start_date
.
The algorithm would seem to be group by players.id
, and order by the criteria mentioned above. However all of my attempts to limit the results to 1 gives me the first team
record that is ordered by that criteria.
Here's what I think is the closest I got:
SELECT
latest_teams.*
FROM players
JOIN (
SELECT
player_id,
team_name,
start_date,
end_date
FROM teams
ORDER BY .... --this part is fine
limit 1
) AS latest_teams
ON players.id = teams.player_id
How can I get the scope of the inner query to be that of the player id?
-
You could try breaking the query into multiple parts. You could use a temp table to store the initial set of data, then select the next set of data by joining. The temp table could be replaced by a common table expression as well. I find that complex joins (based on results of subqueries, ones that contain OR's, etc) tend to get overly complex and difficult to debug later on. This is my opinion formed after years of rewriting other peoples code.datagod– datagod2016年10月04日 15:24:22 +00:00Commented Oct 4, 2016 at 15:24
1 Answer 1
First, if you SELECT
only from the teams
table and there is a FOREIGN KEY
constraint that REFERENCES players
, you don't need the join to players
at all.
Now, there are various ways to write this type of queries, there is even a tag at SO and DBA.SE, greatest-n-per-group. We need the greatest 1 in this case. The most simple code is to use the DISTINCT ON
construct (which is a Postgres, non-standard addition to SQL).
If you need columns from players
, you can simply join and add the columns in the select list.:
SELECT DISTINCT ON (t.player_id)
t.player_id,
t.team_name,
t.start_date,
t.end_date
FROM teams AS t
-- JOIN players AS p
-- ON p.id = t.player_id
ORDER BY
t.player_id, -- this needs to match the DISTINCT ON ()
.... -- pick what order you want, so the 1st is chosen for each player
; -- no LIMIT
Another way that works in most other DBMS is to use window functions. The criteria for which row to be chosen goes inside the OVER
clause:
SELECT
g.player_id,
g.team_name,
g.start_date,
g.end_date
FROM
( SELECT
t.player_id,
t.team_name,
t.start_date,
t.end_date,
ROW_NUMBER() OVER (PARTITION BY t.player_id
ORDER BY ...) AS rn
FROM teams AS t
-- JOIN players AS p
-- ON p.id = t.player_id
) AS g
WHERE
g.rn = 1 ;
If you are not in some ancient version of Postgres, there is also the LATERAL
syntax, that is often the most efficient. Especially when there is a "driving" table with the distinct values we want to base the grouping (the players
here) and if these values are relatively small in number and have many possible options (i.e. in this case, the biggest the teams
table is, in comparison to the players
, the better for this query).
Also notice how this query resembles very much your initial idea. It does, literally, exactly what you want: allow the players
columns to be referenced inside the lt
subquery:
SELECT
lt.player_id,
lt.team_name,
lt.start_date,
lt.end_date
FROM players AS p
JOIN LATERAL
( SELECT t.*
FROM teams AS t
WHERE p.id = t.player_id
ORDER BY .... -- pick the order
LIMIT 1
) AS lt ON TRUE ;
-
I used the lateral join and it worked, but it is causing the performance to suffer. It's a huge query, and it went from about .5-1 sec to execute to now being like 5+ seconds. Is that what you'd expect? It's important to note that the order by is 4 columns, plus a derived value from a case statement.Joe Essey– Joe Essey2016年10月04日 22:21:18 +00:00Commented Oct 4, 2016 at 22:21
-
Indexes would help but which index to add depends on the ordering condition. For example, if you want to find the team with the latest
start_date
, an index on(player_id, start_date)
would be fine.ypercubeᵀᴹ– ypercubeᵀᴹ2016年10月04日 22:30:48 +00:00Commented Oct 4, 2016 at 22:30 -
Which query to choose will depend on the distribution of your data. How many rows do the 2 tables have? Of course you can try and test performance of all the different ways to write this (there are more!)ypercubeᵀᴹ– ypercubeᵀᴹ2016年10月04日 22:31:51 +00:00Commented Oct 4, 2016 at 22:31
Explore related questions
See similar questions with these tags.