How does the ON predicate of a Postgres LATERAL JOIN work?
Let me clarify the question a bit. I've read the official documentation and a bunch of articles about this kind of JOIN. As far as I understand, it is a foreach loop with a correlated subquery inside: it iterates over all rows of a table A, lets the correlated subquery B reference columns of the "current" row, and joins the result set of B to that "current" row of A. If B returns 1 row there is only one pair, and if B returns N rows there are N pairs, each repeating the "current" row of A. The same behavior as in usual JOINs.
But why is an ON predicate needed? As I see it, in usual JOINs we use ON because we have a Cartesian product of 2 tables to filter, which is not the case for a LATERAL JOIN, which produces the resulting pairs directly. In my experience as a developer I've only seen CROSS JOIN LATERAL and LEFT JOIN LATERAL () ON TRUE (the latter looks quite clumsy, though), but one day a colleague showed me
SELECT
r.acceptance_status, count(*) as count
FROM route r
LEFT JOIN LATERAL (
SELECT rts.route_id, array_agg(rts.shipment_id) shipment_ids
FROM route_to_shipment rts
where rts.route_id = r.route_id
GROUP BY rts.route_id
) rts using (route_id)
and this blew my mind. Why using (route_id)? We already have where rts.route_id = r.route_id inside the subquery! Maybe I misunderstand the mechanics of LATERAL joins?
2 Answers
Short answer: LEFT JOIN requires a join condition, as opposed to CROSS JOIN. Basics in the manual.
But the join condition can still make sense to filter which rows to attach on the right side after having computed a set in the lateral subquery. Like:
SELECT r.acceptance_status
, count(*) AS count_routes
, count(rts.shipment_ids) AS count_routes_with_more_than_one_shipment
FROM route r
LEFT JOIN LATERAL (
SELECT array_agg(rts.shipment_id) shipment_ids
, count(*) AS shipments
FROM route_to_shipment rts
WHERE rts.route_id = r.route_id
-- GROUP BY rts.route_id -- just noise
) rts ON shipments > 1 -- !!!
GROUP BY r.acceptance_status;
This returns all rows from table route, but only attaches shipment_ids where more than one related row in table route_to_shipment is found.
There is no need to add rts.route_id to the SELECT list of the subquery.
GROUP BY rts.route_id is just noise after WHERE rts.route_id = r.route_id.
And I am still generating the array shipment_ids in vain, like your original.
Also demonstrating different results for count(*) vs. count(shipment_ids).
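To make the effect of the ON condition visible row by row, here is a minimal sketch without the outer aggregation. The sample data is hypothetical (not from the question): route 1 has two shipments, route 2 has one, route 3 has none. The CTEs only stand in for the real tables.

```sql
-- Hypothetical sample data standing in for the real tables.
WITH route(route_id, acceptance_status) AS (
   VALUES (1, 'accepted'), (2, 'accepted'), (3, 'pending')
), route_to_shipment(route_id, shipment_id) AS (
   VALUES (1, 101), (1, 102), (2, 201)
)
SELECT r.route_id, r.acceptance_status, rts.shipment_ids, rts.shipments
FROM   route r
LEFT   JOIN LATERAL (
   -- Aggregate without GROUP BY: always exactly one row per route.
   SELECT array_agg(s.shipment_id ORDER BY s.shipment_id) AS shipment_ids
        , count(*) AS shipments
   FROM   route_to_shipment s
   WHERE  s.route_id = r.route_id
   ) rts ON rts.shipments > 1  -- decides whether that one row is attached
ORDER  BY r.route_id;

--  route_id | acceptance_status | shipment_ids | shipments
-- ----------+-------------------+--------------+-----------
--         1 | accepted          | {101,102}    |         2
--         2 | accepted          |              |
--         3 | pending           |              |
```

Every route survives, but for routes 2 and 3 the ON condition is false, so the LEFT JOIN fills the right side with NULL. That is why count(*) and count(shipment_ids) differ once you group by acceptance_status.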
The join condition cannot move to the WHERE clause, where it would have a different effect. You might add a HAVING clause to the subquery, though:
SELECT r.acceptance_status
, count(*) AS ct_routes
, count(rts.shipment_ids) AS ct_routes_with_more_than_1_shipment
FROM route r
LEFT JOIN LATERAL (
SELECT array_agg(rts.shipment_id) shipment_ids
FROM route_to_shipment rts
WHERE rts.route_id = r.route_id
HAVING count(*) > 1 -- !!!
) rts ON true
GROUP BY r.acceptance_status;
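For comparison, a small sketch of what happens when the same condition is moved to the outer WHERE clause instead. The sample data is again hypothetical (route 1 with two shipments, route 2 with one, route 3 with none).

```sql
-- Hypothetical sample data standing in for the real tables.
WITH route(route_id, acceptance_status) AS (
   VALUES (1, 'accepted'), (2, 'accepted'), (3, 'pending')
), route_to_shipment(route_id, shipment_id) AS (
   VALUES (1, 101), (1, 102), (2, 201)
)
SELECT r.route_id, rts.shipments
FROM   route r
LEFT   JOIN LATERAL (
   SELECT count(*) AS shipments
   FROM   route_to_shipment s
   WHERE  s.route_id = r.route_id
   ) rts ON true
WHERE  rts.shipments > 1  -- filters the joined rows, not just the right side
ORDER  BY r.route_id;

--  route_id | shipments
-- ----------+-----------
--         1 |         2
```

Routes 2 and 3 disappear from the result entirely, so the query effectively behaves like an inner join. With the condition in ON (or as HAVING inside the subquery) they stay, with NULL on the right side.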
But there are lateral subqueries without aggregation (so no HAVING clause possible). For your case:
SELECT r.acceptance_status
, count(*) AS ct_routes
, count(rts.shipment_ids) AS ct_routes_with_more_than_1_shipment
FROM route r
LEFT JOIN LATERAL (
SELECT ARRAY (
SELECT rts.shipment_id
FROM route_to_shipment rts
WHERE rts.route_id = r.route_id
) AS shipment_ids
) rts ON cardinality(shipment_ids) > 1 -- !!!
GROUP BY r.acceptance_status;
This only makes sense if we are actually going to use that array, of course. In that case, an array constructor is probably optimal for your query anyway.
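One detail worth knowing when switching between the two forms: for an empty input, array_agg() returns NULL, while the ARRAY constructor returns an empty array. A quick sketch, using generate_series(1, 0) merely as a convenient source of zero rows:

```sql
SELECT (SELECT array_agg(x) FROM generate_series(1, 0) AS g(x)) IS NULL AS agg_is_null
     , ARRAY(SELECT x FROM generate_series(1, 0) AS g(x))              AS arr_ctor
     , cardinality(ARRAY(SELECT x FROM generate_series(1, 0) AS g(x))) AS arr_card;

--  agg_is_null | arr_ctor | arr_card
-- -------------+----------+----------
--  t           | {}       |        0
```

So with the ARRAY constructor the subquery always yields a non-null array, and it is the ON cardinality(shipment_ids) > 1 condition that turns the short ones into NULL on the right side of the LEFT JOIN.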
CREATE TABLE ta (aid INT, a INT);
CREATE TABLE tb (aid INT, b INT);
INSERT INTO ta VALUES (1,10),(2,20);
INSERT INTO tb VALUES (1,100),(1,200);
SELECT * FROM ta LEFT JOIN LATERAL (SELECT * FROM tb WHERE tb.aid=ta.aid) tb ON true;
aid | a | aid | b
-----+----+------+------
1 | 10 | 1 | 100
1 | 10 | 1 | 200
2 | 20 | Null | Null
SELECT * FROM ta LEFT JOIN LATERAL (SELECT * FROM tb) tb USING (aid);
aid | a | b
-----+----+------
1 | 10 | 100
1 | 10 | 200
2 | 20 | Null
The USING (columns) clause does not duplicate the specified columns in the result set, whereas ON (ta.column=tb.column) does. Here the duplicated column is "aid". For a standard JOIN on equality the two columns are equal, so the duplicate is useless and USING is preferable. It is also more readable. For an outer JOIN (RIGHT, LEFT, FULL) you may want both columns in order to know whether one of them is NULL.
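As a small sketch of that last point: even with USING, both original columns remain reachable through qualified names, next to the merged one. The CTEs below just reproduce the ta and tb sample rows; the alias t and the output column names are made up for illustration.

```sql
WITH ta(aid, a) AS (VALUES (1, 10), (2, 20))
   , tb(aid, b) AS (VALUES (1, 100), (1, 200))
SELECT ta.aid AS ta_aid      -- left column
     , t.aid  AS tb_aid      -- right column, NULL when nothing matched
     , aid    AS merged_aid  -- the single column produced by USING
     , ta.a, t.b
FROM   ta
LEFT   JOIN LATERAL (SELECT * FROM tb WHERE tb.aid = ta.aid) t USING (aid)
ORDER  BY ta.aid, t.b;

--  ta_aid | tb_aid | merged_aid | a  |  b
-- --------+--------+------------+----+-----
--       1 |      1 |          1 | 10 | 100
--       1 |      1 |          1 | 10 | 200
--       2 |        |          2 | 20 |
```

Note that the WHERE inside the lateral subquery already enforces tb.aid = ta.aid, so here USING (aid) filters nothing further; it only satisfies the syntax of LEFT JOIN and merges the column in SELECT *.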
If you want a CROSS JOIN (no ON condition):
SELECT * FROM ta CROSS JOIN LATERAL (SELECT * FROM tb WHERE tb.aid=ta.aid) tb;
You can also use a JOIN and move some of the conditions that would be in the WHERE of the LATERAL subquery into the ON () clause; the result is the same:
SELECT * FROM ta JOIN LATERAL (SELECT * FROM tb WHERE ...) tb ON (tb.aid=ta.aid);
But there is no CROSS LEFT JOIN, so if you want a LEFT JOIN LATERAL you have to explicitly state LEFT JOIN, and that requires the ON clause.
SELECT * FROM ta JOIN LATERAL (SELECT * FROM tb WHERE tb.aid=ta.aid) tb ON true WHERE ta.aid<10;
Indeed, in the case of a LATERAL join, the ON clause can be superfluous.
count(*) is suspicious. What do you want to count exactly?