
Here's a simplified version of the queries I am running:

select
  ...
from
  editions
  join edition_events on edition_events.edition_id = editions.id
  join events on events.id = edition_events.event_id
  join (
    select
      event_id,
      array_agg(date_start),
      array_agg(date_end)
    from
      event_dates
    group by
      event_id
  ) as dates on dates.event_id = events.id
where
  editions.id = ?;

These queries ran great with my test data (~1500 events). Now that I'm starting to use a copy of the real data (~400,000 events, ~500,000 event_dates), my application is timing out before I get results back from the database. I have added indexes on all relevant columns, which only helped a little bit. I've narrowed down the problem to the subqueries that use aggregation.
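
For reference, the indexes I added are roughly these (simplified; the real names are omitted):

CREATE INDEX ON edition_events (edition_id);  -- FK side of the join to editions
CREATE INDEX ON edition_events (event_id);    -- FK side of the join to events
CREATE INDEX ON event_dates (event_id);       -- FK side of the join to events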

Now the real version of these subqueries does things like time zone manipulation or a count instead of array_agg, so using a trigger to update a counter column is not practical in many instances. The dates associated with events must be individual rows rather than stored as an array since they have a foreign key relationship with other tables.

What would be my best bet for improving performance here? I understand that the subquery is being evaluated first, but I don't know which dates I actually need to look at until I've done the joins. I have looked at lateral joins, but I am uncertain if they would help here (and most information I can find is along the lines of "look, new feature!"). Since I am still using 9.2, I don't want to ask to upgrade to 9.3+ if it won't actually solve my problem.

asked Dec 9, 2013 at 15:49

1 Answer


Your approach ("aggregate first, join later") generally performs best when you need to retrieve the whole table or large parts of it, when you can restrict to the relevant rows cheaply inside the subquery, or when the aggregated table is small.

For only a few rows in the result, this gets disproportionately expensive with big tables, since the whole table (event_dates in your case) has to be processed. In that case, it's better to reverse the paradigm to "join first, aggregate later". The equivalent query would then be:

SELECT e.id
 , array_agg(ed.date_start) AS starts
 , array_agg(ed.date_end) AS ends
 , ...
FROM editions e
JOIN edition_events ee ON ee.edition_id = e.id
JOIN events ev ON ev.id = ee.event_id
JOIN event_dates ed ON ed.event_id = ev.id
WHERE e.id = ?
GROUP BY e.id;

This works with any version of Postgres. I also used table aliases to simplify the syntax.
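
Since some of your real subqueries use count() instead of array_agg(), the same "join first, aggregate later" rewrite applies there as well. A rough sketch (the column alias is only for illustration):

SELECT e.id
     , count(ed.event_id) AS date_count               -- number of matching event_dates rows
FROM   editions e
JOIN   edition_events ee ON ee.edition_id = e.id
JOIN   events ev ON ev.id = ee.event_id
LEFT   JOIN event_dates ed ON ed.event_id = ev.id     -- LEFT JOIN keeps events without dates (count = 0)
WHERE  e.id = ?
GROUP  BY e.id;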

I don't see how LATERAL (Postgres 9.3+) could help this particular query. You would use it to reference a column from the left side of the join tree that you couldn't access on the same query level otherwise, often to produce multiple rows to join to. Neither is the case here. What you are thinking of can be implemented with plain old correlated subqueries, but those typically perform worse.
Don't take my word for it, though. Run a quick test with EXPLAIN ANALYZE for your case (with your real-life data); then you'll know for sure.
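
For illustration (not part of the original answer), the correlated-subquery variant would look something like the sketch below; wrapping both forms in EXPLAIN ANALYZE against the real data shows which plan the planner actually prefers:

EXPLAIN ANALYZE
SELECT e.id
     -- each subquery is evaluated per result row and probes event_dates separately
     , (SELECT array_agg(ed.date_start) FROM event_dates ed WHERE ed.event_id = ev.id) AS starts
     , (SELECT array_agg(ed.date_end)   FROM event_dates ed WHERE ed.event_id = ev.id) AS ends
FROM   editions e
JOIN   edition_events ee ON ee.edition_id = e.id
JOIN   events ev ON ev.id = ee.event_id
WHERE  e.id = ?;

Hitting event_dates twice per row is part of why this form usually loses to the join-and-group version above.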

answered Dec 9, 2013 at 16:54
  • Would the availability of lateral joins change anything? Or is this still the most efficient way no matter what? Commented Dec 9, 2013 at 17:14
  • @cimmanon: I added a paragraph for that. Commented Dec 9, 2013 at 17:43
