
Here's a simplified version of the queries I am running:

select
  ...
from
  editions
  join edition_events on edition_events.edition_id = editions.id
  join events on events.id = edition_events.event_id
  join (
    select
      event_id,
      array_agg(date_start),
      array_agg(date_end)
    from
      event_dates
    group by
      event_id
  ) as dates on dates.event_id = events.id
where
  editions.id = ?;

These queries ran great with my test data (~1500 events). Now that I'm starting to use a copy of the real data (~400,000 events, ~500,000 event_dates), my application is timing out before I get results back from the database. I have added indexes on all relevant columns, which only helped a little bit. I've narrowed down the problem to the subqueries that use aggregation.
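
For reference, the indexes I added are roughly these (simplified; the real names are omitted):

CREATE INDEX ON edition_events (edition_id);  -- FK side of the join to editions
CREATE INDEX ON edition_events (event_id);    -- FK side of the join to events
CREATE INDEX ON event_dates (event_id);       -- FK side of the join to events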

Now the real version of these subqueries does things like time zone manipulation or a count instead of array_agg, so using a trigger to update a counter column is not practical in many instances. The dates associated with events must be individual rows rather than stored as an array since they have a foreign key relationship with other tables.

What would be my best bet for improving performance here? I understand that the subquery is being evaluated first, but I don't know which dates I actually need to look at until I've done the joins. I have looked at lateral joins, but I am uncertain if they would help here (and most information I can find is along the lines of "look, new feature!"). Since I am still using 9.2, I don't want to ask to upgrade to 9.3+ if it won't actually solve my problem.

asked Dec 9, 2013 at 15:49

1 Answer


Your approach ("aggregate first, join later") generally performs best when you need to retrieve the whole table or large parts of it, when you can restrict to the relevant rows cheaply inside the subquery, or when the aggregated table is small.

For only a few rows in the result, this gets disproportionately expensive with big tables, since the whole table (event_dates in your case) has to be processed. In that case, it's better to reverse the paradigm to "join first, aggregate later". The equivalent query would then be:

SELECT e.id
 , array_agg(ed.date_start) AS starts
 , array_agg(ed.date_end) AS ends
 , ...
FROM editions e
JOIN edition_events ee ON ee.edition_id = e.id
JOIN events ev ON ev.id = ee.event_id
JOIN event_dates ed ON ed.event_id = ev.id
WHERE e.id = ?
GROUP BY e.id;

This works with any version of Postgres. I also used table aliases to simplify the syntax.
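
Since some of your real subqueries use count() instead of array_agg(), the same "join first, aggregate later" rewrite applies there as well. A rough sketch (the column alias is only for illustration):

SELECT e.id
     , count(ed.event_id) AS date_count               -- number of matching event_dates rows
FROM   editions e
JOIN   edition_events ee ON ee.edition_id = e.id
JOIN   events ev ON ev.id = ee.event_id
LEFT   JOIN event_dates ed ON ed.event_id = ev.id     -- LEFT JOIN keeps events without dates (count = 0)
WHERE  e.id = ?
GROUP  BY e.id;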

I don't see how LATERAL (Postgres 9.3+) could help this particular query. You would use it to reference a column from the left side of the join tree that you couldn't access on the same query level otherwise, often to produce multiple rows to join to. Neither is the case here. What you are thinking of can be implemented with plain old correlated subqueries, but those typically perform worse.
Don't take my word for it, though. Run a quick test with EXPLAIN ANALYZE for your case (with your real-life data); then you'll know for sure.
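
For illustration (not part of the original answer), the correlated-subquery variant would look something like the sketch below; wrapping both forms in EXPLAIN ANALYZE against the real data shows which plan the planner actually prefers:

EXPLAIN ANALYZE
SELECT e.id
     -- each subquery is evaluated per result row and probes event_dates separately
     , (SELECT array_agg(ed.date_start) FROM event_dates ed WHERE ed.event_id = ev.id) AS starts
     , (SELECT array_agg(ed.date_end)   FROM event_dates ed WHERE ed.event_id = ev.id) AS ends
FROM   editions e
JOIN   edition_events ee ON ee.edition_id = e.id
JOIN   events ev ON ev.id = ee.event_id
WHERE  e.id = ?;

Hitting event_dates twice per row is part of why this form usually loses to the join-and-group version above.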

answered Dec 9, 2013 at 16:54
  • Would the availability of lateral joins change anything? Or is this still the most efficient way no matter what? Commented Dec 9, 2013 at 17:14
  • @cimmanon: I added a paragraph for that. Commented Dec 9, 2013 at 17:43
