I want to select the latest row per track based on time, so:
SELECT DISTINCT ON (track) *
FROM eco.tracks
WHERE id > (SELECT id
 FROM eco.tracks
 WHERE time_track < ((SELECT time_track FROM eco.tracks ORDER BY id DESC LIMIT 1) - interval '300 seconds')
 ORDER BY id DESC
 LIMIT 1)
ORDER BY track, time_track DESC;
It takes 20 seconds, which is too slow. If I replace id with the actual value, the query takes only 2 ms:
SELECT DISTINCT ON (track) *
FROM eco.tracks WHERE id > 48000000
ORDER BY track, time_track DESC;
The subquery alone also takes only 2 ms:
SELECT id
FROM eco.tracks
WHERE time_track < ((SELECT time_track FROM eco.tracks ORDER BY id DESC LIMIT 1) - interval '300 seconds')
ORDER BY id DESC
LIMIT 1;
Why the big difference?
Comment (Erwin Brandstetter, Jun 23, 2019): Please consider the instructions here for Postgres performance questions: dba.stackexchange.com/tags/postgresql-performance/info
1 Answer
My crystal ball says the two queries end up using different query plans. You can verify by running EXPLAIN on each. More useful details with EXPLAIN (ANALYZE, BUFFERS).
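To compare the two plans, wrap each query accordingly. A minimal sketch against the fast variant from the question (the literal id value comes from the question; your value will differ):

EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (track) *
FROM eco.tracks
WHERE id > 48000000
ORDER BY track, time_track DESC;

In the output, watch whether Postgres chooses a sequential scan over the whole table or an index / bitmap index scan on the primary key, and compare estimated vs. actual row counts.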
The subquery gets the greatest id with time_track more than 5 minutes before that of the greatest id overall in table eco.tracks. Assuming a strong positive correlation between id and time_track, humans get a rough idea of the work to be done here: it only concerns the (probably very few) latest rows within the last 5 minutes in the table. But Postgres has no idea at the time of planning the query.
(Aside: It may be a lingering logic bug to assume that serial IDs are in sync with timestamps in the column time_track. Not enough information to say more.)
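A quick sanity check for that assumption, sketched against the table from the question: fetch the newest row by id and the newest row by time_track. If the two queries return different rows, IDs and timestamps are out of sync.

SELECT id, time_track FROM eco.tracks ORDER BY id DESC LIMIT 1; -- newest by id
SELECT id, time_track FROM eco.tracks ORDER BY time_track DESC NULLS LAST LIMIT 1; -- newest by time

This only spot-checks the top end of the table; it does not prove the correlation holds throughout.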
For the first (slow) query, Postgres has to prepare a plan that best copes with any value that might come out of the subquery. That may very well be a sequential scan, which turns out to be a bad choice for only the few latest rows.
For the second (fast) query, Postgres knows from the given value what to expect. It can start by identifying the few relevant rows with an index or bitmap index scan on an index on (id), which undoubtedly exists (again assuming that's the PK). The rest is peanuts.
Possible solutions
There are many ways. Anything to get Postgres to pick the appropriate query plan ...
1. Two separate queries
Dead simple. Build the second query with the actual value returned from the first. The downside: two round trips to the database server, incurring network latency and overhead twice.
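In psql this could look like the following sketch, using \gset to capture the boundary id into a psql variable (application code would do the equivalent with two statements and a bind parameter; `boundary_id` is just an illustrative name):

-- Step 1: fetch the boundary id (subquery from the question)
SELECT id AS boundary_id
FROM eco.tracks
WHERE time_track < ((SELECT time_track FROM eco.tracks ORDER BY id DESC LIMIT 1) - interval '300 seconds')
ORDER BY id DESC
LIMIT 1 \gset

-- Step 2: plug the now-known value into the main query
SELECT DISTINCT ON (track) *
FROM eco.tracks
WHERE id > :boundary_id
ORDER BY track, time_track DESC;

Because the value is a literal by the time the second query is planned, Postgres can pick the index scan just like in the fast variant above.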
2. Dynamic SQL
CREATE OR REPLACE FUNCTION f_latest_tracks(_intv interval = '300 seconds')
 RETURNS SETOF eco.tracks AS
$func$
BEGIN
 -- EXECUTE re-plans on every call, so the planner sees the actual id value
 RETURN QUERY EXECUTE
 'SELECT DISTINCT ON (track) *
 FROM eco.tracks
 WHERE id > 1ドル
 ORDER BY track, time_track DESC'
 USING (
 SELECT id
 FROM eco.tracks
 WHERE time_track < (SELECT time_track - _intv FROM eco.tracks ORDER BY id DESC LIMIT 1)
 ORDER BY id DESC
 LIMIT 1
 );
END
$func$ LANGUAGE plpgsql;
Call:
SELECT * FROM f_latest_tracks();
SELECT * FROM f_latest_tracks('10 minutes');
Why? The manual about Executing Dynamic Commands:

"The important difference is that EXECUTE will re-plan the command on each execution, generating a plan that is specific to the current parameter values; whereas PL/pgSQL may otherwise create a generic plan and cache it for re-use. In situations where the best plan depends strongly on the parameter values, it can be helpful to use EXECUTE to positively ensure that a generic plan is not selected."
Downside: You need to create a server-side function, introducing a dependency on table eco.tracks. And you need to be comfortable with PL/pgSQL, of course.
Using a parameter default for convenience.
Comment (delkov, Jun 24, 2019): Just a brilliant answer! Can you suggest some manuals/courses/books for PostgreSQL? Thank you so much!
Comment (Erwin Brandstetter, Jun 24, 2019): I keep suggesting the manual: It's authoritative and mostly excellent. Enormous, though.