I want to select the latest row per track based on time, so:
SELECT DISTINCT ON (track) *
FROM eco.tracks
WHERE id > (SELECT id
 FROM eco.tracks
 WHERE time_track < ((SELECT time_track FROM eco.tracks ORDER BY id DESC LIMIT 1) - interval '300 seconds')
 ORDER BY id DESC
 LIMIT 1)
ORDER BY track, time_track DESC;
It takes 20 seconds, which is too slow. If I replace id with the actual value, the query takes only 2 ms:
SELECT DISTINCT ON (track) *
FROM eco.tracks WHERE id > 48000000
ORDER BY track, time_track DESC;
The subquery alone also takes only 2 ms:
SELECT id
FROM eco.tracks
WHERE time_track < ((SELECT time_track FROM eco.tracks ORDER BY id DESC LIMIT 1) - interval '300 seconds')
ORDER BY id DESC
LIMIT 1;
Why the big difference?
Comment (Erwin Brandstetter, Jun 23, 2019): Please consider the instructions here for Postgres performance questions: dba.stackexchange.com/tags/postgresql-performance/info
1 Answer
My crystal ball says the two queries end up using different query plans. You can verify by running EXPLAIN on each. More useful details with EXPLAIN (ANALYZE, BUFFERS).
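To compare the two plans, wrap each query accordingly. A minimal sketch against the fast variant from the question (the literal id value comes from the question; your value will differ):

EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (track) *
FROM eco.tracks
WHERE id > 48000000
ORDER BY track, time_track DESC;

In the output, watch whether Postgres chooses a sequential scan over the whole table or an index / bitmap index scan on the primary key, and compare estimated vs. actual row counts.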
The subquery gets the greatest id with time_track more than 5 minutes before that of the greatest id overall in table eco.tracks. Assuming a strong positive correlation between id and time_track, humans get a rough idea of the work to be done here: it only concerns the (probably very few) latest rows within the last 5 minutes in the table. But Postgres has no idea at the time of planning the query.
(Aside: It may be a lingering logic bug to assume that serial IDs are in sync with timestamps in the column time_track. Not enough information to say more.)
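A quick sanity check for that assumption, sketched against the table from the question: fetch the newest row by id and the newest row by time_track. If the two queries return different rows, IDs and timestamps are out of sync.

SELECT id, time_track FROM eco.tracks ORDER BY id DESC LIMIT 1; -- newest by id
SELECT id, time_track FROM eco.tracks ORDER BY time_track DESC NULLS LAST LIMIT 1; -- newest by time

This only spot-checks the top end of the table; it does not prove the correlation holds throughout.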
For the first (slow) query, Postgres has to prepare a plan that best copes with any value that might come out of the subquery. That may very well be a sequential scan, which turns out to be a bad choice for only the few latest rows.
For the second (fast) query, Postgres knows from the given value what to expect. It can start by identifying the few relevant rows with an index or bitmap index scan on an index on (id), which undoubtedly exists (again assuming that's the PK). The rest is peanuts.
Possible solutions
There are many ways. Anything to get Postgres to pick the appropriate query plan ...
1. Two separate queries
Dead simple. Build the second query with the actual value returned from the first. The downside: two round trips to the database server, incurring network latency and overhead twice.
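In psql this could look like the following sketch, using \gset to capture the boundary id into a psql variable (application code would do the equivalent with two statements and a bind parameter; `boundary_id` is just an illustrative name):

-- Step 1: fetch the boundary id (subquery from the question)
SELECT id AS boundary_id
FROM eco.tracks
WHERE time_track < ((SELECT time_track FROM eco.tracks ORDER BY id DESC LIMIT 1) - interval '300 seconds')
ORDER BY id DESC
LIMIT 1 \gset

-- Step 2: plug the now-known value into the main query
SELECT DISTINCT ON (track) *
FROM eco.tracks
WHERE id > :boundary_id
ORDER BY track, time_track DESC;

Because the value is a literal by the time the second query is planned, Postgres can pick the index scan just like in the fast variant above.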
2. Dynamic SQL
CREATE OR REPLACE FUNCTION f_latest_tracks(_intv interval = '300 seconds')
 RETURNS SETOF eco.tracks AS
$func$
BEGIN
 -- EXECUTE re-plans on every call, so the planner sees the actual id value
 RETURN QUERY EXECUTE
 'SELECT DISTINCT ON (track) *
 FROM eco.tracks
 WHERE id > 1ドル
 ORDER BY track, time_track DESC'
 USING (
 SELECT id
 FROM eco.tracks
 WHERE time_track < (SELECT time_track - _intv FROM eco.tracks ORDER BY id DESC LIMIT 1)
 ORDER BY id DESC
 LIMIT 1
 );
END
$func$ LANGUAGE plpgsql;
Call:
SELECT * FROM f_latest_tracks();
SELECT * FROM f_latest_tracks('10 minutes');
Why? The manual about Executing Dynamic Commands:

"The important difference is that EXECUTE will re-plan the command on each execution, generating a plan that is specific to the current parameter values; whereas PL/pgSQL may otherwise create a generic plan and cache it for re-use. In situations where the best plan depends strongly on the parameter values, it can be helpful to use EXECUTE to positively ensure that a generic plan is not selected."
Downside: You need to create a server-side function, introducing a dependency on table eco.tracks. And you need to be comfortable with PL/pgSQL, of course.
Using a parameter default for convenience.
Comment (delkov, Jun 24, 2019): Just a brilliant answer! Can you suggest some manuals/courses/books for PostgreSQL? Thank you so much!
Comment (Erwin Brandstetter, Jun 24, 2019): I keep suggesting the manual: It's authoritative and mostly excellent. Enormous, though.