I have a table messages
in a Postgres 9.4 database. Messages belong to feed_id
, and have posted_at
. Also, messages can have a parent message (in case of replies).
Table "public.messages"
Column | Type | Modifiers
------------------------------+-----------------------------+-----------
message_id | character varying(255) | not null
feed_id | integer |
parent_id | character varying(255) |
posted_at | timestamp without time zone |
share_count | integer |
Indexes:
"messages_pkey" PRIMARY KEY, btree (message_id)
"index_messages_on_feed_id_posted_at" btree (feed_id, posted_at DESC NULLS LAST)
I want to return all messages ordered by share_count
, but for each parent_id
, I only want to return one message. ie, if multiple messages have the same parent_id
, then only the latest one (posted_at
) is returned. The parent_id
can be null, messages with null parent_id
should all return.
The query I used is:
WITH filtered_messages AS (SELECT *
FROM messages
WHERE feed_id IN (7)
AND (posted_at >= '2015-01-01 04:00:00.000000')
AND (posted_at < '2015-04-28 04:00:00.000000'))
SELECT *
FROM (SELECT DISTINCT ON(COALESCE(parent_id, message_id)) parent_id,
message_id,
posted_at,
share_count
FROM filtered_messages
ORDER BY COALESCE(parent_id, message_id), posted_at DESC NULLS LAST
) messages
ORDER BY share_count DESC NULLS LAST, posted_at DESC NULLS LAST;
Here's an SQL fiddle with schema, exact query, and expected result.
The performance of the query is slow once the messages
table gets big. I tried add multiple sorting indexes, but it does not seem to use the index.
Here's the EXPLAIN
output.
How can I create a correct index?
1 Answer 1
Query
This query should be substantially faster in any case:
SELECT parent_id, message_id, posted_at, share_count
FROM messages
WHERE feed_id = 7
AND posted_at >= '2015-01-01 4:0:0'
AND posted_at < '2015-04-28 4:0:0'
AND parent_id IS NULL -- match index condition
UNION ALL
(
SELECT DISTINCT ON(parent_id)
parent_id, message_id, posted_at, share_count
FROM messages
WHERE feed_id = 7
AND posted_at >= '2015-01-01 4:0:0'
AND posted_at < '2015-04-28 4:0:0'
AND parent_id IS NOT NULL -- match index condition
ORDER BY parent_id, posted_at DESC NULLS LAST
)
ORDER BY share_count DESC NULLS LAST, posted_at DESC NULLS LAST;
The CTE does nothing here that a plain subquery could not deliver also. And a CTE introduces an optimization barrier since it is executed separately and its result is materialized.
Update: that changes in Postgres 12. See:
You had one more subquery-level than you actually need.
The expression (COALESCE(parent_id, message_id)
is not compatible with a plain index. You would need an index on that expression. But that may not be very useful either, depending on data distribution. See links below.
Splitting the simple case of parent_id IS NULL
into a separate SELECT
may or may not deliver the optimum. Especially not, if that's a rare case anyway, in which case a combined query with an index on (COALESCE(parent_id, message_id)
may perform better. Other considerations apply ...
Indices
Especially when supported with these indices:
CREATE INDEX messages_idx_null ON messages (feed_id, posted_at DESC NULLS LAST, share_count DESC NULLS LAST)
INCLUDE (parent_id, message_id)
WHERE parent_id IS NULL;
CREATE INDEX messages_idx_notnull ON messages (feed_id, posted_at DESC NULLS LAST, share_count DESC NULLS LAST)
INCLUDE (parent_id, message_id)
WHERE parent_id IS NOT NULL;
The two partial indices cover the whole table together and are about the same size together as a single total index.
Covering indexes (with the INCLUDE
clause) were added with Postgres 11. See:
The INCLUDE
part only make sense if you get index-only scans out of it. Else remove it from both indices.
Depending on missing details, DISTINCT ON
may or may not be the best query technique for the purpose. Details here:
Possibly faster alternatives here:
-
So far, for PostgreSQL, my understanding is that the only clauses that will trigger the index being used are
WHERE
,ORDER BY
andJOIN
query clauses (helpful article). Does that sound right?Akaisteph7– Akaisteph72023年08月11日 22:59:49 +00:00Commented Aug 11, 2023 at 22:59 -
@Akaisteph7 Basically, yes. But there are additional use cases. Like, for a full count. Or a small selection of columns from a covering index in an index-only scan. Or checking FK costraints for any writes ...Erwin Brandstetter– Erwin Brandstetter2023年08月12日 05:09:52 +00:00Commented Aug 12, 2023 at 5:09
-
Thanks. 1) Full count - do you have a link to more info? couldn't find anything on this. 2) Covering index - so in this case, an index could just be used from a bare
SELECT
. And 3) FK constraints - it seems all RDBMSs using SQL should auto-create indices for FKs so that shouldn't be something a dev has to actually worry about then.Akaisteph7– Akaisteph72023年08月14日 14:12:50 +00:00Commented Aug 14, 2023 at 14:12 -
@Akaisteph7: 1) Try
SELECT count(*) FROM tbl;
on any big table with a small PK index and its visibility map up to date (recently vacuumed). 2) Yes. 3) Postgres requires that the target of a FK constraint has a PK or UNIQUE constraint, so the target is always indexed. Indexing the source is up to the user, and I had plenty of cases where I didn't need an index on the source column. 4) Please ask any additional questions as question. Comments are not the place.Erwin Brandstetter– Erwin Brandstetter2023年08月16日 00:40:49 +00:00Commented Aug 16, 2023 at 0:40
Explore related questions
See similar questions with these tags.
ORDER BY
in the subquery is totally useless. Furthermore, the linked plan cannot be a result of the posted query - there is no mention ofmetadata
, for example.feed_id
andposted_at
and you did not mentionmetadata
at all, which seems to be a JSON type? Please repair your question to make it consistent. You select > 500k rows in the CTE ... How many rows are in the table? What percentage of rows do you typically select in the CTE? What percentage of rows hasparent_id IS NULL
? Consider the info in the [postgresql-performance] tag for performance questions.parent_id
? (min / avg / max)metadata
. Currently the messages table has 10 mil data, but increasing fast. I am think to separate into partition tables for each feed_id. Since I am only fetching per feed id. the percentage of parent_id null vs not null is about 60% / 40%. a typical fetch is around 1-2% of the table. ( around 100K messages) The performance for 100K is around 1s, but once gets to 500K+ it uses bitmap index and normally takes 10s.