How to use index to speed up sorting in Postgres?

Question 1

I have a table messages in a Postgres 9.4 database. Messages belong to feed_id, and have posted_at. Also, messages can have a parent message (in case of replies).

 Table "public.messages"
 Column | Type | Modifiers
------------------------------+-----------------------------+-----------
 message_id | character varying(255) | not null
 feed_id | integer |
 parent_id | character varying(255) |
 posted_at | timestamp without time zone |
 share_count | integer |
Indexes:
 "messages_pkey" PRIMARY KEY, btree (message_id)
 "index_messages_on_feed_id_posted_at" btree (feed_id, posted_at DESC NULLS LAST)

I want to return all messages ordered by share_count, but for each parent_id, I only want to return one message. ie, if multiple messages have the same parent_id, then only the latest one (posted_at) is returned. The parent_id can be null, messages with null parent_id should all return.

The query I used is:

WITH filtered_messages AS (SELECT * 
 FROM messages
 WHERE feed_id IN (7) 
 AND (posted_at >= '2015-01-01 04:00:00.000000') 
 AND (posted_at < '2015-04-28 04:00:00.000000'))
SELECT *
FROM (SELECT DISTINCT ON(COALESCE(parent_id, message_id)) parent_id,
 message_id, 
 posted_at, 
 share_count
 FROM filtered_messages
 ORDER BY COALESCE(parent_id, message_id), posted_at DESC NULLS LAST
 ) messages
ORDER BY share_count DESC NULLS LAST, posted_at DESC NULLS LAST;

Here's an SQL fiddle with schema, exact query, and expected result.

The performance of the query is slow once the messages table gets big. I tried add multiple sorting indexes, but it does not seem to use the index.

Here's the EXPLAIN output.

How can I create a correct index?

Question 2

At a first glance, the ORDER BY in the subquery is totally useless. Furthermore, the linked plan cannot be a result of the posted query - there is no mention of metadata, for example.

Question 3

Your description does not cover the role of feed_id and posted_at and you did not mention metadata at all, which seems to be a JSON type? Please repair your question to make it consistent. You select > 500k rows in the CTE ... How many rows are in the table? What percentage of rows do you typically select in the CTE? What percentage of rows has parent_id IS NULL? Consider the info in the [postgresql-performance] tag for performance questions.

Question 4

Also important: How many rows for each parent_id? (min / avg / max)

Question 5

sorry, I was trying to make the question more clear by reducing some of the columns, share_count was actually in hstore metadata. Currently the messages table has 10 mil data, but increasing fast. I am think to separate into partition tables for each feed_id. Since I am only fetching per feed id. the percentage of parent_id null vs not null is about 60% / 40%. a typical fetch is around 1-2% of the table. ( around 100K messages) The performance for 100K is around 1s, but once gets to 500K+ it uses bitmap index and normally takes 10s.

Question 6

Query

This query should be substantially faster in any case:

SELECT parent_id, message_id, posted_at, share_count
FROM messages
WHERE feed_id = 7
AND posted_at >= '2015-01-01 4:0:0'
AND posted_at < '2015-04-28 4:0:0'
AND parent_id IS NULL -- match index condition
UNION ALL
(
SELECT DISTINCT ON(parent_id)
 parent_id, message_id, posted_at, share_count
FROM messages
WHERE feed_id = 7
AND posted_at >= '2015-01-01 4:0:0'
AND posted_at < '2015-04-28 4:0:0'
AND parent_id IS NOT NULL -- match index condition
ORDER BY parent_id, posted_at DESC NULLS LAST
)
ORDER BY share_count DESC NULLS LAST, posted_at DESC NULLS LAST;

The CTE does nothing here that a plain subquery could not deliver also. And a CTE introduces an optimization barrier since it is executed separately and its result is materialized.
Update: that changes in Postgres 12. See:

How to prevent PostgreSQL from rewriting OUTER JOIN queries?

You had one more subquery-level than you actually need.

The expression (COALESCE(parent_id, message_id) is not compatible with a plain index. You would need an index on that expression. But that may not be very useful either, depending on data distribution. See links below.

Splitting the simple case of parent_id IS NULL into a separate SELECT may or may not deliver the optimum. Especially not, if that's a rare case anyway, in which case a combined query with an index on (COALESCE(parent_id, message_id) may perform better. Other considerations apply ...

Indices

Especially when supported with these indices:

CREATE INDEX messages_idx_null ON messages (feed_id, posted_at DESC NULLS LAST, share_count DESC NULLS LAST)
INCLUDE (parent_id, message_id)
WHERE parent_id IS NULL;
CREATE INDEX messages_idx_notnull ON messages (feed_id, posted_at DESC NULLS LAST, share_count DESC NULLS LAST)
INCLUDE (parent_id, message_id)
WHERE parent_id IS NOT NULL;

The two partial indices cover the whole table together and are about the same size together as a single total index.

Covering indexes (with the INCLUDE clause) were added with Postgres 11. See:

Do covering indexes in PostgreSQL help JOIN columns?

The INCLUDE part only make sense if you get index-only scans out of it. Else remove it from both indices.

db<>fiddle here
_{Old sqlfiddle}

Depending on missing details, DISTINCT ON may or may not be the best query technique for the purpose. Details here:

Select first row in each GROUP BY group?

Possibly faster alternatives here:

Optimize GROUP BY query to retrieve latest record per user

Question 7

So far, for PostgreSQL, my understanding is that the only clauses that will trigger the index being used are WHERE, ORDER BY and JOIN query clauses (helpful article). Does that sound right?

Question 8

@Akaisteph7 Basically, yes. But there are additional use cases. Like, for a full count. Or a small selection of columns from a covering index in an index-only scan. Or checking FK costraints for any writes ...

Question 9

Thanks. 1) Full count - do you have a link to more info? couldn't find anything on this. 2) Covering index - so in this case, an index could just be used from a bare SELECT. And 3) FK constraints - it seems all RDBMSs using SQL should auto-create indices for FKs so that shouldn't be something a dev has to actually worry about then.

Question 10

@Akaisteph7: 1) Try SELECT count(*) FROM tbl; on any big table with a small PK index and its visibility map up to date (recently vacuumed). 2) Yes. 3) Postgres requires that the target of a FK constraint has a PK or UNIQUE constraint, so the target is always indexed. Indexing the source is up to the user, and I had plenty of cases where I didn't need an index on the source column. 4) Please ask any additional questions as question. Comments are not the place.

score 14 · Answer 1 · 2015-04-28 04:16:33Z

Query

This query should be substantially faster in any case:

SELECT parent_id, message_id, posted_at, share_count
FROM messages
WHERE feed_id = 7
AND posted_at >= '2015-01-01 4:0:0'
AND posted_at < '2015-04-28 4:0:0'
AND parent_id IS NULL -- match index condition
UNION ALL
(
SELECT DISTINCT ON(parent_id)
 parent_id, message_id, posted_at, share_count
FROM messages
WHERE feed_id = 7
AND posted_at >= '2015-01-01 4:0:0'
AND posted_at < '2015-04-28 4:0:0'
AND parent_id IS NOT NULL -- match index condition
ORDER BY parent_id, posted_at DESC NULLS LAST
)
ORDER BY share_count DESC NULLS LAST, posted_at DESC NULLS LAST;

The CTE does nothing here that a plain subquery could not deliver also. And a CTE introduces an optimization barrier since it is executed separately and its result is materialized.
Update: that changes in Postgres 12. See:

How to prevent PostgreSQL from rewriting OUTER JOIN queries?

You had one more subquery-level than you actually need.

The expression (COALESCE(parent_id, message_id) is not compatible with a plain index. You would need an index on that expression. But that may not be very useful either, depending on data distribution. See links below.

Splitting the simple case of parent_id IS NULL into a separate SELECT may or may not deliver the optimum. Especially not, if that's a rare case anyway, in which case a combined query with an index on (COALESCE(parent_id, message_id) may perform better. Other considerations apply ...

Indices

Especially when supported with these indices:

CREATE INDEX messages_idx_null ON messages (feed_id, posted_at DESC NULLS LAST, share_count DESC NULLS LAST)
INCLUDE (parent_id, message_id)
WHERE parent_id IS NULL;
CREATE INDEX messages_idx_notnull ON messages (feed_id, posted_at DESC NULLS LAST, share_count DESC NULLS LAST)
INCLUDE (parent_id, message_id)
WHERE parent_id IS NOT NULL;

The two partial indices cover the whole table together and are about the same size together as a single total index.

Covering indexes (with the INCLUDE clause) were added with Postgres 11. See:

Do covering indexes in PostgreSQL help JOIN columns?

The INCLUDE part only make sense if you get index-only scans out of it. Else remove it from both indices.

db<>fiddle here
_{Old sqlfiddle}

Depending on missing details, DISTINCT ON may or may not be the best query technique for the purpose. Details here:

Select first row in each GROUP BY group?

Possibly faster alternatives here:

Optimize GROUP BY query to retrieve latest record per user

So far, for PostgreSQL, my understanding is that the only clauses that will trigger the index being used are WHERE, ORDER BY and JOIN query clauses (helpful article). Does that sound right?
@Akaisteph7 Basically, yes. But there are additional use cases. Like, for a full count. Or a small selection of columns from a covering index in an index-only scan. Or checking FK costraints for any writes ...
Thanks. 1) Full count - do you have a link to more info? couldn't find anything on this. 2) Covering index - so in this case, an index could just be used from a bare SELECT. And 3) FK constraints - it seems all RDBMSs using SQL should auto-create indices for FKs so that shouldn't be something a dev has to actually worry about then.
@Akaisteph7: 1) Try SELECT count(*) FROM tbl; on any big table with a small PK index and its visibility map up to date (recently vacuumed). 2) Yes. 3) Postgres requires that the target of a FK constraint has a PK or UNIQUE constraint, so the target is always indexed. Indexing the source is up to the user, and I had plenty of cases where I didn't need an index on the source column. 4) Please ask any additional questions as question. Comments are not the place.

Stack Exchange Network

How to use index to speed up sorting in Postgres?

1 Answer 1

Query

Indices

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

How to use index to speed up sorting in Postgres?

1 Answer 1

Query

Indices

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions