My understanding is that you want to decide on what to include in your index based what will trigger the index being used.
The following considerations are what I am aware of, in order of relevance:
- (Most common) Any columns used with the following query clauses:
WHERE
,ORDER BY
,JOIN
- Index-only scans: getting data from a small selection of columns included in a covering index. In this case, data from an index could be used from a column just included in the
SELECT
clause, but the query will likely still contain a column in aWHERE
clause that should also be indexed. See PostgreSQL's implementation for more details. - Getting a full count: something like
SELECT count(*) FROM tbl;
(It's not clear to me why an index would be used for this one) - For the checking of Foreign Key constraints: for verifying a record exists before allowing it to be added as a Foreign Key, or for deleting/setting to null/etc. records in tables
WHERE colx_withFK = _id
if a record with id_id
is deleted in the source table, etc. PostgreSQL auto-creates indices for FKs, but other RDBMSs (like SQL Server) might not, so this one might not be something a developer has to actually worry about.
And for all of these considerations, of course, anything you execute often is more important than a less used query.
Question: Did I get something wrong? Is there anything else?
As details can be very different between SQL implementations, answers can focus on PostgreSQL.
1 Answer 1
There are a few things you forgot:
creating an index on an expression makes PostgreSQL collect statistics for the expression (from v15 on, you can alternatively use extended statistics for that purpose)
unique indexes can implement constraints
There are also partial indexes, where part of the WHERE
condition can be implemented as a WHERE
clause for the index.
But I think that you won't be able to come up with a useful list that "semi-automatically" enables you to index a database well without further considerations, like
Which queries are executed how often, and how important is their performance?
Is the benefit in query speed worth the performance loss in data modifications?
Shall I add two specific indexes that ideally support two queries, or is it better to add a single index that is not perfect for either query, but supports both somewhat?
GROUP BY
can also benefit from ordered indexes.