MySQL: Improving performance on a large table for a dynamically built query with the correct composite index
Assume I have a MySQL table with ~30 million entries and 40 columns. I have a highly active query (5 queries/second) which is quite slow (avg. ~20 seconds) and has a high number of rows scanned (avg. 50,000 rows). Performance is getting worse and worse as the table grows. I want to solve the problem by adding the correct composite or even covering index.
The Doctrine query is built by a dynamic query builder and involves the following properties (only user_id is used in every query; all other columns are only sometimes used for filtering):
- Always: user_id int with = [> 1 m users, but single users may have > 200K entries]
- Sometimes: status varchar(20) with IN() [7 possibilities]
- Sometimes: expiration_timestamp datetime with < [can be any timestamp]
- Sometimes: type varchar(20) with IN() [7 possibilities]
- Rare: name varchar(255) with LIKE [with trailing wildcard, rarely repetitive]
- Very rare: tags varchar(2000) with LIKE [with leading and trailing wildcard]
- Often: orderBy id int DESC [id is the primary key, the orderBy is necessary]
Without having tested it (will require a production deployment with maintenance window including short downtime) I would propose the following solution:
CREATE INDEX listing ON items(user_id,status,type,name,expiration_timestamp,id);
Here is my reasoning: First of all, the user_id is always used with an equality comparison, so this should be first. status and type have an IN clause, therefore they should be second. The third one is name, because even if LIKE with a trailing wildcard is used, it is highly selective. Indexing the expiration_timestamp will help to significantly reduce the number of results. As MySQL uses indices for ordering, it makes sense to put the id at the end of the composite index. There is no reason to put tags in the index, because an index on a LIKE with a leading wildcard is useless.
Is this the correct approach or would you recommend to improve something here?
One more thing I am not sure about: if the query is, for example, without type or status, will MySQL be "intelligent" enough to use my composite index anyway? I am still quite new to MySQL indexing, thank you for your help!
3 Answers
There are several problems with such an index.
All the conditions you describe other than user_id = ? are considered range conditions. A range condition is any condition that matches multiple values in each case. So <, IN(), and LIKE are all range conditions.
Here's the first problem: in a composite index, only one column involved in a range condition will be used.
Example: Suppose you had an index on (a,b,c) in a hypothetical table.
SELECT ... WHERE a = 1 AND b < 10 AND c IN (1,2,3)
This would use only the (a,b) columns of the index. After the first column used in a range condition, conditions for the subsequent column(s) of the index would need to be evaluated "the hard way," row by row.
Actually, there's one mitigation for that, which is index condition pushdown. This happens automatically. But this isn't as good as a real index lookup.
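One way to check how much of a composite index a query really uses is to look at the key_len and Extra columns of EXPLAIN. The following is a minimal sketch, assuming a hypothetical InnoDB table t (the table, the payload column, and the index name idx_abc are made up for illustration and are not from the question):

```sql
-- Hypothetical table for illustration only.
CREATE TABLE t (
  id      INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
  a       INT NOT NULL,
  b       INT NOT NULL,
  c       INT NOT NULL,
  payload VARCHAR(100),
  KEY idx_abc (a, b, c)
) ENGINE=InnoDB;

-- Equality on a, range on b, IN on c (the example query above).
EXPLAIN
SELECT payload
FROM t
WHERE a = 1 AND b < 10 AND c IN (1, 2, 3);
```

On a populated table, a key_len that covers only a and b (two 4-byte INT columns) would suggest that c is not part of the index lookup, and "Using index condition" in Extra would suggest that index condition pushdown is filtering on c. The exact plan depends on the MySQL version and the table statistics, so treat this as something to verify rather than a guaranteed result.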
A second problem is that the columns used in the index must be consecutive. If you try to "skip" a column, the optimizer can't use the index columns that come after the gap.
Example:
SELECT ... WHERE a = 1 AND c IN (1,2,3)
I said you can have one column in addition to the columns used for equality, which this example query appears to satisfy. But if the index is on columns (a,b,c), but no condition on b is in this query, then the condition on column c can't use the index either.
A third problem is that ORDER BY optimization is also spoiled by having any range condition in the query. That is, once the query does a range condition, the sort order isn't implicit from the index order.
So the bottom line is that given your dynamic queries, with a mix of different conditions which may or may not be included in a given run, you can't make a single composite index that will satisfy all cases.
What you could do is create several composite indexes:
(user_id, status)
(user_id, expiration_timestamp)
(user_id, type)
(user_id, name)
Then let the optimizer pick the one that is most relevant to a given query, based on the dynamic conditions included.
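As a sketch, the four indexes listed above could be created in a single statement. The table name items comes from the question; the index names are illustrative assumptions:

```sql
-- Index names are hypothetical; pick names that fit your conventions.
ALTER TABLE items
  ADD INDEX idx_user_status     (user_id, status),
  ADD INDEX idx_user_expiration (user_id, expiration_timestamp),
  ADD INDEX idx_user_type       (user_id, type),
  ADD INDEX idx_user_name       (user_id, name);

-- Check which index the optimizer picks for one of the dynamic variants.
-- The literal values here are placeholders.
EXPLAIN
SELECT id
FROM items
WHERE user_id = 42
  AND type IN ('a', 'b');
```

The key column of the EXPLAIN output shows which index was chosen for that particular combination of conditions. Building several indexes on a table of this size takes time and disk space, so it is worth trying on a copy of the data first.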
But regardless, ORDER BY id will require a filesort.
- All "=" columns first -- user_id.
- Then some or all IN columns -- If the SELECT has only one item in the IN [How often does that happen?], the Optimizer turns that into an =. Also, there may be cases where the Optimizer can hop through the index.
- Then any range tests: BETWEEN, <, LIKE without leading wildcard, etc. -- expiration_timestamp
Since the queries are dynamically created, it behooves you to tack extra columns on (within limits). But it is also unwise to have any index whose columns exactly match the first few columns of another one. So, get rid of INDEX(user_id).
To handle the single-valued-IN cases, I will add to Bill's suggestions:
(user_id, status, expiration_timestamp)
(user_id, expiration_timestamp)
(user_id, type, expiration_timestamp)
(user_id, name)
As for ENUM (1 byte) for status and type: it is worth considering since it would save a GB or so of disk space (between the data and each index with that column). It won't appreciably change the performance of any index.
Turn on the slowlog with a suitable value of long_query_time. Later check the slowlog to see which combinations of columns lead to the slowest queries. Then cautiously add more composite indexes.
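A minimal sketch of turning on the slow query log at runtime follows. The 1-second threshold is an assumed starting point, not a recommendation from the answer, and the statements need a privileged account:

```sql
-- Enable the slow query log without restarting the server.
SET GLOBAL slow_query_log = 'ON';

-- Log statements slower than 1 second (assumed threshold; tune to taste).
-- The new global value applies to connections opened after this change.
SET GLOBAL long_query_time = 1;

-- Find out where the log file is written.
SHOW VARIABLES LIKE 'slow_query_log_file';
```

Settings changed with SET GLOBAL are lost on restart unless they are also written to the server configuration file.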
An anomaly you might encounter: with ORDER BY id DESC, some queries will be able to avoid the "filesort". The PRIMARY KEY is silently tacked on the end of each secondary key. So, if you have a WHERE clause with only = tests, the id on the end will obviate the filesort.
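To illustrate, here is a sketch that compares two variants of the query, assuming an InnoDB table with an index on (user_id, status). The status values 'active' and 'expired' are hypothetical placeholders:

```sql
-- Only "=" tests: the index on (user_id, status) is implicitly
-- (user_id, status, id), so rows already come back in id order.
EXPLAIN
SELECT id
FROM items
WHERE user_id = 42
  AND status = 'active'
ORDER BY id DESC
LIMIT 20;

-- Multi-valued IN: rows for different status values are interleaved
-- with respect to id, so a sort is typically still needed.
EXPLAIN
SELECT id
FROM items
WHERE user_id = 42
  AND status IN ('active', 'expired')
ORDER BY id DESC
LIMIT 20;
```

If the Extra column of the first plan does not contain "Using filesort" while the second one does, that is the anomaly described above. The optimizer may still choose a different plan depending on statistics, so check the output on real data.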
I'd like to add that it's extremely helpful to have a go-to tool for general analysis. I'm using JetProfiler, but there are alternatives such as Percona Toolkit.
On several occasions I've found that some particular set of queries, surprisingly, was the actual culprit rather than the queries I expected to be the problem.
user_id to locate any subset of that small size should be quick. As Barry mentioned, use EXPLAIN ANALYZE to understand all of your bottlenecks so you can decide how to appropriately tune your indexes / queries / table design. It's possible that multiple smaller indexes, rather than the single large composite index you mentioned in your post, would be best.
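As a sketch of that suggestion, EXPLAIN ANALYZE (available in MySQL 8.0.18 and later) can be run against one variant of the dynamic query. The literal values below are placeholders:

```sql
-- Note: EXPLAIN ANALYZE actually executes the query and reports
-- measured timings and row counts alongside the optimizer's estimates.
EXPLAIN ANALYZE
SELECT id
FROM items
WHERE user_id = 42
  AND expiration_timestamp < '2024-01-01 00:00:00'
ORDER BY id DESC
LIMIT 20;
```

Large gaps between the estimated and actual row counts in the output are a hint that the chosen index does not filter as well as the optimizer expected. Because the statement runs the query for real, a slow query will take just as long under EXPLAIN ANALYZE.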