I have a single reporting table of sales data with about 4 million rows of data:
CREATE TABLE reporting_sales (
  customer_id bigint(20) DEFAULT NULL,
  effective_date date DEFAULT NULL,
  expiration_date date DEFAULT NULL,
  license_type_id int(11) DEFAULT NULL,
  residency varchar(10) DEFAULT NULL,
  gender varchar(10) DEFAULT NULL,
  age_range varchar(10) DEFAULT NULL,
  KEY ndx_reporting_sales (license_type_id,
                           effective_date,
                           expiration_date,
                           customer_id,
                           residency,
                           gender,
                           age_range) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;
And this is the statement I want to run to summarize the data as of a particular day:
SELECT COUNT(DISTINCT customer_id),
       license_type_id,
       residency,
       gender,
       age_range
FROM reporting_sales
WHERE license_type_id IN (1, 2, 3, 4, 5)
  AND effective_date <= '2021-01-01'
  AND expiration_date >= '2021-01-01'
GROUP BY license_type_id, residency, gender, age_range
I'm not sure how the index should be structured, specifically with respect to the customer_id field and the grouping.
Here's the explain for the index I have created, as shown above:
id | select_type | table | partitions | type | possible_keys | key | key_len | ref | rows | filtered | Extra |
---|---|---|---|---|---|---|---|---|---|---|---|
1 | SIMPLE | reporting_sales | | range | ndx_reporting_sales | ndx_reporting_sales | 4 | | 1829784 | 16.66 | Using where; Using index; Using filesort |
How can I improve the performance of this statement and/or what would be a more suitable index?
1 Answer
A general rule for multi-column indexes is that you can have N leading columns in the index that are used in equality conditions.
Then you can have one more column in the index after those equality columns, to use for either inequality/range conditions, or grouping, or sorting. But not more than one.
Any further columns are not used for searching, sorting, or grouping. At best, they're used for a covering index.
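As a rough illustration of this rule, here is a minimal sketch against the reporting_sales table from the question. The index name ndx_rule_demo and the literal values are assumptions made up for this example, not something from the original post.

```sql
-- Hypothetical index: one equality column, then one range column,
-- then a trailing column that can only serve as covering data.
ALTER TABLE reporting_sales
  ADD INDEX ndx_rule_demo (license_type_id, expiration_date, customer_id);

-- license_type_id is compared with "=", so it can be a leading search column.
-- expiration_date is the single range column allowed after the equality part.
-- customer_id is not used for searching; it is only read from the index,
-- which lets the query be answered without touching the table rows.
EXPLAIN
SELECT COUNT(DISTINCT customer_id)
FROM reporting_sales
WHERE license_type_id = 3
  AND expiration_date >= '2021-01-01';
```

Comparing the key_len value in this EXPLAIN output with the one for the original index should show whether the second column is being used for the search.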
In the query you show, you have three range conditions. Only one of these conditions can make use of the index.
You can tell from the EXPLAIN report's key_len column that it is only using 4 bytes of your index. That's for the first column, license_type_id, which is a 4-byte integer. The other columns of the index are ignored for this query. They don't help narrow down the examined rows, nor do they help the GROUP BY.
In the query you show, that's the best you can do.
Possible exception to the above rule: MySQL 8.0.13 implemented the skip scan range access method, which might help in some cases, but there are a lot of limitations. Read the section https://dev.mysql.com/doc/refman/8.0/en/range-optimization.html#range-access-skip-scan for details.
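For anyone who wants to experiment with skip scan, the following is a sketch only. It assumes MySQL 8.0.13 or later and uses a simplified query, because the limitations documented on that page appear to exclude queries with GROUP BY or DISTINCT, which would rule out the query in the question.

```sql
-- Check whether the skip_scan optimizer flag is enabled (returns 1 or 0).
SELECT @@optimizer_switch LIKE '%skip_scan=on%' AS skip_scan_enabled;

-- Ask the optimizer to consider skip scan on the existing index.
-- There is no condition on the leading column license_type_id here,
-- only a range on the second column. The optimizer may still decline;
-- when skip scan is chosen, the Extra column of EXPLAIN reports
-- "Using index for skip scan".
EXPLAIN
SELECT /*+ SKIP_SCAN(reporting_sales ndx_reporting_sales) */
       customer_id, license_type_id, effective_date
FROM reporting_sales
WHERE effective_date <= '2021-01-01';
```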
- Thank you @bill-karwin, for the explanation. Going to check out your book. So, I should just build the index using whichever of the three range conditions is most selective, for instance, (expiration_date)? Will it accomplish anything to add one of the grouping columns, say (expiration_date, age_range)? (Yardboy, Jan 12, 2023 at 15:26)
- Thanks for checking out my book (be sure to get the 2022 revision)! (Bill Karwin, Jan 12, 2023 at 15:29)
- Yes, pick the column that is most selective, in other words the one that reduces the examined rows to the smallest number. Note this might change over time. If your query has a range condition, then no further columns of the index can optimize GROUP BY or ORDER BY. (Bill Karwin, Jan 12, 2023 at 15:30)
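One way to compare how selective each condition is on the current data is to count the matching rows per condition. This is a sketch that reuses the literal values from the question; it scans the whole table, so it is meant as a one-off check rather than something to run routinely.

```sql
-- In MySQL a comparison evaluates to 1 (true), 0 (false), or NULL,
-- so SUM() over a comparison counts the rows that satisfy it.
SELECT
  COUNT(*)                                   AS total_rows,
  SUM(license_type_id IN (1, 2, 3, 4, 5))    AS rows_matching_license,
  SUM(effective_date  <= '2021-01-01')       AS rows_matching_effective,
  SUM(expiration_date >= '2021-01-01')       AS rows_matching_expiration
FROM reporting_sales;
```

The condition with the smallest count relative to total_rows is the most selective one for this particular date. As noted above, that can change as the data or the reporting date changes.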
- @BillKarwin - All the columns seem to be NULLable. So, a key_len of 4 would imply a column of length 3, such as a DATE. (Rick James, Jan 12, 2023 at 20:34)
(license_type_id, effective_date) or (license_type_id, expiration_date) (use whichever index shows the best selectivity; I predict that it will be the latter one). In any case, the rest of your index cannot be used for the query shown. PRIMARY KEY?
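The two candidate indexes mentioned here could be tried side by side along these lines. The index names are hypothetical, and FORCE INDEX is used only to make the comparison explicit. Neither candidate covers the query, so the optimizer might still prefer the original index when left to choose.

```sql
-- Hypothetical index names; the column lists are the two candidates above.
ALTER TABLE reporting_sales
  ADD INDEX ndx_license_effective  (license_type_id, effective_date),
  ADD INDEX ndx_license_expiration (license_type_id, expiration_date);

-- Run this once per candidate (swap the index name in FORCE INDEX)
-- and compare the key_len and rows columns of the two EXPLAIN outputs.
EXPLAIN
SELECT COUNT(DISTINCT customer_id),
       license_type_id, residency, gender, age_range
FROM reporting_sales FORCE INDEX (ndx_license_expiration)
WHERE license_type_id IN (1, 2, 3, 4, 5)
  AND effective_date <= '2021-01-01'
  AND expiration_date >= '2021-01-01'
GROUP BY license_type_id, residency, gender, age_range;
```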