How to structure an index to support counting distinct values of one column and grouping by several others?

Question 1

I have a single reporting table of sales data with about 4 million rows of data:

CREATE TABLE reporting_sales (
 customer_id bigint(20) DEFAULT NULL,
 effective_date date DEFAULT NULL,
 expiration_date date DEFAULT NULL,
 license_type_id int(11) DEFAULT NULL,
 residency varchar(10) DEFAULT NULL,
 gender varchar(10) DEFAULT NULL,
 age_range varchar(10) DEFAULT NULL,
 KEY ndx_reporting_sales (license_type_id,
 effective_date,
 expiration_date,
 customer_id,
 residency,
 gender,
 age_range) USING BTREE
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
;

And this is the statement I want to run to summarize the data as of a particular day:

SELECT COUNT(DISTINCT customer_id),
 license_type_id,
 residency,
 gender,
 age_range
FROM tmp_reporting_sales_fl
WHERE license_type_id in (1, 2, 3, 4, 5)
 AND effective_date <= '2021-01-01'
 AND expiration_date >= '2021-01-01'
GROUP BY license_type_id, residency, gender, age_range

I'm not sure how the index should be structured, specifically with respect to the customer_id field and the grouping.

Here's the explain for the index I have created, as shown above:

id	select_type	table	partitions	type	possible_keys	key	key_len	ref	rows	filtered	Extra
1	SIMPLE	reporting_sales	range	ndx_reporting_sales	ndx_reporting_sales	4	1829784	16.66	Using where; Using index; Using filesort

How can I improve the performance of this statement and/or what would be a more suitable index?

Question 2

Use the index by either (license_type_id, effective_date) or (license_type_id, expiration_date) (use those index which shows best selectivity, I predict that this will be the latter one). Anycase the rest of your index cannot work for shown query.

Question 3

What's the PRIMARY KEY?

Question 4

A general rule for multi-column indexes is that you can have N leading columns in the index that are used in equality conditions.

Then you can have one more column in the index after those equality columns, to use for either inequality/range conditions, or grouping, or sorting. But not more than one.

Any further columns are not used for searching, sorting, or grouping. At best, they're used for a covering index.

In the query you show, you have three range conditions. Only one of these conditions can make use of the index.

You can tell from the EXPLAIN report's len column that it is only using 4 bytes of your index. That's for the first column license_type_id, which is a 4-byte integer. The other columns of the index are ignored for this query. They don't help narrow down the examined rows, nor do they help the group by.

In the query you show, that's the best you can do.

Possible exception to the above rule: MySQL 8.0.13 implemented the skip scan range access method, which might help in some cases, but there are a lot of limitations. Read the section https://dev.mysql.com/doc/refman/8.0/en/range-optimization.html#range-access-skip-scan for details.

Question 5

Thank you @bill-karwin, for the explanation. Going to check out your book. So, I should just build the index using whichever of the three range conditions is most selective, for instance, (expiration_date)? Will it accomplish anything to add one of the grouping columns, say (expiration_date, age_range)?

Question 6

Thanks for checking out my book (be sure to get the 2022 revision)!

Question 7

Yes, pick the column that is most selective, in other words reduces the examined rows the the smallest number. Note this might change over time. If your query has a range condition, then no further columns of the index can optimize group by or order by.

Question 8

@BillKarwin - All the columns seem to be NULLable. So, a key_len of 4 would imply a column of length 3, such as a DATE.

Bill Karwin Bill Karwin 17.2k3 gold badges32 silver badges45 bronze badges · Accepted Answer · 2023-01-12 14:35:40Z

A general rule for multi-column indexes is that you can have N leading columns in the index that are used in equality conditions.

Then you can have one more column in the index after those equality columns, to use for either inequality/range conditions, or grouping, or sorting. But not more than one.

Any further columns are not used for searching, sorting, or grouping. At best, they're used for a covering index.

In the query you show, you have three range conditions. Only one of these conditions can make use of the index.

You can tell from the EXPLAIN report's len column that it is only using 4 bytes of your index. That's for the first column license_type_id, which is a 4-byte integer. The other columns of the index are ignored for this query. They don't help narrow down the examined rows, nor do they help the group by.

In the query you show, that's the best you can do.

Possible exception to the above rule: MySQL 8.0.13 implemented the skip scan range access method, which might help in some cases, but there are a lot of limitations. Read the section https://dev.mysql.com/doc/refman/8.0/en/range-optimization.html#range-access-skip-scan for details.

Thank you @bill-karwin, for the explanation. Going to check out your book. So, I should just build the index using whichever of the three range conditions is most selective, for instance, (expiration_date)? Will it accomplish anything to add one of the grouping columns, say (expiration_date, age_range)?
Thanks for checking out my book (be sure to get the 2022 revision)!
Yes, pick the column that is most selective, in other words reduces the examined rows the the smallest number. Note this might change over time. If your query has a range condition, then no further columns of the index can optimize group by or order by.
@BillKarwin - All the columns seem to be NULLable. So, a key_len of 4 would imply a column of length 3, such as a DATE.

Stack Exchange Network

How to structure an index to support counting distinct values of one column and grouping by several others?

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

How to structure an index to support counting distinct values of one column and grouping by several others?

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions