I have a query on a large table with millions of rows that looks like this:
SELECT COUNT(DISTINCT clicks.tx_id) AS unique_clicks_count,
       clicks.offer_id
FROM   clicks
WHERE  clicks.offer_id = 5
  AND  created_at > '2014-11-27 18:00:02';
created_at is a TIMESTAMP column. I have a compound index on (offer_id, created_at), which gets used. This is the EXPLAIN output:
| id | select_type | table  | type  | possible_keys                                             | key                              | key_len | ref  | rows   | Extra                            |
| 1  | SIMPLE      | clicks | range | clicks_created_at_index,clicks_offer_id_created_at_index | clicks_offer_id_created_at_index | 8       | NULL | 215380 | Using index condition; Using MRR |
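(For reference, a minimal sketch of the compound index being described, using the index name shown in the EXPLAIN; the exact definition is an assumption.)

-- Assumed definition of the existing compound index.
CREATE INDEX clicks_offer_id_created_at_index
    ON clicks (offer_id, created_at);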
Keeping the range in mind, what kind of index would I need to count the distinct tx_id's efficiently, most likely one that covers tx_id as well? What would the index look like without specifying clicks.offer_id = 5, and instead doing GROUP BY offer_id?
1 Answer
You have the best index there is. It is in the right order, and the EXPLAIN says "Using index", which means that it read the index to get the answer, and did not have to reach into the data.
(To further address all the comments...)
Note that it needed to read about 200K rows (of the index) to do the count. That many rows takes time.
INDEX(offer_id, created_at) versus INDEX(offer_id, created_at, tx_id) -- Apparently you are using InnoDB and tx_id is the PRIMARY KEY. The PK is included in every secondary key, so these two index specifications are virtually identical.
Order of the columns in an INDEX usually matters. And it does matter here. The fields must be in this order: (1) all the "=" conditions (offer_id), (2) one range (created_at), and (3) all the other fields needed to make it "Using index", in any order (tx_id).
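A minimal sketch of an index that follows this pattern for the query above (the index name is made up; with InnoDB the trailing tx_id would be appended implicitly if it were the primary key, but spelling it out makes the covering intent explicit):

-- (1) the '=' column, (2) the range column, (3) the counted column; name is hypothetical.
CREATE INDEX clicks_offer_created_tx_index
    ON clicks (offer_id, created_at, tx_id);

-- With that index, the original query can be answered from the index alone
-- ("Using index" in EXPLAIN), without reading the table rows.
SELECT COUNT(DISTINCT clicks.tx_id) AS unique_clicks_count,
       clicks.offer_id
FROM   clicks
WHERE  clicks.offer_id = 5
  AND  created_at > '2014-11-27 18:00:02';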
If you did not have offer_id = 5, follow the above pattern and get (1) nothing (no "=" conditions), (2) created_at, and (3) tx_id -- that is, INDEX(created_at, tx_id). Note that neither index works well for the other query.
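For illustration, a sketch of the GROUP BY variant of the query together with the index the answer suggests for it (the index name is made up):

-- No '=' condition, so the range column comes first, then tx_id; name is hypothetical.
CREATE INDEX clicks_created_at_tx_index
    ON clicks (created_at, tx_id);

SELECT clicks.offer_id,
       COUNT(DISTINCT clicks.tx_id) AS unique_clicks_count
FROM   clicks
WHERE  created_at > '2014-11-27 18:00:02'
GROUP BY clicks.offer_id;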
No kind of PARTITIONing would help performance at all. You don't need a 2-dimensional index (as in two ranges); you have "=" and "range", so a 'compound index' works best.
I suspect that "Using MRR" (Multi-Range Read Optimization) effectively replaces the "Using temporary" and "Using filesort" that might normally be used for DISTINCT.
- How do you know that the engine is InnoDB and that tx_id is the primary key? (Mind that if it is the PK, then the DISTINCT is redundant.) – ypercubeᵀᴹ, Feb 11, 2015 at 2:14
- If Using index condition; Using MRR shows anything, it is that tx_id is not part of the index (so it cannot be the PK or part of the PK). If it were, the Extra column would show Using index. – ypercubeᵀᴹ, Feb 11, 2015 at 2:26
- Thanks for the explanation. tx_id is not a primary nor a unique key. "No kind of PARTITIONing would help performance at all." Seeing as a lot of queries use the created_at timestamp column, I'm considering partitioning on that column. Why would that not help? – timetofly, Feb 11, 2015 at 16:55
- There are two ways to drill down to find the row(s) you need -- INDEX and PARTITION. If the INDEX will do the job, then PARTITION does not do it any better. If you did PARTITION BY RANGE(timestamp), the query would (1) prune down to the partition(s) in question, then use the index for the rest of the filtering. I like to invoke "count the disk hits" -- either way, the same number of blocks need to be touched to find the rows in question. (See the partitioning sketch after these comments.) – Rick James, Feb 11, 2015 at 18:42
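For illustration only, a hedged sketch of the range partitioning mentioned in the last comment; the partition boundaries are made up, and it assumes MySQL's rule that any primary or unique key must include the partitioning column is already satisfied:

-- Hypothetical partitioning on the created_at TIMESTAMP; boundaries are examples only.
ALTER TABLE clicks
PARTITION BY RANGE (UNIX_TIMESTAMP(created_at)) (
    PARTITION p2014_11 VALUES LESS THAN (UNIX_TIMESTAMP('2014-12-01 00:00:00')),
    PARTITION p2014_12 VALUES LESS THAN (UNIX_TIMESTAMP('2015-01-01 00:00:00')),
    PARTITION pmax     VALUES LESS THAN MAXVALUE
);
-- Even after pruning to the right partition(s), the (offer_id, created_at)
-- index is still what locates the matching rows, so roughly the same number
-- of blocks must be read -- which is the point the answer makes.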