I have a 'joinings' table which lists the joining date of all the employees along with their office location, department number, and the job advertisement through which they were selected. I want to be able to query this table by joining date and group the results by one of the other columns. I also want to be able to apply a WHERE clause on any of the columns.
It could be like this -
```sql
SELECT MONTH(joining_date) AS month, COUNT(*) AS entries, branch_name
FROM joinings
WHERE joining_date >= '2017-01-01'
  AND joining_date <= '2017-05-31'
  AND department_no IN (5,4,7)
  AND source_of_application IN ('glassdoor', 'linkedin')
GROUP BY MONTH(joining_date), branch_name
```
Or it could be like this -
```sql
SELECT YEAR(joining_date) AS year, COUNT(*) AS entries, source_of_application
FROM joinings
WHERE joining_date >= '2017-01-01'
  AND joining_date <= '2019-12-31'
  AND department_no IN (5,4,7)
  AND branch_name IN ('ZEIT','DUS')
GROUP BY YEAR(joining_date), source_of_application
```
This table can contain thousands of records for several years. The range of the joining date that I would query could be between a single month, range of months or range of years.
I would like to know which indexes to create to get optimized performance for my SELECT queries. If not the exact indexes, I am at least looking for pointers to get me started on creating the right ones.
What I have -
I currently have multi-column indexes on:
- joinings(joining_date, department_no)
- joinings(joining_date, branch_name)
- joinings(joining_date, source_of_application)
and also individual indexes on each single column. But my SELECT performs a full table scan for the queries listed above.
2 Answers
This is a "range": `joining_date >= '2017-01-01' AND joining_date <= '2017-05-31'`.
This is sort of a range: `department_no IN (5,4,7)`. There are cases where it acts efficiently like `=`; there are cases where it should be considered a "range".
This is an `=`: `department_no IN (5)`. That is, the Optimizer turns it into `department_no = 5`, which has much more optimization potential.
http://mysql.rjweb.org/doc.php/index_cookbook_mysql says to put range columns last, not first, in an index.
In your queries, it is hard to predict what is optimal.
If you are likely to sometimes have single-item `IN`s, then you need

```sql
INDEX(department_no, joining_date)  -- in THIS order
```

Also, newer versions of MySQL may work well with that index for the multi-item `IN` by leapfrogging across the table. So add that index.
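As a concrete sketch, that index could be added like this (the index name `idx_dept_date` is my own invention, not from the answer):

```sql
-- The IN column first, the range column last.
ALTER TABLE joinings
  ADD INDEX idx_dept_date (department_no, joining_date);

-- A single-item IN collapses to department_no = 5, so both
-- index columns can narrow the scan:
SELECT COUNT(*)
FROM joinings
WHERE department_no IN (5)
  AND joining_date >= '2017-01-01'
  AND joining_date <  '2017-06-01';
```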
As for `INDEX(joining_date, ...)`: when using a 'range', that index will not get past `joining_date` to whatever other column(s) follow it. So you may as well just say

```sql
INDEX(joining_date)
```
As for the GROUP BY... Neither of your queries has any hope of using an index for the GROUP BY. So there will be some form of extra effort after the WHERE -- a tmp table, a sort, something.
- Hiding a column in a function (`month(joining_date)`) -- the Optimizer can't use the `joining_date` column in any index for this.
- The WHERE left the results in some messed-up order (assuming a multi-item `IN`).
Unrelated... I like this pattern:
```sql
AND joining_date >= '2017-01-01'
AND joining_date <  '2017-01-01' + INTERVAL 1 MONTH
```
It avoids computing the end date, and sidesteps leap years, DATE vs DATETIME vs DATETIME(6) issues, etc.
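Applied to the first query in the question, the pattern looks like this (a sketch; the 5-month interval reproduces the original 2017-01-01 through 2017-05-31 window as a half-open range):

```sql
-- Same five-month filter, written with a half-open interval so the
-- end point never needs to be hand-computed:
SELECT MONTH(joining_date) AS month, COUNT(*) AS entries, branch_name
FROM joinings
WHERE joining_date >= '2017-01-01'
  AND joining_date <  '2017-01-01' + INTERVAL 5 MONTH
  AND department_no IN (5,4,7)
  AND source_of_application IN ('glassdoor', 'linkedin')
GROUP BY MONTH(joining_date), branch_name;
```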
Bottom line: have this one index:

```sql
INDEX(branch_name, department_no, joining_date)
```

Then get the `EXPLAIN SELECT ...` to see if it did the leapfrogging. If that fails, let's rethink.
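A sketch of that check, using the second query from the question (the index name `idx_branch_dept_date` is assumed, not from the answer):

```sql
ALTER TABLE joinings
  ADD INDEX idx_branch_dept_date (branch_name, department_no, joining_date);

EXPLAIN
SELECT YEAR(joining_date) AS yr, COUNT(*) AS entries, source_of_application
FROM joinings
WHERE joining_date >= '2017-01-01'
  AND joining_date <  '2020-01-01'
  AND department_no IN (5,4,7)
  AND branch_name IN ('ZEIT','DUS')
GROUP BY yr, source_of_application;
-- In the EXPLAIN output, look for type=range with
-- key=idx_branch_dept_date rather than type=ALL (a full table scan).
```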
You probably don't have an index problem so much as an optimizer problem. I would consider looking into your database's statistics and rewriting your query with the optimizer in mind.
It's important to remember that a query describes the logical result set you want to arrive at but the optimizer has multiple pathways to get there in terms of physical operations. You and I see one query above but the optimizer might see several operations to get there. The optimizer chooses which pathway to take depending on database statistics, available resources, and what result sets the optimizer is able to anticipate.
Certain operators like >=, <=, and IN can make it difficult for the optimizer to anticipate what will be most efficient to run against these WHERE clauses, and this is likely what is leading the optimizer to skip your indexes.
There are various ways to push the optimizer in various directions, but the right choice depends heavily on the qualities of the base data set you're working with. One thing that has worked for me in the past is breaking up a query into smaller discrete steps, especially when I know I'm significantly reducing the size of the data set that the rest of the query will have to consider. Looking at your query, ask yourself (or even better, test) which WHERE clauses return the smallest counts, and return that set first before asking the query to run the rest of the operations. Depending on your method and system, doing so can effectively get the optimizer to "see" the resulting subset for that part of the query and then make better choices about how to finish processing efficiently.
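One way that "smaller discrete steps" idea could look is a sketch like the following; which filter is actually most selective depends entirely on your data, so test the counts first (the temporary table name `narrowed` and the choice of filters are assumptions for illustration):

```sql
-- Step 1: materialize the (presumed) most selective filters first.
CREATE TEMPORARY TABLE narrowed AS
SELECT joining_date, source_of_application
FROM joinings
WHERE branch_name IN ('ZEIT','DUS')   -- assumed most selective
  AND department_no IN (5,4,7);

-- Step 2: aggregate over the much smaller set.
SELECT YEAR(joining_date) AS yr, COUNT(*) AS entries, source_of_application
FROM narrowed
WHERE joining_date >= '2017-01-01'
  AND joining_date <  '2020-01-01'
GROUP BY yr, source_of_application;
```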
- Comment (Rick James, Jul 20, 2017): "Breaking into smaller steps" is almost always folly -- the overhead of the extra round-trips to the server is likely to outweigh the benefits.
- Comment (MMartinez, Jul 20, 2017): Generally speaking I agree with you, but I have run into scenarios where it works out better, primarily when the specific data sets are statistically skewed in some abnormal way that really throws the optimizer off. I would consider those scenarios the exception and not the rule, so your point about the risk of increased overhead is definitely something to keep in mind.
By using `IN`, `<=`, and `>=` you are not helping the optimizer decide to use indexes, especially if the table does not occupy that many blocks on disk. Using many indexes also slows down inserts, deletes, and updates of rows in the table.