I have a 'joinings' table which lists the joining date of all the employees along with their office location, department number, and the job advertisement through which they were selected. I want to be able to query this table by joining date and group the results by one of the other columns. I also want to be able to apply a WHERE clause on any of the columns.
It could be like this -
```sql
SELECT MONTH(joining_date) AS month, COUNT(*) AS entries, branch_name
FROM joinings
WHERE joining_date >= '2017-01-01'
  AND joining_date <= '2017-05-31'
  AND department_no IN (5,4,7)
  AND source_of_application IN ('glassdoor', 'linkedin')
GROUP BY MONTH(joining_date), branch_name
```
Or it could be like this -
```sql
SELECT YEAR(joining_date) AS year, COUNT(*) AS entries, source_of_application
FROM joinings
WHERE joining_date >= '2017-01-01'
  AND joining_date <= '2019-12-31'
  AND department_no IN (5,4,7)
  AND branch_name IN ('ZEIT','DUS')
GROUP BY YEAR(joining_date), source_of_application
```
This table can contain thousands of records for several years. The range of the joining date that I would query could be between a single month, range of months or range of years.
I would like to know which indexes to create to get optimized performance for my SELECT queries. If not the exact indexes, I am at least looking for pointers to get me started on creating the right ones.
What I have -
I currently have multi-column indexes on:
- joinings(joining_date, department_no)
- joinings(joining_date, branch_name)
- joinings(joining_date, source_of_application)
and also individual indexes on each single column. But my SELECT performs a full table scan for the queries listed above.
2 Answers
This is a "range": `joining_date >= '2017-01-01' AND joining_date <= '2017-05-31'`.
This is sort of a range: `department_no IN (5,4,7)`. There are cases where it acts efficiently like `=`; there are cases where it should be considered a "range".
This is an `=`: `department_no IN (5)`. That is, the Optimizer turns it into `department_no = 5`, which has much more optimization potential.
http://mysql.rjweb.org/doc.php/index_cookbook_mysql says to put range columns last, not first, in an index.
In your queries, it is hard to predict what is optimal.
If you are likely to sometimes have single-item `IN`s, then you need

```sql
INDEX(department_no, joining_date)  -- in THIS order
```

Also, newer versions of MySQL may work well with that index for the multi-item `IN` by leapfrogging across the table. So add that index.
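As a concrete sketch, that index could be added like this (the index name `idx_dept_date` is my own invention, not from the answer):

```sql
-- The IN column first, the range column last.
ALTER TABLE joinings
  ADD INDEX idx_dept_date (department_no, joining_date);

-- A single-item IN collapses to department_no = 5, so both
-- index columns can narrow the scan:
SELECT COUNT(*)
FROM joinings
WHERE department_no IN (5)
  AND joining_date >= '2017-01-01'
  AND joining_date <  '2017-06-01';
```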
As for `INDEX(joining_date, ...)`: when using a 'range', that index will not get past `joining_date` to whatever other column(s) follow it. So you may as well just say

```sql
INDEX(joining_date)
```
As for the GROUP BY... Neither of your queries has any hope of using an index for the GROUP BY. So there will be some form of extra effort after the WHERE -- a tmp table, a sort, something.
- Hiding a column in a function (`month(joining_date)`) -- the Optimizer can't use the `joining_date` column in any index for this.
- The WHERE left the results in some messed-up order (assuming a multi-item `IN`).
Unrelated... I like this pattern:
```sql
AND joining_date >= '2017-01-01'
AND joining_date <  '2017-01-01' + INTERVAL 1 MONTH
```
It avoids computing the end date, and sidesteps leap years, DATE vs DATETIME vs DATETIME(6) issues, etc.
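Applied to the first query in the question, the pattern looks like this (a sketch; the 5-month interval reproduces the original 2017-01-01 through 2017-05-31 window as a half-open range):

```sql
-- Same five-month filter, written with a half-open interval so the
-- end point never needs to be hand-computed:
SELECT MONTH(joining_date) AS month, COUNT(*) AS entries, branch_name
FROM joinings
WHERE joining_date >= '2017-01-01'
  AND joining_date <  '2017-01-01' + INTERVAL 5 MONTH
  AND department_no IN (5,4,7)
  AND source_of_application IN ('glassdoor', 'linkedin')
GROUP BY MONTH(joining_date), branch_name;
```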
Bottom line: have this one index:

```sql
INDEX(branch_name, department_no, joining_date)
```

Then get the `EXPLAIN SELECT ...` to see if it did the leapfrogging. If that fails, let's rethink.
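A sketch of that check, using the second query from the question (the index name `idx_branch_dept_date` is assumed, not from the answer):

```sql
ALTER TABLE joinings
  ADD INDEX idx_branch_dept_date (branch_name, department_no, joining_date);

EXPLAIN
SELECT YEAR(joining_date) AS yr, COUNT(*) AS entries, source_of_application
FROM joinings
WHERE joining_date >= '2017-01-01'
  AND joining_date <  '2020-01-01'
  AND department_no IN (5,4,7)
  AND branch_name IN ('ZEIT','DUS')
GROUP BY yr, source_of_application;
-- In the EXPLAIN output, look for type=range with
-- key=idx_branch_dept_date rather than type=ALL (a full table scan).
```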
You probably don't have an index problem so much as an optimizer problem. I would consider looking into your database's statistics and rewriting your query with the optimizer in mind.
It's important to remember that a query describes the logical result set you want to arrive at but the optimizer has multiple pathways to get there in terms of physical operations. You and I see one query above but the optimizer might see several operations to get there. The optimizer chooses which pathway to take depending on database statistics, available resources, and what result sets the optimizer is able to anticipate.
Certain operators like >=, <=, and IN can make it difficult for the optimizer to anticipate what will be most efficient to run against these WHERE clauses, and this is likely what is leading the optimizer to skip your indexes.
There are various ways to push the optimizer in various directions, but the right choice depends heavily on the qualities of the base data set you're working with. One thing that has worked for me in the past is breaking up a query into smaller discrete steps, especially when I know I'm significantly reducing the size of the data set that the rest of the query will have to consider. Looking at your query, ask yourself (or even better, test) which WHERE clauses return the smallest counts, and return that set first before asking the query to run the rest of the operations. Depending on your method and system, doing so can effectively get the optimizer to "see" the resulting subset for that part of the query and then make better choices about how to finish processing efficiently.
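One way that "smaller discrete steps" idea could look is a sketch like the following; which filter is actually most selective depends entirely on your data, so test the counts first (the temporary table name `narrowed` and the choice of filters are assumptions for illustration):

```sql
-- Step 1: materialize the (presumed) most selective filters first.
CREATE TEMPORARY TABLE narrowed AS
SELECT joining_date, source_of_application
FROM joinings
WHERE branch_name IN ('ZEIT','DUS')   -- assumed most selective
  AND department_no IN (5,4,7);

-- Step 2: aggregate over the much smaller set.
SELECT YEAR(joining_date) AS yr, COUNT(*) AS entries, source_of_application
FROM narrowed
WHERE joining_date >= '2017-01-01'
  AND joining_date <  '2020-01-01'
GROUP BY yr, source_of_application;
```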
- Comment (Rick James, Jul 20, 2017): "Breaking into smaller steps" is almost always folly -- the overhead of the extra round-trips to the server is likely to outweigh the benefits.
- Comment (MMartinez, Jul 20, 2017): Generally speaking I agree with you, but I have run into scenarios where it works out better, primarily when the specific data sets are statistically skewed in some abnormal way that really throws the optimizer off. I would consider those scenarios the exception and not the rule, so your point about the risk of increased overhead is definitely something to keep in mind.
By using `IN`, `<=`, and `>=` you are not helping the optimizer decide to use indexes, especially if the table does not occupy that many blocks on disk. Using many indexes also slows down inserts, deletes, and updates of rows in the table.