Good day, I have the following SQL Server database table: [screenshot of the SensorDataRaw table definition]
Please note the compound primary key. This was done for 3 reasons:
- Prevent duplicate entries
- Improve query performance, as all queries will include all three of those keys.
- We needed an index, and I did not want to introduce a random ID.
Please also note that this table was designed with its size in mind; it is going to store millions and millions of rows of data.
OK, now for my actual question. I am using Azure SQL Server to host this DB, and I have enabled automatic tuning. Strangely enough, I see that it then went and created a new index (see below).
Now, in my mind this seems to be a duplicate index, as the same columns are being indexed.
So I now have two indexes on my table:
Original (My PK):
ALTER TABLE [dbo].[SensorDataRaw] ADD CONSTRAINT [PK_SensorDataRaw] PRIMARY KEY CLUSTERED
(
[DateTime] ASC,
[SensorId] ASC,
[Key] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO
Newly Added (auto-created by Azure tuning):
CREATE NONCLUSTERED INDEX [nci_wi_SensorDataRaw_DC9789077DA75B4440AC8BFE3E2AA198] ON [dbo].[SensorDataRaw]
(
[Key] ASC,
[SensorId] ASC,
[DateTime] ASC
)
INCLUDE ( [Value]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO
Observations:
- The order of the columns has been reversed in the new index.
- The new index is NOT unique
- The new index includes the Value column.
Please note that my knowledge of indexes is not advanced, hence my asking this.
So my questions are:
- Can someone explain why the newly added index is better than my initially created one?
- How can I remove the two indexes and just create one that covers both cases? With this being such a massive database, I cannot afford the space that both of these indexes will take up.
- Is there maybe a better design alternative?
Additional Info:
I'm assuming the type of queries becomes important here, so I have listed some examples.
All queries include DateTime, SensorId, and Key.
Simple Queries:
Select SensorId where the average Value for Key w is greater than x and the time is between (y, z).
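In T-SQL terms that would be roughly the sketch below (the parameter names @KeyW, @Threshold, @From and @To are placeholders used for illustration, not the actual query):

-- Sketch only: parameter names are illustrative placeholders.
SELECT [SensorId]
FROM [dbo].[SensorDataRaw]
WHERE [Key] = @KeyW
  AND [DateTime] BETWEEN @From AND @To
GROUP BY [SensorId]
HAVING AVG([Value]) > @Threshold;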
Graphing Data:
SELECT AVG([Value]) AS 'AvgValue',
       DATEADD(MINUTE,
               (DATEDIFF(MINUTE, '1990-01-01T00:00:00', [dbo].[SensorDataRaw].[DateTime]) / @IntervalInMinutes) * @IntervalInMinutes,
               '1990-01-01T00:00:00'
       ) AS 'TimeGroup'
FROM [dbo].[SensorDataRaw]
WHERE [dbo].[SensorDataRaw].[SensorId] = @SensorId
  AND [dbo].[SensorDataRaw].[Key] = @KeyValue
  AND [dbo].[SensorDataRaw].[DateTime] BETWEEN @DateFrom AND @DateTo
  AND [dbo].[SensorDataRaw].[Value] IS NOT NULL
GROUP BY (DATEDIFF(MINUTE, '1990-01-01T00:00:00', [dbo].[SensorDataRaw].[DateTime]) / @IntervalInMinutes)
3 Answers
The index suggested by the system is a much better fit for the query you have shown. You should aim to have columns with equality predicates as the leading columns.
Consider a phone book ordered by lastname, firstname. If your requirement is to find all people with surnames between "Brown" and "Yates" and a first name of "John", then you need to read most of the phone book. If the phone book was instead ordered by firstname, lastname, you could easily find the "John" section and the first "Brown" in that section; then all you need to do is read the names until the lastname is after "Yates" or a new firstname is encountered.
It might not be the ideal index. Potentially, though, you should just change the key columns in the clustered index to this order rather than creating a new one (a sketch of that is below). You need to evaluate this based on knowledge of your workload.
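A minimal sketch of that change, assuming nothing else (foreign keys, replication, etc.) depends on the existing PK; dropping and recreating the clustered index rebuilds the whole table, so on a table this size it needs a proper maintenance window:

ALTER TABLE [dbo].[SensorDataRaw] DROP CONSTRAINT [PK_SensorDataRaw];
GO
-- Same three key columns, same uniqueness, just a different key order.
ALTER TABLE [dbo].[SensorDataRaw] ADD CONSTRAINT [PK_SensorDataRaw] PRIMARY KEY CLUSTERED
(
    [Key] ASC,
    [SensorId] ASC,
    [DateTime] ASC
);
GO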
- What you have read is incomplete then. Once you have read my answer you will have read something different. Column order is only irrelevant if all of them are in equality predicates. – Martin Smith, Dec 27, 2017 at 11:34
- The query you have shown isn't searching for exact datetimes. It is searching for a range, and it needs to read all the irrelevant rows for other sensors and keys in that range. An index ordered by [Key], [SensorId], [DateTime] can be efficiently navigated to the start of the exact range needed by the query and read exactly the needed rows. – Martin Smith, Dec 27, 2017 at 11:39
- @sepupic - the exact same hotspot and fragmentation will occur with a nonclustered index on those columns. The proposed NCI is basically a copy of the whole table (just missing one column); if the CI is basically useless for other queries, having a nicely unfragmented CI and the same amount of fragmentation on the NCI is not a benefit. – Martin Smith, Dec 27, 2017 at 11:47
- @sepupic - yes, they can preserve the PK but just re-order the columns; this doesn't affect the uniqueness aspect. But as I say, the OP needs to evaluate this against their whole workload (including inserts). – Martin Smith, Dec 27, 2017 at 11:52
- Having a separate insert point for every (key, sensor) won't lead to bad fragmentation, so long as each (key, sensor) pair has thousands of rows. See, e.g., blogs.msdn.microsoft.com/dbrowne/2012/06/25/… – David Browne - Microsoft, Dec 27, 2017 at 14:14
Can someone explain why the newly added index is better than my initially created one?
Your index is a clustered one, which means it includes all the data at the leaf level.
The index offered by the server is narrower than the clustered index, as it contains only the key fields + the included field at the leaf level.
Your query needs only the key fields + the Value field, so the nonclustered index offered by the server is sufficient, and it is chosen by the server as the narrowest possible covering index.
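If you want to see that difference in width on disk, a sketch like the following (using the standard sys.indexes and sys.dm_db_partition_stats views; the table name is taken from the question) compares the size of the two indexes:

SELECT i.name AS index_name,
       SUM(ps.used_page_count) * 8 / 1024 AS size_mb
FROM sys.indexes AS i
JOIN sys.dm_db_partition_stats AS ps
    ON ps.object_id = i.object_id
   AND ps.index_id = i.index_id
WHERE i.object_id = OBJECT_ID(N'dbo.SensorDataRaw')
GROUP BY i.name;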
How can I remove the two indexes and just create one that covers both cases.
You can remove the nonclustered index; your clustered index is of course a covering one, and if it has only 2 additional fields (datetime2(2) + decimal(19,4)) compared to the nonclustered index, there will be no huge overhead when using it instead of the nonclustered one.
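For example, dropping the auto-created index (using the generated name from the question) would just be:

DROP INDEX [nci_wi_SensorDataRaw_DC9789077DA75B4440AC8BFE3E2AA198] ON [dbo].[SensorDataRaw];
GO

Bear in mind that if automatic tuning is still allowed to create indexes, it may recommend (or recreate) a similar index again if it keeps seeing a benefit.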
- IMO, you need to revise your answer: in one place you suggest that the system index is better, and in the last paragraph you suggest removing the nonclustered index. Also, could you reply to my query which I asked above? – KumarHarsh, Dec 27, 2017 at 10:37
- @KumarHarsh If the OP wants to keep both indexes, why not? But if he asks "How can I remove the two indexes and just create one that covers both cases", the answer is to remove the nonclustered index. – sepupic, Dec 27, 2017 at 10:44
- No, the main question is which one of the two is better, and if the system index is not good, then how to remove it? – KumarHarsh, Dec 27, 2017 at 10:53
- The system index suits this query better; where did you see "system index is not good"? But the clustered index covers all the queries, not only this one. Besides, it's the PK, so if one of them should be removed, it's the nonclustered one that should be removed. – sepupic, Dec 27, 2017 at 10:56
- >>> how come the system index is good even for the given query, see "value" is used in the where clause <<< And what? It's used as [Value] IS NOT NULL. There is no difference whether this condition is included in the query or not: Value is used in the aggregation (AVG) and NULL values are simply ignored. I suppose this condition was added only to avoid the warning "Null value is eliminated by an aggregate". – sepupic, Dec 27, 2017 at 11:14
The suggested index is narrower, which will make it more efficient when the query only deals with columns covered by that index.
The column order is also different, which MAY make it more efficient for that particular query (I can't tell for sure without seeing an execution plan). Column order DOES make a difference.
Which is better? The non-clustered index may be better for that particular query but won't cover everything. Besides, the clustered index can't be removed, and I don't see any problem with keeping both.
Try experimenting with the two: force the clustered index and check the execution plan and IO stats, then force the non-clustered index and compare.
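A sketch of that comparison, reusing the predicates from the graphing query and the index names from the question (not a definitive test harness):

SET STATISTICS IO ON;

-- Run once forcing the clustered PK:
SELECT AVG([Value])
FROM [dbo].[SensorDataRaw] WITH (INDEX([PK_SensorDataRaw]))
WHERE [SensorId] = @SensorId
  AND [Key] = @KeyValue
  AND [DateTime] BETWEEN @DateFrom AND @DateTo;

-- Run again forcing the auto-created nonclustered index,
-- then compare the logical reads reported in the Messages tab:
SELECT AVG([Value])
FROM [dbo].[SensorDataRaw] WITH (INDEX([nci_wi_SensorDataRaw_DC9789077DA75B4440AC8BFE3E2AA198]))
WHERE [SensorId] = @SensorId
  AND [Key] = @KeyValue
  AND [DateTime] BETWEEN @DateFrom AND @DateTo;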
- I like the idea of forcing the index on the query and then checking the execution plans for it. With regards to keeping both: there will be a massive storage drawback, as data will be duplicated in both indexes. – Zapnologica, Dec 27, 2017 at 11:39
- It'll depend on the size of the table; indexes are always going to be a pay-off between overhead and performance benefit. I'd test the two out, see how they perform, and then decide whether the non-clustered index is worth having or not. – David Fowler, Dec 27, 2017 at 11:59
- The design spec is that the table will grow to several terabytes. In a case like this, I don't think the fastest choice will always be the most practical choice. Generally with time-series data, read queries are a lot less frequent than inserts and deletes, etc. I think what would be key here is more of a consistent response time with regards to select queries. With such a large table, one would probably have to add some sort of cache layer on the data in order to achieve blitz-fast localized queries. – Zapnologica, Dec 27, 2017 at 12:04
Note: the DateTimeServer field is totally irrelevant to this question. It's a field that is there for debugging; it is not used in ANY queries.