Good day, I have the following SQL Server database table: [screenshot of the SensorDataRaw table definition]
Please note the compound primary key. This was done for 3 reasons:
- Prevent duplicate entries
- Improve query performance, as all queries will include all three of those keys.
- We needed an index, and I did not want to introduce a random ID.
Please also note that this table was designed with its size in mind; it is going to store millions and millions of rows of data.
OK, now for my actual question. I am using Azure SQL Server to host this DB, and I have enabled automatic tuning. Strangely enough, I see that it then went and created a new index (see below).
Now, in my mind this seems to be a duplicate index, as the same columns are being indexed.
So I now have two indexes on my table:
Original (My PK):
ALTER TABLE [dbo].[SensorDataRaw] ADD CONSTRAINT [PK_SensorDataRaw] PRIMARY KEY CLUSTERED
(
[DateTime] ASC,
[SensorId] ASC,
[Key] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO
Newly Added (auto-created by Azure tuning):
CREATE NONCLUSTERED INDEX [nci_wi_SensorDataRaw_DC9789077DA75B4440AC8BFE3E2AA198] ON [dbo].[SensorDataRaw]
(
[Key] ASC,
[SensorId] ASC,
[DateTime] ASC
)
INCLUDE ( [Value]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON)
GO
Observations:
- The order of the columns has been reversed in the new index.
- The new index is NOT unique
- The new index includes the Value column.
Please note that my knowledge of indexes is not advanced, hence my asking this.
So my questions are:
- Can someone explain why the newly added index is better than my initially created one?
- How can I remove the two indexes and just create one that covers both cases? With this being such a massive database, I cannot afford the space that both of these indexes will take up.
- Is there maybe a better design alternative?
Additional Info:
I'm assuming the type of queries becomes important here, so I have listed some examples.
All queries include DateTime, SensorId, and Key.
Simple Queries:
Select SensorId where the average Value for Key w is greater than x and the time is between (y, z).
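In T-SQL terms that would be roughly the sketch below (the parameter names @KeyW, @Threshold, @From and @To are placeholders used for illustration, not the actual query):

-- Sketch only: parameter names are illustrative placeholders.
SELECT [SensorId]
FROM [dbo].[SensorDataRaw]
WHERE [Key] = @KeyW
  AND [DateTime] BETWEEN @From AND @To
GROUP BY [SensorId]
HAVING AVG([Value]) > @Threshold;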
Graphing Data:
SELECT AVG([Value]) AS 'AvgValue',
       DATEADD(MINUTE,
               (DATEDIFF(MINUTE, '1990-01-01T00:00:00', [dbo].[SensorDataRaw].[DateTime]) / @IntervalInMinutes) * @IntervalInMinutes,
               '1990-01-01T00:00:00'
       ) AS 'TimeGroup'
FROM [dbo].[SensorDataRaw]
WHERE [dbo].[SensorDataRaw].[SensorId] = @SensorId
  AND [dbo].[SensorDataRaw].[Key] = @KeyValue
  AND [dbo].[SensorDataRaw].[DateTime] BETWEEN @DateFrom AND @DateTo
  AND [dbo].[SensorDataRaw].[Value] IS NOT NULL
GROUP BY (DATEDIFF(MINUTE, '1990-01-01T00:00:00', [dbo].[SensorDataRaw].[DateTime]) / @IntervalInMinutes)
3 Answers
The index suggested by the system is a much better fit for the query you have shown. You should aim to have columns with equality predicates as the leading columns.
Consider a phone book ordered by lastname, firstname. If your requirement is to find all people with surnames between "Brown" and "Yates" and a first name of "John", then you need to read most of the phone book. If the phone book was instead ordered by firstname, lastname, you could easily find the "John" section and the first "Brown" in that section; then all you need to do is read the names until the lastname is after "Yates" or a new firstname is encountered.
It might not be the ideal index. Potentially, though, you should just change the key columns in the clustered index to this order rather than creating a new one (a sketch of that is below). You need to evaluate this based on knowledge of your workload.
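A minimal sketch of that change, assuming nothing else (foreign keys, replication, etc.) depends on the existing PK; dropping and recreating the clustered index rebuilds the whole table, so on a table this size it needs a proper maintenance window:

ALTER TABLE [dbo].[SensorDataRaw] DROP CONSTRAINT [PK_SensorDataRaw];
GO
-- Same three key columns, same uniqueness, just a different key order.
ALTER TABLE [dbo].[SensorDataRaw] ADD CONSTRAINT [PK_SensorDataRaw] PRIMARY KEY CLUSTERED
(
    [Key] ASC,
    [SensorId] ASC,
    [DateTime] ASC
);
GO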
- What you have read is incomplete then. Once you have read my answer you will have read something different. Column order is only irrelevant if all of them are in equality predicates. – Martin Smith, Dec 27, 2017 at 11:34
- The query you have shown isn't searching for exact datetimes. It is searching for a range, and it needs to read all the irrelevant rows for other sensors and keys in that range. An index ordered by [Key], [SensorId], [DateTime] can be efficiently navigated to the start of the exact range needed by the query and read exactly the needed rows. – Martin Smith, Dec 27, 2017 at 11:39
- @sepupic - the exact same hotspot and fragmentation will occur with a nonclustered index on those columns. The proposed NCI is basically a copy of the whole table (just missing one column); if the CI is basically useless for other queries, having a nicely unfragmented CI and the same amount of fragmentation on the NCI is not a benefit. – Martin Smith, Dec 27, 2017 at 11:47
- @sepupic - yes, they can preserve the PK but just re-order the columns; this doesn't affect the uniqueness aspect. But as I say, the OP needs to evaluate this against their whole workload (including inserts). – Martin Smith, Dec 27, 2017 at 11:52
- Having a separate insert point for every (key, sensor) won't lead to bad fragmentation, so long as each (key, sensor) pair has thousands of rows. See, e.g., blogs.msdn.microsoft.com/dbrowne/2012/06/25/… – David Browne - Microsoft, Dec 27, 2017 at 14:14
Can someone explain why the newly added index is better than my initially created one?
Your index is a clustered one, which means it includes all the data at the leaf level.
The index offered by the server is narrower than the clustered index, as it contains only the key fields + the included field at the leaf level.
Your query needs only the key fields + the Value field, so the nonclustered index offered by the server is sufficient, and it is chosen by the server as the narrowest possible covering index.
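If you want to see that difference in width on disk, a sketch like the following (using the standard sys.indexes and sys.dm_db_partition_stats views; the table name is taken from the question) compares the size of the two indexes:

SELECT i.name AS index_name,
       SUM(ps.used_page_count) * 8 / 1024 AS size_mb
FROM sys.indexes AS i
JOIN sys.dm_db_partition_stats AS ps
    ON ps.object_id = i.object_id
   AND ps.index_id = i.index_id
WHERE i.object_id = OBJECT_ID(N'dbo.SensorDataRaw')
GROUP BY i.name;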
How can I remove the two indexes and just create one that covers both cases.
You can remove the nonclustered index; your clustered index is of course a covering one, and if it has only 2 additional fields (datetime2(2) + decimal(19,4)) compared to the nonclustered index, there will be no huge overhead when using it instead of the nonclustered one.
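For example, dropping the auto-created index (using the generated name from the question) would just be:

DROP INDEX [nci_wi_SensorDataRaw_DC9789077DA75B4440AC8BFE3E2AA198] ON [dbo].[SensorDataRaw];
GO

Bear in mind that if automatic tuning is still allowed to create indexes, it may recommend (or recreate) a similar index again if it keeps seeing a benefit.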
- IMO, you need to revise your answer: in one place you suggest that the system index is better, and in the last paragraph you suggest removing the nonclustered index. Also, could you reply to my query which I asked above? – KumarHarsh, Dec 27, 2017 at 10:37
- @KumarHarsh If the OP wants to keep both indexes, why not? But if he asks "How can I remove the two indexes and just create one that covers both cases", the answer is to remove the nonclustered index. – sepupic, Dec 27, 2017 at 10:44
- No, the main question is which one of the two is better, and if the system index is not good, then how to remove it? – KumarHarsh, Dec 27, 2017 at 10:53
- The system index suits this query better; where did you see "system index is not good"? But the clustered index covers all the queries, not only this one. Besides, it's the PK, so if one of them should be removed, it's the nonclustered one that should be removed. – sepupic, Dec 27, 2017 at 10:56
- >>> how come the system index is good even for the given query, see "value" is used in the where clause <<< And what? It's used as [Value] IS NOT NULL. There is no difference whether this condition is included in the query or not: Value is used in the aggregation (AVG) and NULL values are simply ignored. I suppose this condition was added only to avoid the warning "Null value is eliminated by an aggregate". – sepupic, Dec 27, 2017 at 11:14
The suggested index is narrower, which will make it more efficient when the query only deals with columns covered by that index.
The column order is also different, which MAY make it more efficient for that particular query (I can't tell for sure without seeing an execution plan). Column order DOES make a difference.
Which is better? The non-clustered index may be better for that particular query but won't cover everything. Besides, the clustered index can't be removed, and I don't see any problem with keeping both.
Try experimenting with the two: force the clustered index and check the execution plan and IO stats, then force the non-clustered index and compare.
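A sketch of that comparison, reusing the predicates from the graphing query and the index names from the question (not a definitive test harness):

SET STATISTICS IO ON;

-- Run once forcing the clustered PK:
SELECT AVG([Value])
FROM [dbo].[SensorDataRaw] WITH (INDEX([PK_SensorDataRaw]))
WHERE [SensorId] = @SensorId
  AND [Key] = @KeyValue
  AND [DateTime] BETWEEN @DateFrom AND @DateTo;

-- Run again forcing the auto-created nonclustered index,
-- then compare the logical reads reported in the Messages tab:
SELECT AVG([Value])
FROM [dbo].[SensorDataRaw] WITH (INDEX([nci_wi_SensorDataRaw_DC9789077DA75B4440AC8BFE3E2AA198]))
WHERE [SensorId] = @SensorId
  AND [Key] = @KeyValue
  AND [DateTime] BETWEEN @DateFrom AND @DateTo;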
- I like the idea of forcing the index on the query and then checking the execution plans for it. With regards to keeping both: there will be a massive storage drawback, as data will be duplicated in both indexes. – Zapnologica, Dec 27, 2017 at 11:39
- It'll depend on the size of the table; indexes are always going to be a pay-off between overhead and performance benefit. I'd test the two out, see how they perform, and then decide whether the non-clustered index is worth having or not. – David Fowler, Dec 27, 2017 at 11:59
- The design spec is that the table will grow to several terabytes. In a case like this, I don't think the fastest choice will always be the most practical choice. Generally with time-series data, read queries are a lot less frequent than inserts and deletes, etc. I think what would be key here is more of a consistent response time with regards to select queries. With such a large table, one would probably have to add some sort of cache layer on the data in order to achieve blitz-fast localized queries. – Zapnologica, Dec 27, 2017 at 12:04
Note: the DateTimeServer field is totally irrelevant to this question. It's a field that is there for debugging; it is not used in ANY queries.