This question is about SQL Server index performance with a varchar(2000) as an INCLUDE column in a covering index.
I’m trying to improve performance in a slow and unstable database application. In some cases, data is accessed through large varchar strings, with the queries including multiple string operations like SUBSTRING(), SPACE(), and DATALENGTH(). Here is a simplified example of the access pattern:
UPDATE FatTable SET col3 =
    SUBSTRING(col3, 1, 10) + '*' +
    SUBSTRING(col3, 12, DATALENGTH(col3) - 12)
FROM FatTable WHERE SUBSTRING(col3, 10, 1) = 'A' AND col2 = 2
The schema looks like this:
CREATE TABLE [dbo].[FatTable](
[id] [bigint] IDENTITY(1,1) NOT NULL,
[col1] [nchar](12) NOT NULL,
[col2] [int] NOT NULL,
[col3] [varchar](2000) NOT NULL, ...
The following index has been defined, with the large text column as an included (covering) column:
CREATE NONCLUSTERED INDEX [IndexCol2Col3] ON [dbo].[FatTable] ( [col2] ASC )
INCLUDE( [col3] )
From what I’ve read, it is BAD to put large data fields in an index. I’ve been reading several articles, including http://msdn.microsoft.com/en-us/library/ms190806.aspx, which discuss the impact of paging and disk size on index performance. That said, the query plan definitely uses the covering index. I don’t have enough information to determine how much this is actually costing me in terms of system load. I do know that, overall, the system is performing poorly, and I am concerned that this is one of the issues.
Questions:
- Is putting this varchar(2000) column in the index INCLUDE ever a good idea?
- Since the INCLUDE fields are stored in leaf nodes, do they have much impact on index performance?
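For reference, a minimal sketch of one way the on-disk cost of the included column could be checked, using sys.dm_db_partition_stats with the table name from the schema above (one row per index per partition; figures are approximate):
SELECT i.name AS index_name,
       ps.row_count,
       ps.used_page_count,                        -- 8 KB pages used by this index
       ps.used_page_count * 8 / 1024.0 AS used_mb
FROM sys.dm_db_partition_stats AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id
WHERE ps.object_id = OBJECT_ID('dbo.FatTable');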
Update: Thanks for the excellent replies! This is an unfair question in some ways - as you guys say, there is no absolute right answer without actual statistics and profiling. Like so many performance issues, I guess the answer is "it depends".
4 Answers
Ever is a big word, but, in general, no, I wouldn't put a varchar(2000) field into an INCLUDE.
And yeah, the way that data is stored at the page level can seriously impact performance of the index, depending on how the index is used.
The thing is, the more rows of data you can cram into a page, the fewer pages have to get accessed, and the faster your system is, for the most part. Adding a really large column means less information stored on a page, so, in the event of range seeks or scans, more pages have to be read to retrieve the data, seriously slowing stuff down.
To know for sure if this is an issue on your query, or on your system, you'd have to monitor the reads, especially the number of pages that the query uses.
- Thanks Grant. As I mentioned in another comment, good performance info is scarce, hence the abstract question. I don't have experience monitoring the page-level performance costs. My hunch is that it is a problem; I'll see if I can get some stats. – RaoulRubin, Feb 13, 2012 at 19:47
- Setting STATISTICS IO on for the query will tell you a lot; logical reads represents the number of pages accessed. You can also monitor secs/read from the perfmon counters to get general performance info. – Grant Fritchey, Feb 13, 2012 at 22:26
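For example, a rough sketch of that approach against the example query from the question (the numbers reported in the Messages output are what matter, and they will vary):
SET STATISTICS IO ON;
-- wrap in BEGIN TRAN ... ROLLBACK if the test should not actually change data
UPDATE FatTable SET col3 =
    SUBSTRING(col3, 1, 10) + '*' +
    SUBSTRING(col3, 12, DATALENGTH(col3) - 12)
FROM FatTable WHERE SUBSTRING(col3, 10, 1) = 'A' AND col2 = 2;
SET STATISTICS IO OFF;
-- The Messages tab reports something like:
--   Table 'FatTable'. Scan count 1, logical reads <n>, physical reads <n>, ...
-- "logical reads" is the number of 8 KB pages touched; comparing it with and
-- without [col3] in the INCLUDE shows what the wide column actually costs this query.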
Can you review the current clustered index key, and perhaps make col2 the clustered index key instead? This way you get the covering 'include' behavior (since clustered indexes always 'include' everything) without duplicating the data. This is, of course, subject to many ifs and buts, but it is nonetheless perhaps worth considering. Of course, if the current clustered index is enforcing a constraint (primary key, unique), that constraint would have to be moved into a non-clustered index.
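A rough sketch of that restructuring, with assumed constraint/index names (PK_FatTable, CIX_FatTable_Col2) and assuming the current clustered index is the primary key on id:
ALTER TABLE dbo.FatTable DROP CONSTRAINT PK_FatTable;              -- assumed name of the existing clustered PK
CREATE CLUSTERED INDEX CIX_FatTable_Col2 ON dbo.FatTable (col2);   -- col2 need not be unique
ALTER TABLE dbo.FatTable ADD CONSTRAINT PK_FatTable
    PRIMARY KEY NONCLUSTERED (id);                                 -- keep the constraint, now non-clustered
DROP INDEX IndexCol2Col3 ON dbo.FatTable;                          -- the wide covering index becomes redundant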
-
Your suggestion about the PK is a great idea, although I won't be able to apply it in this case - existing PK is necessary for other queries. (This is a technique I will keep in the toolbox!)RaoulRubin– RaoulRubin2012年02月14日 13:22:32 +00:00Commented Feb 14, 2012 at 13:22
It is hard to answer. It will all depend on your read:write ratio. Have you tested a workload or simulated an entire business cycle on a test system, with and without the included column? The lookup without it may cost a lot, but if you're updating the data more often than you're reading it, that might be ok.
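If capturing a realistic workload is difficult, one rough proxy for the read:write ratio is the index usage DMV (counters reset when the instance restarts), e.g.:
SELECT i.name AS index_name,
       us.user_seeks, us.user_scans, us.user_lookups,  -- reads served by the index
       us.user_updates                                  -- writes that had to maintain it
FROM sys.dm_db_index_usage_stats AS us
JOIN sys.indexes AS i
  ON i.object_id = us.object_id
 AND i.index_id  = us.index_id
WHERE us.database_id = DB_ID()
  AND us.object_id   = OBJECT_ID('dbo.FatTable');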
- Overall, read vs. update is mostly balanced. Organizational and privacy issues make it difficult to get useful statistics and realistic tests. Since we're flying mostly blind, we have to look at things from an abstract point of view (hence this question). Testing would mean pushing changes to production and observing the results - very risky. – RaoulRubin, Feb 13, 2012 at 19:34
- And do most of the reads actually pull this VARCHAR(2000) column, or are you troubleshooting the performance of a very specific query that doesn't represent most of the queries? As Grant suggests, if this column is not used in a lot of queries, or really causes issues for seeks, it will probably be better to pay the price for the lookup when you need it, and not pay for the storage when you don't. Again, it's really hard to tell which side of the fence you should be on, since we don't really have any specifics (and even harder because you can't test - you should strive to fix that). – Aaron Bertrand, Feb 13, 2012 at 19:59
I know I am late for this party, but I would index exactly the expressions used for locating rows, such as substring(col3,10,1). If the whole col3 is ever used, I would index CHECKSUM(col3) (understanding that there could be collisions of course).
A VARCHAR(2000) that typically stores just ten characters is one thing; a solid 2,000 bytes per record is something else.
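A rough sketch of the expression-indexing idea above, using hypothetical computed-column and index names (SQL Server can index deterministic computed columns, and the optimizer can match the same expression in a WHERE clause when the required SET options are in effect):
ALTER TABLE dbo.FatTable ADD col3_key  AS SUBSTRING(col3, 10, 1);  -- deterministic, so it can be indexed
ALTER TABLE dbo.FatTable ADD col3_hash AS CHECKSUM(col3);          -- collisions possible, as noted above
CREATE NONCLUSTERED INDEX IX_FatTable_Col2_Col3Key ON dbo.FatTable (col2, col3_key);
CREATE NONCLUSTERED INDEX IX_FatTable_Col3Hash ON dbo.FatTable (col3_hash);
-- A predicate such as WHERE col2 = 2 AND SUBSTRING(col3, 10, 1) = 'A' can then
-- seek on IX_FatTable_Col2_Col3Key, provided the expression in the query matches
-- the computed column definition exactly.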