This question is about SQL Server index performance with a varchar(2000) as an INCLUDE column in a covering index.
I’m trying to improve performance in a slow and unstable database application. In some cases, data is accessed through large varchar strings, with the queries including multiple string operations like SUBSTRING(), SPACE(), and DATALENGTH(). Here is a simplified example of the access pattern:
UPDATE FatTable SET col3 =
    SUBSTRING(col3, 1, 10) + '*' +
    SUBSTRING(col3, 12, DATALENGTH(col3) - 12)
FROM FatTable WHERE SUBSTRING(col3, 10, 1) = 'A' AND col2 = 2
The schema looks like this:
CREATE TABLE [dbo].[FatTable](
[id] [bigint] IDENTITY(1,1) NOT NULL,
[col1] [nchar](12) NOT NULL,
[col2] [int] NOT NULL,
[col3] [varchar](2000) NOT NULL, ...
The following index has been defined, with the large text column as an included (covering) column:
CREATE NONCLUSTERED INDEX [IndexCol2Col3] ON [dbo].[FatTable] ( [col2] ASC )
INCLUDE( [col3] )
From what I’ve read, it is BAD to put large data fields in an index. I’ve been reading several articles, including http://msdn.microsoft.com/en-us/library/ms190806.aspx, which discuss the impact of paging and disk size on index performance. That said, the query plan definitely uses the covering index. I don’t have enough information to determine how much this is actually costing me in terms of system load. I do know that, overall, the system is performing poorly, and I am concerned that this is one of the issues.
Questions:
- Is putting this varchar(2000) column in the index INCLUDE ever a good idea?
- Since the INCLUDE fields are stored in leaf nodes, do they have much impact on index performance?
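For reference, a minimal sketch of one way the on-disk cost of the included column could be checked, using sys.dm_db_partition_stats with the table name from the schema above (one row per index per partition; figures are approximate):
SELECT i.name AS index_name,
       ps.row_count,
       ps.used_page_count,                        -- 8 KB pages used by this index
       ps.used_page_count * 8 / 1024.0 AS used_mb
FROM sys.dm_db_partition_stats AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id  = ps.index_id
WHERE ps.object_id = OBJECT_ID('dbo.FatTable');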
Update: Thanks for the excellent replies! This is an unfair question in some ways - as you guys say, there is no absolute right answer without actual statistics and profiling. Like so many performance issues, I guess the answer is "it depends".
4 Answers
Ever is a big word, but, in general, no, I wouldn't put a varchar(2000) field into an INCLUDE.
And yeah, the way that data is stored at the page level can seriously impact performance of the index, depending on how the index is used.
The thing is, the more rows of data you can cram into a page, the fewer pages have to get accessed, and the faster your system is, for the most part. Adding a really large column means less information stored on a page, so, in the event of range seeks or scans, more pages have to be read to retrieve the data, seriously slowing stuff down.
To know for sure if this is an issue on your query, or on your system, you'd have to monitor the reads, especially the number of pages that the query uses.
- Thanks Grant. As I mentioned in another comment, good performance info is scarce, hence the abstract question. I don't have experience monitoring the page-level performance costs. My hunch is that it is a problem; I'll see if I can get some stats. – RaoulRubin, Feb 13, 2012 at 19:47
- Setting STATISTICS IO on for the query will tell you a lot; logical reads represents the number of pages accessed. You can also monitor secs/read from the perfmon counters to get general performance info. – Grant Fritchey, Feb 13, 2012 at 22:26
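For example, a rough sketch of that approach against the example query from the question (the numbers reported in the Messages output are what matter, and they will vary):
SET STATISTICS IO ON;
-- wrap in BEGIN TRAN ... ROLLBACK if the test should not actually change data
UPDATE FatTable SET col3 =
    SUBSTRING(col3, 1, 10) + '*' +
    SUBSTRING(col3, 12, DATALENGTH(col3) - 12)
FROM FatTable WHERE SUBSTRING(col3, 10, 1) = 'A' AND col2 = 2;
SET STATISTICS IO OFF;
-- The Messages tab reports something like:
--   Table 'FatTable'. Scan count 1, logical reads <n>, physical reads <n>, ...
-- "logical reads" is the number of 8 KB pages touched; comparing it with and
-- without [col3] in the INCLUDE shows what the wide column actually costs this query.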
Can you review the current clustered index key, and perhaps make col2 the clustered index key instead? This way you get the covering 'include' behavior (since clustered indexes always 'include' everything) without duplicating the data. This is, of course, subject to many ifs and buts, but it is nonetheless perhaps worth considering. Of course, if the current clustered index is enforcing a constraint (primary key, unique), that constraint would have to be moved into a non-clustered index.
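A rough sketch of that restructuring, with assumed constraint/index names (PK_FatTable, CIX_FatTable_Col2) and assuming the current clustered index is the primary key on id:
ALTER TABLE dbo.FatTable DROP CONSTRAINT PK_FatTable;              -- assumed name of the existing clustered PK
CREATE CLUSTERED INDEX CIX_FatTable_Col2 ON dbo.FatTable (col2);   -- col2 need not be unique
ALTER TABLE dbo.FatTable ADD CONSTRAINT PK_FatTable
    PRIMARY KEY NONCLUSTERED (id);                                 -- keep the constraint, now non-clustered
DROP INDEX IndexCol2Col3 ON dbo.FatTable;                          -- the wide covering index becomes redundant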
-
Your suggestion about the PK is a great idea, although I won't be able to apply it in this case - existing PK is necessary for other queries. (This is a technique I will keep in the toolbox!)RaoulRubin– RaoulRubin2012年02月14日 13:22:32 +00:00Commented Feb 14, 2012 at 13:22
It is hard to answer. It will all depend on your read:write ratio. Have you tested a workload or simulated an entire business cycle on a test system, with and without the included column? The lookup without it may cost a lot, but if you're updating the data more often than you're reading it, that might be ok.
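If capturing a realistic workload is difficult, one rough proxy for the read:write ratio is the index usage DMV (counters reset when the instance restarts), e.g.:
SELECT i.name AS index_name,
       us.user_seeks, us.user_scans, us.user_lookups,  -- reads served by the index
       us.user_updates                                  -- writes that had to maintain it
FROM sys.dm_db_index_usage_stats AS us
JOIN sys.indexes AS i
  ON i.object_id = us.object_id
 AND i.index_id  = us.index_id
WHERE us.database_id = DB_ID()
  AND us.object_id   = OBJECT_ID('dbo.FatTable');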
- Overall, read vs. update is mostly balanced. Organizational and privacy issues make it difficult to get useful statistics and realistic tests. Since we're flying mostly blind, we have to look at things from an abstract point of view (hence this question). Testing would mean pushing changes to production and observing the results - very risky. – RaoulRubin, Feb 13, 2012 at 19:34
- And do most of the reads actually pull this VARCHAR(2000) column, or are you troubleshooting the performance of a very specific query that doesn't represent most of the queries? As Grant suggests, if this column is not used in a lot of queries, or really causes issues for seeks, it will probably be better to pay the price for the lookup when you need it, and not pay for the storage when you don't. Again, it's really hard to tell which side of the fence you should be on, since we don't really have any specifics (and even harder because you can't test - you should strive to fix that). – Aaron Bertrand, Feb 13, 2012 at 19:59
I know I am late for this party, but I would index exactly the expressions used for locating rows, such as substring(col3,10,1). If the whole col3 is ever used, I would index CHECKSUM(col3) (understanding that there could be collisions of course).
A VARCHAR(2000) that typically stores just ten characters is one thing; a solid 2,000 bytes per record is something else.
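A rough sketch of the expression-indexing idea above, using hypothetical computed-column and index names (SQL Server can index deterministic computed columns, and the optimizer can match the same expression in a WHERE clause when the required SET options are in effect):
ALTER TABLE dbo.FatTable ADD col3_key  AS SUBSTRING(col3, 10, 1);  -- deterministic, so it can be indexed
ALTER TABLE dbo.FatTable ADD col3_hash AS CHECKSUM(col3);          -- collisions possible, as noted above
CREATE NONCLUSTERED INDEX IX_FatTable_Col2_Col3Key ON dbo.FatTable (col2, col3_key);
CREATE NONCLUSTERED INDEX IX_FatTable_Col3Hash ON dbo.FatTable (col3_hash);
-- A predicate such as WHERE col2 = 2 AND SUBSTRING(col3, 10, 1) = 'A' can then
-- seek on IX_FatTable_Col2_Col3Key, provided the expression in the query matches
-- the computed column definition exactly.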