I have observed (and reproduced) the following scenario with SQL Server 2022.
The pattern in use
- code is executed via sp_executesql (no stored procedure is involved)
- The first query selects data into a temporary table
- A DDL statement then creates a clustered index on the temporary table. The temporary table is definitely NOT cacheable-- first, this isn't a module (stored procedure or function), and second, we're creating an index after the temp table is populated. So I would not expect statistics left behind on a cached temporary object to be involved at all here.
- A query selects data from the temporary table. This query gets FULL optimization each time (not a TRIVIAL plan)
This batch can run for both small and larger datasets, so that temp table can have 1 row in it or many thousands of rows.
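For illustration, here is a minimal sketch of that shape (hypothetical table and column names; this is not the actual setup or repro script):

```sql
DECLARE @sql nvarchar(max) = N'
    SELECT c.CustomerId, c.Region
    INTO   #work                          -- temp table created inside the batch
    FROM   dbo.Customers AS c
    WHERE  c.Region = @Region;            -- 1 row or many thousands, depending on @Region

    CREATE CLUSTERED INDEX cx_work        -- DDL after the table is populated
        ON #work (CustomerId);

    SELECT w.CustomerId, o.OrderTotal     -- the SELECT that gets FULL optimization each time
    FROM   #work AS w
    JOIN   dbo.Orders AS o
        ON o.CustomerId = w.CustomerId;';

EXEC sys.sp_executesql
    @sql,
    N'@Region nvarchar(30)',
    @Region = N'SmallRegion';
```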
This behavior normally occurs on a readable secondary, where the Query Store is not writable and automatic plan forcing is not a factor.
I have verified that I can reproduce the behavior against the primary replica as well. (Automatic plan correction was told to ignore the query, and I confirmed there was no plan forcing on the primary when I reproduced it.)
Repro script
- Setup script - I ran this on SQL Server 2022 CU15. This turns off query store and uses compat level 130.
- Repro query - I've been running this via SQL Query Stress so I can easily run it concurrently on one or more threads
- Plan generation num and temp tables - A very simple query to observe plan_generation_num in sys.dm_exec_query_stats ("A sequence number that can be used to distinguish between instances of plans after a recompile.") and the current list of temp tables
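Something along these lines (a sketch, not the exact monitoring script; the #work text filter is a placeholder from the pattern sketch above):

```sql
-- plan_generation_num for the cached statement(s) in the repro batch
SELECT qs.plan_generation_num,
       qs.execution_count,
       SUBSTRING(st.text, 1, 200) AS query_text
FROM   sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
WHERE  st.text LIKE N'%#work%';

-- current temp tables
SELECT name, object_id, create_date
FROM   tempdb.sys.objects
WHERE  name LIKE N'#%';
```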
What normally happens-- and the behavior I expect
Normally, changing a large number of rows in the temporary table between query executions automatically causes recompiles, and I'll see that the query selecting data from the temporary table has a row estimate matching the rows in the temp table.
When this works as expected, performance is fine.
With the repro query: If I clear the plan cache, then run the repro query for 40 iterations on a single thread in SQL Query Stress, plan_generation_num ends up being 82. When sampling query plans with sp_WhoIsActive, row estimates for queries against the temp table match the number of rows in the temp table, as expected.
What sometimes happens -- and looks like a bug to me
On rare occasions, I see that a plan is in use with a 1 row estimate for the temp table, but a very large number of rows are actually in the temp table. A LOT of rows have changed, but it didn't automatically recompile:
This leads to very slow performance because the low estimate plan decides to use a nested loop without prefetching, which makes it a CPU burner.
With the repro query: If I clear the plan cache, then run the repro query for 20 iterations on 2 threads in SQL Query Stress, plan_generation_num ends up less than 82-- it varies by run, but might be 72 or 59, indicating fewer recompiles. While this is running, I can also catch moments with sp_WhoIsActive where a single row is estimated but many more rows are in the temporary table. Screenshot:
not the query plan I expected-- why only a 1 row estimate?
I can only reproduce this when running repro code on multiple concurrent sessions
I have not been able to reproduce this behavior with a single session in SQL Server. The only way I can reproduce this is to set up a block of code that:
- Executes at least 1 iteration of the sp_executesql statement that has 1 row in the temp table
- Then executes 1 iteration of the sp_executesql statement that has a lot more rows in the temp table
If I run this in a single session, I have not been able to reproduce the problems. But if I run this concurrently in four or five sessions, I'll be able to occasionally get the "THAT DIDN'T RECOMPILE LIKE IT SHOULD HAVE" issue to pop up. (Note: using SQL Query Stress, I can repro this with only 2 sessions/iterations.)
This feels like a bug to me, and I'm curious if anyone else has seen it. Recompile and stats behavior with temp tables is super complex though, so there may be some nuance that I'm missing about how this works with non-cacheable temp tables.
PS: I do think cacheable temp tables are generally better. I'm just trying to figure out why this behavior would happen in a non-cacheable temp table scenario at this point.
Workarounds
After adding OPTION (RECOMPILE) to the query, I can no longer reproduce the reuse of the 1 row plan when querying the temp table. This is sufficient, but I'm puzzled why it is necessary.
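For reference, that means putting the hint on the statement that reads from the temp table, e.g. (hypothetical names again):

```sql
SELECT w.CustomerId, o.OrderTotal
FROM   #work AS w
JOIN   dbo.Orders AS o
    ON o.CustomerId = w.CustomerId
OPTION (RECOMPILE);   -- fresh statement-level compile on every execution
```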
1 Answer
Brief Summary
With two concurrent executions of the same statement, one private runtime copy of the statement plan can recompile and update the shared recompilation threshold (RT).
The second concurrent execution sees the updated shared RT and deduces that a recompilation is unnecessary. It reuses the original plan as stored in its own private runtime copy.
The precise details are specific to temporary tables. Overall, it is a variety of the Parameter Sensitivity Problem, where the 'parameter' is the temporary table.
Scenario
Let's say two sessions A and B execute your repro at roughly the same time.
They both run the sp_executesql prepared batch with one row. Nothing especially interesting happens here.
Both sessions then move on to the second execution with a much larger number of rows.
Let's say session A starts first. As normal, it uses a cached execution context (MXC) or derives a new one from the cached plan (CP).
Remember, a CP is for the whole batch and may therefore contain multiple statements and executable plans, each of which may independently recompile (statement level recompilation). Temporary tables are referenced by name in the CP so they can be reused. The CP is a public and re-entrant structure.
An MXC is a light copy of the CP used by a single session with all the details for a specific execution filled in. These details include local variable & parameter values, temporary table IDs (not names) and the like.
The MXC is the specific cookie you're going to eat right now, modelled from the CP cookie-cutter, if you will.
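If it helps to see these objects, one way to peek at a compiled plan and its dependent execution contexts in cache is something like the sketch below (the text filter is a placeholder; dependent rows show up with cacheobjtype = 'Executable Plan'):

```sql
SELECT cp.plan_handle,
       cp.cacheobjtype,                     -- 'Compiled Plan' (the CP)
       cp.usecounts,
       dep.cacheobjtype AS dependent_type,  -- 'Executable Plan' (an execution context / MXC)
       dep.usecounts    AS dependent_usecounts
FROM   sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
OUTER APPLY sys.dm_exec_cached_plan_dependent_objects(cp.plan_handle) AS dep
WHERE  st.text LIKE N'%#work%';
```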
Second
Meanwhile, session B also starts on the second prepared batch execution. It also uses a cached MXC (or derives a new one).
Back with session A. Having created its temporary table and index in an interesting but uneventful fashion, it moves on to running the SELECT at issue. Session A decides a statement level recompile is in order, does that, runs the SELECT (with accurate row counts) and finishes.
Session B now does much the same thing but decides that a recompile is not needed. It continues with the unmodified MXC based on the one-row case and you see your problem manifest. But why?
Recompilation
Well, the SELECT doesn't actually recompile due to updated statistics. It never does, not for session A or B. These statistics are new and unmodified each time, as far as they are concerned. It's a new and uncached temporary table, after all.
Now, the SELECT does in fact recompile, of course, and the stated reason indicates "statistics changed". The explanation is that even when no interesting statistics are found (or they are found not to have changed, as here) SQL Server still performs one more test before deciding a recompile isn't needed.
That final test is 'threshold crossing' based solely on the total number of rows in the table. Quoting from Batch Compilation, Recompilation, and Plan Caching Issues in SQL Server 2005:
If a table or an indexed view T has no statistic on it, or none of the existing statistics on T are considered "interesting" during a query compilation, the following threshold-crossing test, based purely on T's cardinality, is still performed:
ABS(card(current) - card(snapshot)) >= RT
card(current) denotes the number of rows in T at present, and card(snapshot) denotes the row count when the query plan was last compiled.
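To put hypothetical numbers on that: the same whitepaper gives RT = 6 for a temporary table whose cardinality at compile time was below six rows. So if the statement was compiled when the temp table held 1 row and it now holds 10,000, the test is ABS(10,000 - 1) = 9,999 >= 6, and a recompilation would normally result.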
Session B knows how many rows it has in its local temporary table. That is much greater than one, so why doesn't the threshold test cause a recompilation? There is a reason, but it will take me a paragraph or two.
Details
There can be many simultaneous executions of a CP via MXCs. This is just the routine situation of a cached plan being run from multiple sessions at the same time.
If each session compared the known current row count for a table against the cached count in its private MXC, an excessive number of recompilations would result. Each concurrent execution would see the threshold exceeded, perform a statement level recompilation, and replace the current statement plan held in the CP.
Worse, the resulting statement's plan would likely be the same for each recompilation because permanent tables tend not to change extremely rapidly (within the timeframe of a single statement-level recompile anyway). Overall, this scheme would be highly inefficient and wasteful in the case of non-temporary tables.
So instead of that, the first concurrent execution performs its recompile and updates the CP with the new cardinality value. The other concurrent executions then compare the known cardinality with this updated value and are unlikely to cause a further statement recompilation as a result.
Further, since this is a plan optimality recompile (not a plan correctness recompile) the other executions continue with the cached plan they already had. They'll pick up the new statement plan next time around.
Temporary Inconvenience
This arrangement does not suit temporary tables so well because each session has a different 'copy' of the common CP temporary table with legitimately very different row counts.
Let's see what happens:
Session B compares its known temporary table row count (big) with the current value stored in the CP. But that CP value has just been updated (now also big) by session A after it performed its statement level recompilation!
So, session B's MXC and the shared CP's threshold numbers match and no recompile reason is determined as a result. You end up with the one-row estimate plan (from session B's MXC that didn't recompile) encountering many more rows in the current instance of the temporary table.
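Continuing the hypothetical numbers from above: session A recompiles and the snapshot cardinality recorded in the CP becomes roughly 10,000. Session B's own temporary table also holds roughly 10,000 rows, so its test is ABS(10,000 - 10,000) = 0, which crosses no threshold, and B happily keeps the statement plan that was compiled for 1 row.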
Not Ideal
Now, this is arguably not an ideal situation.
One might argue that temporary tables (only!) should check their known current cardinality against the MXC cached value (not the shared CP) because no other session can affect its private temporary table.
Ok, but if we recompile, where do we put the new statement plan? It might be wrong to replace the single shared version in the CP because other sessions have temporary tables with very different counts and may need different plans for good performance. Who's to say whose temporary table is better and more representative of the common case? If we throw the new plan away, we've just applied a weird sort of recompile hint without anyone asking us to.
This conundrum doesn't arise with permanent tables because there's only one of them, with a fairly static row count.
One might argue that session id could be added to the CP cache key, so each session gets its own tailored plan, much as happens with temporary tables created outside the batch they are referenced in. Well, no thank you; such cases already produce quite enough plan cache pollution as it is. And don't forget that would apply to the entire batch of statements, not just the tricky one.
Or perhaps temporary tables should treat the 'optimality' recompile as if it were a 'correctness' recompile, meaning they would use the updated statement plan in the CP. Maybe, but they don't, and there are probably too many special cases for temporary tables as it is. It's still likely to end up as a different form of the parameter sensitivity problem in any case.
Conclusion
It's a timing issue. One session updates the shared threshold value in the CP and stores a recompiled statement-level plan there, while the other session has a cached MXC threshold value that does not differ sufficiently (or at all, in this case) from the updated value. This causes the second session to sub-optimally reuse the existing statement plan as cached in its private MXC.
The best solution seems to be to add a local OPTION (RECOMPILE) hint on the problematic statement, as you have already discovered.
Yes, you end up recompiling that statement more often than strictly necessary, but perhaps that is better than the occasional performance disaster caused through the scenario above. In this case, it surely is. The recompilation time is low, and the statement runs for a decent length of time in comparison.
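If you want to watch these statement-level recompiles and their stated cause while running the repro, a sketch of an Extended Events session along these lines can help (the session name and target are arbitrary choices, not part of the repro):

```sql
CREATE EVENT SESSION watch_recompiles ON SERVER
ADD EVENT sqlserver.sql_statement_recompile
(
    ACTION (sqlserver.session_id, sqlserver.sql_text)
)
ADD TARGET package0.ring_buffer;

ALTER EVENT SESSION watch_recompiles ON SERVER STATE = START;
```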