Why does SQL Server run a subquery for each row of the table it's qualifying?

Question 1

This query runs in ~21 seconds (execution plan):

select 
 a.month
 , count(*) 
from SubqueryTest a 
where a.year = (select max(b.year) from SubqueryTest b)
group by a.month

When the subquery is replaced with a variable, it runs in <1 second (execution plan):

declare @year float
select @year = max(b.year) from SubqueryTest b
select 
 month
 , count(*) 
from SubqueryTest where year = @year group by month

Judging from the execution plan, the "select max..." sub-select is run for each of the millions of rows in "SubqueryTest a:, which is why it takes so long.

My question: Since the sub-select is scalar, deterministic and not correlated, why doesn't the query optimizer do what I did in my second example and run the subquery once, store the result, then use it for the main query? I'm sure there's just a hole in my understanding of SQL Server, but I'd really like help filling it - a couple hours with google haven't helped.

The table is just over 1gb with almost 28 million records:

CREATE TABLE SubqueryTest(
 [pk_id] [int] IDENTITY(1,1) NOT NULL
 , [Year] [float] NULL
 , [Month] [float] NULL PRIMARY KEY CLUSTERED ([pk_id] ASC))
CREATE NONCLUSTERED INDEX idxSubqueryTest ON SubqueryTest ([Year] ASC)

Question 2

My wonder is why you have Year as float. Sorry, no, that makes sense for Stardates. But Month as float? Really buffles me.

Question 3

Can you supply the execution plans?

Question 4

@ypercube :) This data behind this came from Access (which is still the front-end). It was migrated from using the Access-to-SQL-Server migration wizard which likes floats.

Question 5

@MartinSmith Execution plans added.

Question 6

What is the result of moving your sub-query from the where clause into the body of the query as a inner join? select a.month, count(*) from SubQueryTest a join (select max(year) as [year] from SubQueryTest b) as b on a.year = b.year group by a.month

Question 7

The slow plan isn't calculating the MAX for each row in the outer query.

In fact it never explicitly calculates it at all.

It gives a plan similar to

WITH CTE
 AS (SELECT TOP(1) WITH TIES *
 FROM SubqueryTest
 WHERE year IS NOT NULL
 ORDER BY year desc)
SELECT month,
 count(*)
FROM CTE
GROUP BY month

Slow Plan (Estimated Row Counts)

enter image description here

You have a non covering index on year asc so it scans that backwards to get the rows in the first year (shows as a seek because of the implicit IS NOT NULL predicate).

Unfortunately it doesn't seem to differentiate between TOP 1 and TOP 1 WITH TIES when estimating row counts.

In this case it makes a huge difference. (estimated 2 key lookup vs actual 4,424,803) so you get an inappropriate plan.

Slow Plan (Actual Row Counts)

enter image description here

You could consider adding month into the index on year either as a key or included column to make the index covering. The benefit of adding it as a secondary key column would be that it could then feed into a stream aggregate without an additional sort (though for only 12 distinct values a hash aggregate would be fine anyway).

A non covering index on such a non selective column is really pretty useless for the vast majority of queries. The index is totally ignored by the "fast" plan which ends up doing a parallel scan on the whole table and evaluating the predicate on all 27,445,400 rows (in preference to performing the huge number of lookups).

enter image description here

Question 8

Frankly, this seems like a performance/optimizer bug to me. It's allowed to assume a static, stable result from the subquery and cache it, so why doesn't it just always do that, irrespective of the estimated row counts? When is the plan its chosen ever better?

Question 9

@RBarryYoung - Well the plan with the variable isn't great either! If there were only a handful of duplicates for the TOP 1 this would be the best plan. The bug to me is that it doesn't look at average selectivity for that column when estimating rows for TOP 1 WITH TIES

Question 10

Hmmm, this is still confusing/weird. Why I'm looking at the slow plan and some of the "Estimated subtree cost"s are a lot less than the individual IO and CPU costs. Maybe my brain just isn't working today, but that seems impossible to me...?

Question 11

@RBarryYoung - Because it is under a TOP 1 so they get scaled down for a row goal. SQL Server estimates that the TOP will stop requesting rows after the first row is received. In fact as the first 4,424,803 rows out of the index scan have the same year it takes many more than that.

Question 12

@RBarryYoung - Paul White looks at this specifically in his answer here. Does that help?

Martin Smith Martin Smith 88.4k15 gold badges257 silver badges357 bronze badges · Accepted Answer · 2014-01-23 16:00:24Z

The slow plan isn't calculating the MAX for each row in the outer query.

In fact it never explicitly calculates it at all.

It gives a plan similar to

WITH CTE
 AS (SELECT TOP(1) WITH TIES *
 FROM SubqueryTest
 WHERE year IS NOT NULL
 ORDER BY year desc)
SELECT month,
 count(*)
FROM CTE
GROUP BY month

Slow Plan (Estimated Row Counts)

enter image description here

You have a non covering index on year asc so it scans that backwards to get the rows in the first year (shows as a seek because of the implicit IS NOT NULL predicate).

Unfortunately it doesn't seem to differentiate between TOP 1 and TOP 1 WITH TIES when estimating row counts.

In this case it makes a huge difference. (estimated 2 key lookup vs actual 4,424,803) so you get an inappropriate plan.

Slow Plan (Actual Row Counts)

enter image description here

You could consider adding month into the index on year either as a key or included column to make the index covering. The benefit of adding it as a secondary key column would be that it could then feed into a stream aggregate without an additional sort (though for only 12 distinct values a hash aggregate would be fine anyway).

A non covering index on such a non selective column is really pretty useless for the vast majority of queries. The index is totally ignored by the "fast" plan which ends up doing a parallel scan on the whole table and evaluating the predicate on all 27,445,400 rows (in preference to performing the huge number of lookups).

enter image description here

Frankly, this seems like a performance/optimizer bug to me. It's allowed to assume a static, stable result from the subquery and cache it, so why doesn't it just always do that, irrespective of the estimated row counts? When is the plan its chosen ever better?
@RBarryYoung - Well the plan with the variable isn't great either! If there were only a handful of duplicates for the TOP 1 this would be the best plan. The bug to me is that it doesn't look at average selectivity for that column when estimating rows for TOP 1 WITH TIES
Hmmm, this is still confusing/weird. Why I'm looking at the slow plan and some of the "Estimated subtree cost"s are a lot less than the individual IO and CPU costs. Maybe my brain just isn't working today, but that seems impossible to me...?
@RBarryYoung - Because it is under a TOP 1 so they get scaled down for a row goal. SQL Server estimates that the TOP will stop requesting rows after the first row is received. In fact as the first 4,424,803 rows out of the index scan have the same year it takes many more than that.
@RBarryYoung - Paul White looks at this specifically in his answer here. Does that help?

Stack Exchange Network

Why does SQL Server run a subquery for each row of the table it's qualifying?

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

Why does SQL Server run a subquery for each row of the table it's qualifying?

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related

Hot Network Questions