
Why does SQL Server use parallelism when running this query with a subquery, but not when running the equivalent query with a join? The join version runs serially and takes around 30 times longer to complete.

Join version: ~30 seconds


Subquery version: < 1 second


EDIT: Xml versions of query plan:

JOIN version

SUBQUERY version

Aaron Bertrand
asked Jan 22, 2014 at 15:05

1 Answer


As already indicated in the comments, it looks as though you need to update your statistics.

The estimated number of rows coming out of the join between location and testruns is hugely different between the two plans.

Join plan estimates: 1


Sub query plan estimates: 8,748


The actual number of rows coming out of the join is 14,276.

Of course, it makes no intuitive sense that the join version estimates that 3 rows will come from location and produce a single joined row, whereas the subquery version estimates that a single one of those rows will produce 8,748 rows from the same join. Nonetheless, I was able to reproduce this.

This seems to happen if the histograms on the two join columns do not overlap at all when the statistics are created. The join version then assumes a single row, while the single equality seek of the subquery version gets the same estimate as an equality seek against an unknown variable.

The cardinality of testruns is 26,244. Assuming it is populated with three distinct location ids, the following query estimates that 8,748 rows will be returned (26,244 / 3):

DECLARE @i INT;
SELECT *
FROM testruns AS tr
WHERE tr.location_id = @i;
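
The 8,748 figure can be cross-checked against the statistics themselves. This is only a sketch, and it assumes the repro objects defined later in this answer (table testruns with a clustered index named IX). For an equality predicate on an unknown variable, the estimate should be the "All density" value multiplied by the table cardinality.

```sql
-- Assumes the repro objects from later in this answer:
-- table testruns with clustered index IX on location_id.
DBCC SHOW_STATISTICS ('testruns', IX) WITH DENSITY_VECTOR;

-- With three distinct location_id values, "All density" should be 1/3.
-- Dividing the cardinality by the number of distinct values is the same arithmetic.
SELECT 26244 / 3.0 AS estimated_rows;
```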

Given that the location table contains only 3 rows, it is easy (if we assume no foreign keys) to contrive a situation where the statistics are created and the data is then altered in a way that dramatically affects the actual number of rows returned but is insufficient to trip the auto-update of statistics and the recompile threshold.
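
One way to see whether statistics have gone stale in this way is to look at when they were last updated and how many modifications have accumulated since. The query below is a sketch that assumes the table names location and testruns, and a build recent enough to have sys.dm_db_stats_properties (I believe 2008 R2 SP2 or 2012 SP1 onwards).

```sql
-- Table names are assumptions taken from the plans in the question.
SELECT OBJECT_NAME(s.object_id) AS table_name,
       s.name                   AS stats_name,
       sp.last_updated,
       sp.rows,                 -- row count when the statistics were built
       sp.modification_counter  -- changes to the leading column since then
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id IN (OBJECT_ID('location'), OBJECT_ID('testruns'));
```

A small modification_counter on a small table is exactly the case where the data can change completely without an automatic statistics update being triggered.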

Because SQL Server gets the number of rows coming out of that join so wrong, all the other row estimates in the join plan are massively underestimated. As well as giving you a serial plan, this means the query gets an insufficient memory grant, so the sorts and hash joins spill to tempdb.
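
If you want to observe the memory grant rather than infer it from the plan, a rough sketch is to query sys.dm_exec_query_memory_grants from a second session while the slow join query is still running. It only lists requests that currently hold or are waiting for a grant, so timing matters.

```sql
-- Run from a second session while the join version is executing.
SELECT session_id,
       dop,                  -- degree of parallelism; 1 means a serial plan
       requested_memory_kb,
       granted_memory_kb,
       max_used_memory_kb
FROM sys.dm_exec_query_memory_grants;
```

A grant that is tiny relative to the data being sorted and hashed is consistent with the spills described above, although the DMV does not report spills directly.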

One possible scenario that reproduces the actual vs estimated rows shown in your plan is below.

CREATE TABLE location
 (
 id INT CONSTRAINT locationpk PRIMARY KEY,
 location VARCHAR(MAX) /*Guessing MAX, judging by the separate Filter operator in your plan*/
 )
/*Temporary ids; these will be updated later*/
INSERT INTO location
VALUES (101, 'Coventry'),
 (102, 'Nottingham'),
 (103, 'Derby')
CREATE TABLE testruns
 (
 location_id INT
 )
CREATE CLUSTERED INDEX IX ON testruns(location_id)
/*Add in 26244 rows of data split over three ids*/
INSERT INTO testruns
SELECT TOP (5984) 1
FROM master..spt_values v1, master..spt_values v2
UNION ALL
SELECT TOP (5984) 2
FROM master..spt_values v1, master..spt_values v2
UNION ALL
SELECT TOP (14276) 3
FROM master..spt_values v1, master..spt_values v2
/*Create statistics. The location_id histograms don't intersect at all*/
UPDATE STATISTICS location(locationpk) WITH FULLSCAN; 
UPDATE STATISTICS testruns(IX) WITH FULLSCAN;
/* UPDATE location.id. Three row update is below recompile threshold*/
UPDATE location
SET id = id - 100
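
At this point the two histograms can be inspected to confirm that they do not intersect. This is a sketch against the repro objects created above; the statistics on location were built before the final UPDATE, so they should still describe the temporary ids.

```sql
-- Histogram for location.id: built before the UPDATE,
-- so it should still describe ids 101 to 103.
DBCC SHOW_STATISTICS ('location', locationpk) WITH HISTOGRAM;

-- Histogram for testruns.location_id: describes ids 1 to 3.
DBCC SHOW_STATISTICS ('testruns', IX) WITH HISTOGRAM;
```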

Then running the following queries gives the same estimated vs actual discrepancy:

SELECT *
FROM testruns AS tr
WHERE tr.location_id = (SELECT id
 FROM location
 WHERE location = 'Derby')
SELECT *
FROM testruns AS tr
 JOIN location loc
 ON tr.location_id = loc.id
WHERE loc.location = ( 'Derby' ) 
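
To check that stale statistics are really the cause, the statistics in the repro can be refreshed and the join version run again. This is a sketch against the repro tables above; OPTION (RECOMPILE) is added only to make sure a fresh plan is compiled, and I have not listed expected estimates because they may vary by version.

```sql
-- Rebuild the statistics so both histograms reflect the current ids (1 to 3).
UPDATE STATISTICS location WITH FULLSCAN;
UPDATE STATISTICS testruns WITH FULLSCAN;

-- Re-run the join version and compare estimated vs actual rows in the plan.
SELECT *
FROM testruns AS tr
     JOIN location loc
       ON tr.location_id = loc.id
WHERE loc.location = 'Derby'
OPTION (RECOMPILE);
```

The same UPDATE STATISTICS statements, pointed at your real tables, are the first thing to try on the original query.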
answered Jan 23, 2014 at 0:00
  • If a unique constraint is added on location then it becomes obvious that "=" will return exactly one row. Then in your example the query plans become identical (scans -> seeks): alter table Location add constraint U_Location_Location unique nonclustered (Location); Commented Jan 23, 2014 at 0:43
  • @crokusek yes. Realised what you meant afterwards and deleted my comment! Does that increase the estimated number of rows for the join version to be same as the subquery as well? Not on PC at moment to test? Commented Jan 23, 2014 at 1:12
  • @crokusek Yep. looks like same estimated rows out of the join as for the sub query in that singleton case. Commented Jan 23, 2014 at 1:21
  • Yes. Identical query plan, both estimates 8748, both actuals 14276. Btw, I thought pre-computing the locationId would resolve that difference but it does not. Commented Jan 23, 2014 at 1:24
  • @crokusek - I will also add the unique constraint to location and other similar places within my DB. I must admit I didn't realise it affected query optimisation. I thought it was just to ensure data integrity. Thanks for your input on this question. Commented Jan 24, 2014 at 13:31
