I am having trouble optimizing a query that does an inner join over a date range. The purpose of the query is to take daily data and summarize it by week.
SELECT pcw.EndDate AS WeekEndDate, h.Store, SUM(h.DeliveryChargesTotal) AS DeliveryChargesTotal
FROM Daily_GC_Headers h
INNER JOIN PeriodCalendar_Weeks pcw
    ON h.SalesDate BETWEEN pcw.StartDate AND pcw.EndDate
WHERE SalesDate BETWEEN @StartDate AND @EndDate AND IsCanceled = 0
GROUP BY pcw.EndDate, h.Store
Simplified schema of the Daily_GC_Headers table (13.8 million rows; about 5.4 million match the criteria in the WHERE clause):
Store - Varchar(10) (PK)
SalesDate - Date (PK)
TicketNumber - SmallInt (PK; starts over 1 each day at each store.)
IsCanceled - Bit
DeliveryChargesTotal - Decimal(9,2)
Simplified schema of the PeriodCalendar_Weeks table (570 rows; 53 match the criteria):
Year - smallint (PK)
Period - tinyint (PK)
Week - tinyint (PK)
StartDate - Date
EndDate - Date
This query takes about 15 seconds in SSMS. Querying Daily_GC_Headers by itself (and just grouping by Store) takes 2 seconds. A query against PeriodCalendar_Weeks is "instant".
DBCC SHOW_STATISTICS indicates that the stats on both tables are current (we run a weekly job to update them). I've also tried clearing the plan cache.
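For reference, this is roughly how the histogram can be inspected; a minimal sketch (the statistics name PK_Daily_GC_Headers is hypothetical; list the actual names on your table first):

```sql
-- List the statistics objects on the table (names are table-specific).
EXEC sp_helpstats 'dbo.Daily_GC_Headers', 'ALL';

-- Dump just the histogram for one statistics object (name is an example).
DBCC SHOW_STATISTICS ('dbo.Daily_GC_Headers', 'PK_Daily_GC_Headers')
    WITH HISTOGRAM;
```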
The execution plan is strange. For example, it does an Eager Spool on PeriodCalendar_Weeks: the estimated row count is 156.6, but the actual count is 153,971. It then filters the results of that first spool and does a Lazy Spool. The estimated/actual row count of that second spool is 5.4 million, even though the underlying table has fewer than 600 rows in it.
What should I be looking for or doing to optimize this?
Additional Information
For the sake of clarity: I initially described an oversimplified PK on the Weeks table. I have updated the schema above to show the full key. The PK described for the Headers table is (and was) the full key.
Screenshot of some rows from the Weeks table: [screenshot]
Stats from the Weeks table: [screenshot]
Some stats from the Headers table. There seems to be a histogram step for roughly every 5-10 days across the table's entire history (3 years): [screenshot]
1 Answer
It is much more efficient to do this without having to go back and join to the periods table:
DECLARE @StartDate DATE, @EndDate DATE;

SELECT @StartDate = MIN(StartDate), @EndDate = MAX(EndDate)
FROM dbo.PeriodCalendar_Weeks pcw
WHERE (pcw.Year = @Year AND pcw.Period < @Period)
   OR (pcw.Year = @Year AND pcw.Period = @Period AND pcw.Week <= @Week)
   OR (pcw.Year = @Year - 1 AND pcw.Period >= @Period);
SELECT
WeekEndDate = DATEADD(DAY, 6, DATEADD(WEEK, SalesWeek, @StartDate)),
Store,
DeliveryChargesTotal = dct
FROM
(
SELECT DATEDIFF(DAY, @StartDate, SalesDate)/7, Store, SUM(DeliveryChargesTotal)
FROM dbo.Daily_GC_Headers
WHERE SalesDate BETWEEN @StartDate AND @EndDate AND isCanceled = 0
GROUP BY DATEDIFF(DAY, @StartDate, SalesDate)/7, Store
) AS x (SalesWeek, Store, dct)
ORDER BY WeekEndDate, Store;
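To illustrate the bucketing arithmetic (a sketch with made-up dates, assuming the calendar weeks are contiguous 7-day spans beginning at @StartDate):

```sql
DECLARE @StartDate DATE = '2014-01-01';  -- hypothetical week-1 start

-- Integer division by 7 maps each date to a zero-based week bucket;
-- the bucket's end date is the bucket start plus six days.
SELECT  CAST(v.d AS DATE)                          AS SalesDate,
        DATEDIFF(DAY, @StartDate, CAST(v.d AS DATE)) / 7 AS SalesWeek,
        DATEADD(DAY, 6, DATEADD(WEEK,
            DATEDIFF(DAY, @StartDate, CAST(v.d AS DATE)) / 7,
            @StartDate))                           AS WeekEndDate
FROM (VALUES ('2014-01-01'), ('2014-01-07'), ('2014-01-08')) AS v(d);
-- 2014-01-01 and 2014-01-07 land in week 0 (WeekEndDate 2014-01-07);
-- 2014-01-08 starts week 1 (WeekEndDate 2014-01-14).
```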
A filtered index may help if many rows exist where isCanceled = 1 (these are just possible suggestions, depending on the cardinality of Store, and may not be optimal):
CREATE INDEX x ON dbo.Daily_GC_Headers
(SalesDate) INCLUDE (Store, DeliveryChargesTotal)
WHERE isCanceled = 0;
If there are very few rows where isCanceled = 1, this may be better:
CREATE INDEX x ON dbo.Daily_GC_Headers
(SalesDate, IsCanceled) INCLUDE (Store, DeliveryChargesTotal);
Both are worth trying on a test system, as is moving Store into the key in either case, or moving IsCanceled to the INCLUDE list in the latter case. On my system, I found the best results with everything but the date in the INCLUDE list:
CREATE INDEX x ON dbo.Daily_GC_Headers
(SalesDate) INCLUDE (Store, IsCanceled, DeliveryChargesTotal);
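One way to compare the candidates on a test system (a sketch; the parameter values are placeholders, and you would create one candidate index, run the aggregate while noting the logical reads, then drop it before trying the next):

```sql
DECLARE @StartDate DATE = '2014-01-01',  -- placeholder range
        @EndDate   DATE = '2014-12-31';

SET STATISTICS IO ON;

-- Run this once per candidate index and compare the logical reads
-- reported in the Messages tab.
SELECT Store, SUM(DeliveryChargesTotal) AS DeliveryChargesTotal
FROM dbo.Daily_GC_Headers
WHERE SalesDate BETWEEN @StartDate AND @EndDate
  AND IsCanceled = 0
GROUP BY Store;

SET STATISTICS IO OFF;

-- Drop the current candidate before creating the next one:
DROP INDEX x ON dbo.Daily_GC_Headers;
```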
Again, you will need to test whether any of these work out, or whether the query above gives a different/better recommendation directly from SQL Server.
- IsCanceled = 1 on at least 95% of the records. – poke, Mar 18, 2014 at 19:58
- There are only about ~110 different values for Store, so I'd say its cardinality is pretty low. – poke, Mar 18, 2014 at 20:03
- @AaronBertrand: This query ran in 2 seconds, which is how fast a straight SELECT...SUM()...GROUP BY is on that table with no join. The execution plan didn't recommend any indexes, though. – poke, Mar 18, 2014 at 20:12
- @poke So eliminating the redundant join is better, yes? Is the output different from your slower version of the query, or missing any information? – Aaron Bertrand, Mar 18, 2014 at 20:14
- Yes, it's better. My query was taking 15 seconds. The WeekEndDate is actually showing the week start date, but I know how to fix that. – poke, Mar 18, 2014 at 20:24
- Have you run SET STATISTICS IO and compared logical reads?
- The isCanceled column is in the Daily_GC_Headers table, right? Also add the datatypes of the columns and the indexes you have to the question.