The source data represents interactions between people. Some are internal, others are external. The internal ones are recorded in the Users table (represented as the Users CTE in this demonstration).
Each Entries record is identified by an ID (ItemID), the time of the interaction (Sequence) and who performed the interaction (UserID).
The goal is to have a single line per ItemID with the following columns:
- ItemID - self-explanatory
- FirstSequence - Sequence value of the first interaction (remember, in reality this is a timestamp)
- firstInternal - UserID of the first user that IsInternal
- FirstInternalSequence - Sequence of firstInternal
- CountPerFirstInternal - a count of all interactions by the firstInternal user
- CountAllInternal - a count of all interactions with any IsInternal user
- CountAll - count of all interactions for the ItemID
- LastSequence - the last interaction for the ItemID; allows measuring the 'age' of the interaction
----- Demo source data BEGIN
WITH Users AS (
SELECT 'A' AS UserID UNION ALL
SELECT 'B' AS UserID
)
, Entries AS (
SELECT 10001 ItemID, 'X' AS UserID, 101 AS Sequence UNION ALL
SELECT 10001 ItemID, 'A' AS UserID, 102 AS Sequence UNION ALL
SELECT 10001 ItemID, 'X' AS UserID, 103 AS Sequence UNION ALL
SELECT 10001 ItemID, 'B' AS UserID, 104 AS Sequence UNION ALL
SELECT 10001 ItemID, 'X' AS UserID, 105 AS Sequence UNION ALL
SELECT 10001 ItemID, 'A' AS UserID, 106 AS Sequence
UNION ALL
SELECT 10020 ItemID, 'Y' AS UserID, 201 AS Sequence UNION ALL
SELECT 10020 ItemID, 'Y' AS UserID, 202 AS Sequence UNION ALL
SELECT 10020 ItemID, 'B' AS UserID, 203 AS Sequence UNION ALL
SELECT 10020 ItemID, 'Y' AS UserID, 204 AS Sequence UNION ALL
SELECT 10020 ItemID, 'A' AS UserID, 205 AS Sequence UNION ALL
SELECT 10020 ItemID, 'Y' AS UserID, 206 AS Sequence UNION ALL
SELECT 10020 ItemID, 'B' AS UserID, 207 AS Sequence UNION ALL
SELECT 10020 ItemID, 'B' AS UserID, 208 AS Sequence
UNION ALL
SELECT 10300 ItemID, 'A' AS UserID, 301 AS Sequence UNION ALL
SELECT 10300 ItemID, 'Z' AS UserID, 302 AS Sequence UNION ALL
SELECT 10300 ItemID, 'Z' AS UserID, 303 AS Sequence UNION ALL
SELECT 10300 ItemID, 'Z' AS UserID, 304 AS Sequence UNION ALL
SELECT 10300 ItemID, 'A' AS UserID, 305 AS Sequence UNION ALL
SELECT 10300 ItemID, 'Z' AS UserID, 306 AS Sequence UNION ALL
SELECT 10300 ItemID, 'A' AS UserID, 307 AS Sequence
)
----- Demo source data END
----- Code I am asking about
, Src AS (
SELECT
e.ItemID
, e.UserID
, e.Sequence
, CASE WHEN u.UserID IS NULL THEN 0 ELSE 1 END AS IsInternal
FROM Entries AS e
LEFT JOIN Users as u
ON u.UserID = e.UserID
)
, Src_UserID AS (
SELECT
*
, ROW_NUMBER() OVER (
PARTITION BY Src_UserID.ItemID, Src_UserID.IsInternal
ORDER BY Src_UserID.FirstUserSequence
) AS RC
FROM (
SELECT
src.ItemID
, src.IsInternal
, src.UserID
, COUNT(*) AS CountPerUser
, MIN(src.Sequence) AS FirstUserSequence
FROM src
GROUP BY src.ItemID, src.IsInternal, src.UserID
) as Src_UserID
)
, Src_Items AS (
SELECT
src.ItemID
, COUNT(*) AS CountAll
, SUM(IsInternal) AS CountAllInternal
, MIN(src.Sequence) AS FirstSequence
, MAX(src.Sequence) AS LastSequence
FROM src
GROUP BY src.ItemID
)
, Src_FirstInternal AS (
SELECT
src.ItemID
, src.UserID AS firstInternal
, src.CountPerUser AS CountPerFirstInternal
, MIN(src.FirstUserSequence) AS FirstInternalSequence
FROM Src_UserID AS src
WHERE src.IsInternal = 1 AND src.RC = 1
GROUP BY src.ItemID, src.IsInternal, src.UserID, src.CountPerUser
)
SELECT
s0.ItemID
, s0.FirstSequence
, s1.firstInternal
, s1.FirstInternalSequence
, s1.CountPerFirstInternal
, s0.CountAllInternal
, s0.CountAll
, s0.LastSequence
FROM Src_Items as s0
JOIN Src_FirstInternal AS s1
ON s1.ItemID = s0.ItemID
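For reference, the expected output for the demo data above (worked out by hand from the sample rows; Sequence here is the integer stand-in for the real timestamp) is:

```
ItemID  FirstSequence  firstInternal  FirstInternalSequence  CountPerFirstInternal  CountAllInternal  CountAll  LastSequence
10001   101            A              102                    2                      3                 6         106
10020   201            B              203                    3                      4                 8         208
10300   301            A              301                    3                      3                 7         307
```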
Difference between the Demo code and real life:
- Items are in tables, not UNION ALL CTEs
- The list represented by Entries in the demo is, in reality, 800K rows, and takes 8 seconds to retrieve
- The Sequence column is actually a date
Execution plan:
Inspecting the execution plan, I see that most of the time is spent on SORT operations, and the sort occurs 3 times.
Goal
I'm trying to get the above query to perform better. Running this even on a limited set of results takes ages. I wonder if there is a better way of writing this query.
2 Answers
Before we begin with the code...
I just want to address one thing regarding test cases with sample data. To get the best out of a performance review of your queries, try to provide a sample that's as close as possible to your real data. You stated:
Difference between the Demo code and real life:
- Items are in tables and not UNION ALL CTEs
- The list represented by Entries in the demo, in reality is 800K rows, and takes 8 seconds to retrieve.
- Sequence column is actually a date.
While (2) would be difficult to replicate on a small scale, (1) and (3) are fairly simple. I modified your sample data in the following ways to match your real life data more closely:
1. Created temp tables #Users and #Entries, including keys and indexes (clustered indexes are created automatically on primary key constraints).
2. N/A - cannot produce 800K rows of demo data.
3. Changed the Sequence column to the DATETIME type and seeded the demo data using a RAND() formula with DATEADD(). While not completely identical, it should be "close enough".
New demo data:
----- Demo source data BEGIN
IF OBJECT_ID('tempdb..#Users') IS NOT NULL DROP TABLE #Users;
IF OBJECT_ID('tempdb..#Entries') IS NOT NULL DROP TABLE #Entries;
GO
CREATE TABLE #Users (
UserID VARCHAR(100) NOT NULL,
CONSTRAINT PK_#Users PRIMARY KEY (UserID)
);
CREATE TABLE #Entries (
ItemID INT NOT NULL,
UserID VARCHAR(100) NULL,
Sequence DATETIME NOT NULL,
CONSTRAINT PK_#Entries PRIMARY KEY (ItemID, Sequence),
CONSTRAINT FK_#Users FOREIGN KEY (UserID) REFERENCES #Users(UserID)
);
GO
INSERT INTO #Users (UserID)
SELECT 'A' UNION ALL
SELECT 'B' ;
INSERT INTO #Entries (ItemID, UserID, Sequence)
SELECT 10001 ItemID, 'X' AS UserID, DATEADD(HOUR, (RAND() * 1000), GETDATE()) AS Sequence UNION ALL
SELECT 10001 ItemID, 'A' AS UserID, DATEADD(HOUR, (RAND() * 1000), GETDATE()) AS Sequence UNION ALL
-- etc.
SELECT 10300 ItemID, 'A' AS UserID, DATEADD(HOUR, (RAND() * 1000), GETDATE()) AS Sequence ;
GO
----- Demo source data END
For reference to others looking at this, the result set after running the whole query with sample data is as follows:
Performance
This being the meat of your question, let's start by looking at our execution plan, which I ran based on the above sample data. I added markers 1-4 which caught my attention and will address individually.
Note: I will make changes mainly in formatting as we go along.
1. Duplicated Index Scans
Both of those identical scans come from the Src CTE, which is referenced separately by the other 2 CTEs. I was looking for a way to eliminate the left join in favor of an existence check; however, since your IsInternal field needs u.UserID, we will have to keep this join. One possibility: if this kind of operation (checking whether a user is internal) is something that is done frequently in your code base, you may consider adding an IsInternal boolean/bit column to Entries, so you could eliminate this join altogether whenever you need to check if an entry is internal.
I cannot tell you exactly how to optimize that CTE otherwise, but since you are scanning the same source data twice, consider storing the result set in a temp table, which would presumably be a smaller set than the two original tables combined.
WITH Src AS (
SELECT
Src_Ent.ItemID
, Src_Ent.UserID
, Src_Ent.Sequence
/*If the user for this item sequence is not found in Users, we mark it as Internal.*/
, (CASE WHEN Src_Usr.UserID IS NULL THEN 0 ELSE 1 END) AS IsInternal
FROM #Entries AS Src_Ent
LEFT JOIN #Users AS Src_Usr
ON Src_Usr.UserID = Src_Ent.UserID
)
Improvements to formatting: changed table aliases to make the query (and execution plan) easier to read - #Entries AS Src_Ent was e and #Users AS Src_Usr was u. I also wrapped the CASE expression in round brackets to help isolate it visually from its alias, and added a bit of documentation to it.
2. Sort #1 - Src_UserID_Sub
This Sort results from the GROUP BY
clause of Src_UserID_Sub
subquery. Unfortunately it's not possible to eliminate this expensive sort, as rows must be sorted prior to being grouped. It's possible that this would be less expensive if you used a temp table as suggested in step (1) if you had a clustered index, for example by making an artificial primary key such as a RowID INT IDENTITY(1,1)
column on the temp table.
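Combining the temp-table idea from step (1) with the clustered-index suggestion, a sketch might look like this (table and column names are illustrative):

```sql
-- Materialize the Src CTE once instead of scanning the base tables twice.
CREATE TABLE #Src (
    RowID      INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    ItemID     INT          NOT NULL,
    UserID     VARCHAR(100) NULL,
    Sequence   DATETIME     NOT NULL,
    IsInternal BIT          NOT NULL
);

INSERT INTO #Src (ItemID, UserID, Sequence, IsInternal)
SELECT
      e.ItemID
    , e.UserID
    , e.Sequence
    , CASE WHEN u.UserID IS NULL THEN 0 ELSE 1 END
FROM #Entries AS e
LEFT JOIN #Users AS u
    ON u.UserID = e.UserID;

-- The subsequent CTEs then read FROM #Src instead of the Src CTE.
```

Whether the artificial IDENTITY key or a key matching the GROUP BY columns (ItemID, IsInternal, UserID) helps the sort more is something to benchmark.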
3. Sort #2 - Src_UserID with ROW_NUMBER()
This Sort is also impossible to eliminate with your current logic, otherwise the following error will be raised: The function 'ROW_NUMBER' must have an OVER clause with ORDER BY.
It might be possible to do away with ROW_NUMBER()
, but that might also be more harmful than beneficial, as it would likely require a loop or some other construct that is not very "SQL-ish", and in the end, the query optimizer can probably work better with this built-in function than if we rolled our own. So again, very little optimization possible. I did eliminate the SELECT *
in favor of enumerating the columns, as it makes it easier to understand, and in general SELECT *
should usually be avoided for a variety of reasons.
, Src_UserID AS (
SELECT
Src_UserID_Sub.ItemID
, Src_UserID_Sub.IsInternal
, Src_UserID_Sub.UserID
, Src_UserID_Sub.CountPerUser
, Src_UserID_Sub.FirstUserSequence
, ROW_NUMBER() OVER (
PARTITION BY
Src_UserID_Sub.ItemID
, Src_UserID_Sub.IsInternal
ORDER BY
Src_UserID_Sub.FirstUserSequence
) AS [RowCount]
FROM (
/*This subquery is used to get the number of entries per user,
as well as the earliest sequence related to said entries*/
SELECT
Src.ItemID
, Src.IsInternal
, Src.UserID
, COUNT(*) AS CountPerUser
, MIN(Src.Sequence) AS FirstUserSequence
FROM Src
GROUP BY Src.ItemID, Src.IsInternal, Src.UserID
) AS Src_UserID_Sub
)
Improvements to formatting: Changed subquery alias from Src_UserID
to Src_UserID_Sub
to differentiate it from the CTE name and therefore make the code less ambiguous. Got rid of SELECT *
as mentioned above. Added a small amount of documentation explaining what the subquery is for.
4. Hash Match
So here is the most expensive operation in your whole execution, at 31% operator cost. I'm going to quote one of the pros on DBA.StackExchange:
The hash join is one of the more expensive join operations, as it requires the creation of a hash table to do the join. That said, it’s the join that’s best for large, unsorted inputs. It is the most memory-intensive of any of the joins.
Now, the thing is with hash joins: they are not necessarily slow, but they can be, depending on the memory load of the server at the time of execution. Performance can also vary wildly based on the size of the build input vs. the amount of memory available, as the SQL optimizer will attempt to hold the hash table in memory if it can. If it cannot, due to insufficient memory, it has to resort to more complex constructs such as the Grace Hash Join and Recursive Hash Join.
The type of hash join that is used is not easily discerned when optimizing, as this is done dynamically. Per TechNet page on "Understanding Hash Joins":
It is not always possible during optimization to determine which hash join is used. Therefore, SQL Server starts by using an in-memory hash join and gradually transitions to grace hash join, and recursive hash join, depending on the size of the build input.
This one being less predictable, you will have to benchmark different solutions and compare the results. Would using temp tables instead of CTEs help? Maybe. Maybe not. Only way to know for sure is trying it in your environment.
SELECT
items.ItemID
, items.FirstSequence
, internal.firstInternal
, internal.FirstInternalSequence
, internal.CountPerFirstInternal
, items.CountAllInternal
, items.CountAll
, items.LastSequence
FROM Src_Items AS items
JOIN Src_FirstInternal AS internal
ON internal.ItemID = items.ItemID
;
Formatting improvements: changed aliases as follows: s0 -> items and s1 -> internal.
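When benchmarking the different variants against each other, SQL Server's built-in statistics output gives comparable numbers instead of a wall-clock feel:

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- ... run the query variant under test here ...

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
-- Compare 'CPU time', 'elapsed time' and 'logical reads' in the Messages tab.
```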
Overall
I think overall your SQL code is quite well written. From the looks of it, this probably belongs in a stored procedure. If it doesn't, maybe you should make it so; that can also help performance by promoting reuse of the cached execution plan after the first execution.
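A minimal sketch of such a wrapper (the procedure name is illustrative; the body would be the full query from the question):

```sql
CREATE PROCEDURE dbo.GetItemInteractionSummary
AS
BEGIN
    SET NOCOUNT ON;

    WITH Src AS (
        -- ... the full WITH ... SELECT from the question goes here ...
        SELECT 1 AS Placeholder  -- placeholder so the sketch parses
    )
    SELECT * FROM Src;
END;
GO
```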
- WOW, thanks @Phrancis! I appreciate the time and thought you put into this. Just for clarification - you say use a temp table instead of a CTE; I assume you mean the Src CTE. I'll try that. - Lockszmith, Dec 17, 2015
- @Lockszmith I was thinking a #TempTable that you just drop at the end. I think this would help you offset memory cost to an extent by using some disk instead. But like I said, under different loads and on different machines, only testing can tell! - Phrancis, Dec 18, 2015
- I finally figured it out, but am still keeping this as "The Answer". I'll post a full answer in the coming days, when I have some more time. My solution involves doing the first query with GROUPING SETS; the main CTE feeds other CTEs which use the GROUPING values to create different sets of the data. This allows a single sort operation on the main dataset, and then just plucking what's relevant from within. I'm down from never completing (I've usually stopped it after 20 minutes) to 15 seconds. - Lockszmith, Jan 7, 2016
- That's really awesome, glad you got it improved that much! - Phrancis, Jan 7, 2016
I'd like to start by thanking Phrancis for his time, and detailed answer.
A while after I received the answer, I finally revisited the issue with a clear head and solved the problem to my satisfaction.
The trick was to reduce the number of SORT operations the query performs. As originally set up, it was sorting inside nested queries, which is very expensive, since this in effect repeats the sort at each level.
SQL Server helps us with GROUPING SETS, which can perform a single sort operation and provide the different aspects of the grouping, identified by the result of the GROUPING() function.
By utilizing this on the source query, I was able to reduce the cost to a single SORT operation performed on the original data set, and then just 'slice' different sub-sections of the resulting dataset (the main CTE) into secondary CTEs.
This proved to be very effective, turning a query that would not complete even after 20 minutes into one that runs for a few seconds on the entire data set.
I'm posting here in the hopes this proves helpful to others looking at this question.
Instead of sorting in multiple CTEs, I'm performing a GROUPING SETS operation on the first CTE and then reusing it on following CTEs.
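As a minimal illustration of the mechanism (sample data invented for the example), a single GROUP BY GROUPING SETS produces several grouping levels in one pass, and GROUPING() tells you which level each row belongs to:

```sql
SELECT
      t.ItemID
    , t.UserID
    , COUNT(*)           AS _Count
    , GROUPING(t.UserID) AS gUserID   -- 1 when UserID is rolled up, 0 otherwise
FROM (VALUES
    (10001, 'A'),
    (10001, 'A'),
    (10001, 'X')
) AS t(ItemID, UserID)
GROUP BY GROUPING SETS ((t.ItemID), (t.ItemID, t.UserID));
-- Produces (row order not guaranteed):
--   10001, NULL, 3, 1   <- the (ItemID) grouping level
--   10001, 'A',  2, 0   <- the (ItemID, UserID) level
--   10001, 'X',  1, 0
```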
Below is the final code. Let me know if you have any questions regarding this implementation.
Demo Data (Complete As Temp-Tables)
----- Demo source data BEGIN
IF OBJECT_ID('tempdb..#Users') IS NOT NULL DROP TABLE #Users;
IF OBJECT_ID('tempdb..#Entries') IS NOT NULL DROP TABLE #Entries;
GO
CREATE TABLE #Users (
UserID VARCHAR(100) NOT NULL,
CONSTRAINT PK_#Users PRIMARY KEY (UserID)
);
CREATE TABLE #Entries (
ItemID INT NOT NULL,
UserID VARCHAR(100) NULL,
Sequence DATETIME NOT NULL,
CONSTRAINT PK_#Entries PRIMARY KEY (ItemID, Sequence),
CONSTRAINT FK_#Users FOREIGN KEY (UserID) REFERENCES #Users(UserID)
);
GO
INSERT INTO #Users (UserID)
SELECT 'A'
UNION ALL SELECT 'B' ;
INSERT INTO #Entries (ItemID, UserID, Sequence)
SELECT 10001, 'X', DATEADD(MINUTE, (1 * 60) + 1, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10001, 'A', DATEADD(MINUTE, (1 * 60) + 2, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10001, 'X', DATEADD(MINUTE, (1 * 60) + 3, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10001, 'B', DATEADD(MINUTE, (1 * 60) + 4, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10001, 'X', DATEADD(MINUTE, (1 * 60) + 5, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10001, 'A', DATEADD(MINUTE, (1 * 60) + 6, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE()))
UNION ALL
SELECT 10020, 'Y', DATEADD(MINUTE, (2 * 60) + 1, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10020, 'Y', DATEADD(MINUTE, (2 * 60) + 2, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10020, 'B', DATEADD(MINUTE, (2 * 60) + 3, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10020, 'Y', DATEADD(MINUTE, (2 * 60) + 4, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10020, 'A', DATEADD(MINUTE, (2 * 60) + 5, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10020, 'Y', DATEADD(MINUTE, (2 * 60) + 6, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10020, 'B', DATEADD(MINUTE, (2 * 60) + 7, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10020, 'B', DATEADD(MINUTE, (2 * 60) + 8, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE()))
UNION ALL
SELECT 10300, 'A', DATEADD(MINUTE, (3 * 60) + 1, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10300, 'Z', DATEADD(MINUTE, (3 * 60) + 2, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10300, 'Z', DATEADD(MINUTE, (3 * 60) + 3, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10300, 'Z', DATEADD(MINUTE, (3 * 60) + 4, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10300, 'A', DATEADD(MINUTE, (3 * 60) + 5, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10300, 'Z', DATEADD(MINUTE, (3 * 60) + 6, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10300, 'A', DATEADD(MINUTE, (3 * 60) + 7, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE()))
GO
----- Demo source data END
CODE
WITH Src AS (
SELECT
e.ItemID
, e.UserID
, e.IsInternal
, MIN(e.Sequence) minSequence
, MAX(e.Sequence) maxSequence
, COUNT(e.Sequence) _Count
, GROUPING(e.UserID) gUserID
, GROUPING(e.IsInternal) gIsInternal
FROM (
SELECT
e.ItemID
, CASE WHEN u.UserID IS NULL THEN 0 ELSE 1 END AS IsInternal
, e.UserID
, e.Sequence
FROM #Entries AS e
LEFT JOIN #Users as u
ON u.UserID = e.UserID
) AS e
GROUP BY GROUPING SETS ((e.ItemID), (e.ItemID, e.IsInternal), (e.ItemID, e.IsInternal, e.UserID))
)
, Src_1stUsers AS (
SELECT
Src.ItemID
, Src.UserID FirstInternal
, Src.minSequence FirstInternalSequence
, Src._Count CountPerFirstInternal
FROM Src
INNER JOIN (
SELECT
_Src.ItemID
, MIN(minSequence) minSequence
FROM Src AS _Src
WHERE
_Src.gUserID = 0
AND _Src.IsInternal = 1
AND _Src.UserID is not NULL
GROUP BY _Src.ItemID
) AS Src1
ON Src.ItemID = Src1.ItemID
AND Src.minSequence = Src1.minSequence
WHERE
Src.UserID is not NULL
),
Src_Item AS (
SELECT
Src.ItemID
, Src.minSequence FirstSequence
, Src.maxSequence LastSequence
, Src._Count CountAll
FROM Src
WHERE Src.gUserID = 1 AND Src.gIsInternal = 1
),
Src_Internal AS (
SELECT
Src.ItemID
, Src.minSequence FirstInternalSequence
, Src._Count CountAllInternal
FROM Src
WHERE Src.gUserID = 1 AND Src.gIsInternal = 0 AND Src.IsInternal = 1
)
SELECT
--* FROM Src
i.ItemID
, sItm.FirstSequence
, s1st.FirstInternal
, s1st.FirstInternalSequence
, s1st.CountPerFirstInternal
, sInt.CountAllInternal
, sItm.CountAll
, sItm.LastSequence
FROM (
SELECT DISTINCT ItemID FROM Src
) AS i
LEFT JOIN Src_1stUsers AS s1st
ON i.ItemID = s1st.ItemID
LEFT JOIN Src_Item as sItm
ON i.ItemID = sItm.ItemID
LEFT JOIN Src_Internal AS sInt
ON i.ItemID = sInt.ItemID
GO
With the demo data the effect isn't noticeable, but with the real dataset, 99% of the query time has passed by the 3rd step.
Execution Plan of Demo Data
Execution Plan of REAL data set *
* The real query doesn't use the SELECT DISTINCT, but a list of items that already exists in the database.