The source data represents interactions between people. Some are internal, others are external. The internal ones are recorded in the Users table (represented as the Users CTE in this demonstration).
Each Entries record is identified by an ID (ItemID), the time of the interaction (Sequence) and who performed the interaction (UserID).
The goal is to have a single line per ItemID with the following columns:
- ItemID - self-explanatory
- FirstSequence - Sequence value of the first interaction (remember, in reality this is a timestamp)
- firstInternal - UserID of the first user that IsInternal
- FirstInternalSequence - Sequence of firstInternal
- CountPerFirstInternal - a count of all interactions by the firstInternal user
- CountAllInternal - a count of all interactions with any IsInternal user
- CountAll - count of all interactions for the ItemID
- LastSequence - the last interaction for the ItemID; allows measuring the 'age' of the interaction
----- Demo source data BEGIN
WITH Users AS (
SELECT 'A' AS UserID UNION ALL
SELECT 'B' AS UserID
)
, Entries AS (
SELECT 10001 ItemID, 'X' AS UserID, 101 AS Sequence UNION ALL
SELECT 10001 ItemID, 'A' AS UserID, 102 AS Sequence UNION ALL
SELECT 10001 ItemID, 'X' AS UserID, 103 AS Sequence UNION ALL
SELECT 10001 ItemID, 'B' AS UserID, 104 AS Sequence UNION ALL
SELECT 10001 ItemID, 'X' AS UserID, 105 AS Sequence UNION ALL
SELECT 10001 ItemID, 'A' AS UserID, 106 AS Sequence
UNION ALL
SELECT 10020 ItemID, 'Y' AS UserID, 201 AS Sequence UNION ALL
SELECT 10020 ItemID, 'Y' AS UserID, 202 AS Sequence UNION ALL
SELECT 10020 ItemID, 'B' AS UserID, 203 AS Sequence UNION ALL
SELECT 10020 ItemID, 'Y' AS UserID, 204 AS Sequence UNION ALL
SELECT 10020 ItemID, 'A' AS UserID, 205 AS Sequence UNION ALL
SELECT 10020 ItemID, 'Y' AS UserID, 206 AS Sequence UNION ALL
SELECT 10020 ItemID, 'B' AS UserID, 207 AS Sequence UNION ALL
SELECT 10020 ItemID, 'B' AS UserID, 208 AS Sequence
UNION ALL
SELECT 10300 ItemID, 'A' AS UserID, 301 AS Sequence UNION ALL
SELECT 10300 ItemID, 'Z' AS UserID, 302 AS Sequence UNION ALL
SELECT 10300 ItemID, 'Z' AS UserID, 303 AS Sequence UNION ALL
SELECT 10300 ItemID, 'Z' AS UserID, 304 AS Sequence UNION ALL
SELECT 10300 ItemID, 'A' AS UserID, 305 AS Sequence UNION ALL
SELECT 10300 ItemID, 'Z' AS UserID, 306 AS Sequence UNION ALL
SELECT 10300 ItemID, 'A' AS UserID, 307 AS Sequence
)
----- Demo source data END
----- Code I am asking about
, Src AS (
SELECT
e.ItemID
, e.UserID
, e.Sequence
, CASE WHEN u.UserID IS NULL THEN 0 ELSE 1 END AS IsInternal
FROM Entries AS e
LEFT JOIN Users as u
ON u.UserID = e.UserID
)
, Src_UserID AS (
SELECT
*
, ROW_NUMBER() OVER (
PARTITION BY Src_UserID.ItemID, Src_UserID.IsInternal
ORDER BY Src_UserID.FirstUserSequence
) AS RC
FROM (
SELECT
src.ItemID
, src.IsInternal
, src.UserID
, COUNT(*) AS CountPerUser
, MIN(src.Sequence) AS FirstUserSequence
FROM src
GROUP BY src.ItemID, src.IsInternal, src.UserID
) as Src_UserID
)
, Src_Items AS (
SELECT
src.ItemID
, COUNT(*) AS CountAll
, SUM(IsInternal) AS CountAllInternal
, MIN(src.Sequence) AS FirstSequence
, MAX(src.Sequence) AS LastSequence
FROM src
GROUP BY src.ItemID
)
, Src_FirstInternal AS (
SELECT
src.ItemID
, src.UserID AS firstInternal
, src.CountPerUser AS CountPerFirstInternal
, MIN(src.FirstUserSequence) AS FirstInternalSequence
FROM Src_UserID AS src
WHERE src.IsInternal = 1 AND src.RC = 1
GROUP BY src.ItemID, src.IsInternal, src.UserID, src.CountPerUser
)
SELECT
s0.ItemID
, s0.FirstSequence
, s1.firstInternal
, s1.FirstInternalSequence
, s1.CountPerFirstInternal
, s0.CountAllInternal
, s0.CountAll
, s0.LastSequence
FROM Src_Items as s0
JOIN Src_FirstInternal AS s1
ON s1.ItemID = s0.ItemID
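For reference, the expected output for the demo data above (worked out by hand from the sample rows; Sequence here is the integer stand-in for the real timestamp) is:

```
ItemID  FirstSequence  firstInternal  FirstInternalSequence  CountPerFirstInternal  CountAllInternal  CountAll  LastSequence
10001   101            A              102                    2                      3                 6         106
10020   201            B              203                    3                      4                 8         208
10300   301            A              301                    3                      3                 7         307
```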
Difference between the Demo code and real life:
- Items are in tables, not UNION ALL CTEs
- The list represented by Entries in the demo is, in reality, 800K rows, and takes 8 seconds to retrieve
- The Sequence column is actually a date
Execution plan:
Inspecting the execution plan, I see that most of the time is spent on SORT operations, and the sort occurs 3 times.
Goal
I'm trying to get the above query to perform better. Running this even on a limited set of results takes ages. I wonder if there is a better way of writing this query.
2 Answers
Before we begin with the code...
I just want to address one thing regarding test cases with sample data. To get the best out of a performance review of your queries, try to provide a sample that's as close as possible to your real data. You stated:
Difference between the Demo code and real life:
- Items are in tables and not UNION ALL CTEs
- The list represented by Entries in the demo, in reality is 800K rows, and takes 8 seconds to retrieve.
- Sequence column is actually a date.
While (2) would be difficult to replicate on a small scale, (1) and (3) are fairly simple. I modified your sample data in the following ways to match your real life data more closely:
1. Created temp tables #Users and #Entries, including keys and indexes (clustered indexes are created automatically on primary key constraints).
2. N/A - cannot produce 800K rows of demo data.
3. Changed the Sequence column to the DATETIME type and seeded the demo data using a RAND() formula with DATEADD(). While not completely identical, it should be "close enough".
New demo data:
----- Demo source data BEGIN
IF OBJECT_ID('tempdb..#Users') IS NOT NULL DROP TABLE #Users;
IF OBJECT_ID('tempdb..#Entries') IS NOT NULL DROP TABLE #Entries;
GO
CREATE TABLE #Users (
UserID VARCHAR(100) NOT NULL,
CONSTRAINT PK_#Users PRIMARY KEY (UserID)
);
CREATE TABLE #Entries (
ItemID INT NOT NULL,
UserID VARCHAR(100) NULL,
Sequence DATETIME NOT NULL,
CONSTRAINT PK_#Entries PRIMARY KEY (ItemID, Sequence),
CONSTRAINT FK_#Users FOREIGN KEY (UserID) REFERENCES #Users(UserID)
);
GO
INSERT INTO #Users (UserID)
SELECT 'A' UNION ALL
SELECT 'B' ;
INSERT INTO #Entries (ItemID, UserID, Sequence)
SELECT 10001 ItemID, 'X' AS UserID, DATEADD(HOUR, (RAND() * 1000), GETDATE()) AS Sequence UNION ALL
SELECT 10001 ItemID, 'A' AS UserID, DATEADD(HOUR, (RAND() * 1000), GETDATE()) AS Sequence UNION ALL
-- etc.
SELECT 10300 ItemID, 'A' AS UserID, DATEADD(HOUR, (RAND() * 1000), GETDATE()) AS Sequence ;
GO
----- Demo source data END
For reference to others looking at this, the result set after running the whole query with sample data is as follows:
Performance
This being the meat of your question, let's start by looking at our execution plan, which I ran based on the above sample data. I added markers 1-4 which caught my attention and will address individually.
Note: I will make changes mainly in formatting as we go along.
1. Duplicated Index Scans
Both of those identical scans come from the Src CTE, which is referenced separately by the other 2 CTEs. I was looking for a way to eliminate the left join in favor of an existence check; however, since your IsInternal field needs u.UserID, we will have to keep this join. One possibility: if this kind of operation (checking whether a user is internal) is something that is done frequently in your code base, you may consider adding an IsInternal boolean/bit column to Entries, so you could eliminate this join altogether whenever you need to check if an entry is internal.
I cannot tell you exactly how to optimize that CTE otherwise, but since you are scanning the same source data twice, consider storing the result set in a temp table, which would presumably be a smaller set than the two original tables combined.
WITH Src AS (
SELECT
Src_Ent.ItemID
, Src_Ent.UserID
, Src_Ent.Sequence
/*If the user for this item sequence is not found in Users, we mark it as Internal.*/
, (CASE WHEN Src_Usr.UserID IS NULL THEN 0 ELSE 1 END) AS IsInternal
FROM #Entries AS Src_Ent
LEFT JOIN #Users AS Src_Usr
ON Src_Usr.UserID = Src_Ent.UserID
)
Improvements to formatting: changed table aliases to make the query (and execution plan) easier to read - #Entries AS Src_Ent was e and #Users AS Src_Usr was u. I also wrapped the CASE expression in round brackets to help isolate it visually from its alias, and added a bit of documentation to it.
2. Sort #1 - Src_UserID_Sub
This Sort results from the GROUP BY
clause of Src_UserID_Sub
subquery. Unfortunately it's not possible to eliminate this expensive sort, as rows must be sorted prior to being grouped. It's possible that this would be less expensive if you used a temp table as suggested in step (1) if you had a clustered index, for example by making an artificial primary key such as a RowID INT IDENTITY(1,1)
column on the temp table.
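Combining the temp-table idea from step (1) with the clustered-index suggestion, a sketch might look like this (table and column names are illustrative):

```sql
-- Materialize the Src CTE once instead of scanning the base tables twice.
CREATE TABLE #Src (
    RowID      INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
    ItemID     INT          NOT NULL,
    UserID     VARCHAR(100) NULL,
    Sequence   DATETIME     NOT NULL,
    IsInternal BIT          NOT NULL
);

INSERT INTO #Src (ItemID, UserID, Sequence, IsInternal)
SELECT
      e.ItemID
    , e.UserID
    , e.Sequence
    , CASE WHEN u.UserID IS NULL THEN 0 ELSE 1 END
FROM #Entries AS e
LEFT JOIN #Users AS u
    ON u.UserID = e.UserID;

-- The subsequent CTEs then read FROM #Src instead of the Src CTE.
```

Whether the artificial IDENTITY key or a key matching the GROUP BY columns (ItemID, IsInternal, UserID) helps the sort more is something to benchmark.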
3. Sort #2 - Src_UserID with ROW_NUMBER()
This Sort is also impossible to eliminate with your current logic, otherwise the following error will be raised: The function 'ROW_NUMBER' must have an OVER clause with ORDER BY.
It might be possible to do away with ROW_NUMBER()
, but that might also be more harmful than beneficial, as it would likely require a loop or some other construct that is not very "SQL-ish", and in the end, the query optimizer can probably work better with this built-in function than if we rolled our own. So again, very little optimization possible. I did eliminate the SELECT *
in favor of enumerating the columns, as it makes it easier to understand, and in general SELECT *
should usually be avoided for a variety of reasons.
, Src_UserID AS (
SELECT
Src_UserID_Sub.ItemID
, Src_UserID_Sub.IsInternal
, Src_UserID_Sub.UserID
, Src_UserID_Sub.CountPerUser
, Src_UserID_Sub.FirstUserSequence
, ROW_NUMBER() OVER (
PARTITION BY
Src_UserID_Sub.ItemID
, Src_UserID_Sub.IsInternal
ORDER BY
Src_UserID_Sub.FirstUserSequence
) AS [RowCount]
FROM (
/*This subquery is used to get the number of entries per user,
as well as the earliest sequence related to said entries*/
SELECT
Src.ItemID
, Src.IsInternal
, Src.UserID
, COUNT(*) AS CountPerUser
, MIN(Src.Sequence) AS FirstUserSequence
FROM Src
GROUP BY Src.ItemID, Src.IsInternal, Src.UserID
) AS Src_UserID_Sub
)
Improvements to formatting: Changed subquery alias from Src_UserID
to Src_UserID_Sub
to differentiate it from the CTE name and therefore make the code less ambiguous. Got rid of SELECT *
as mentioned above. Added a small amount of documentation explaining what the subquery is for.
4. Hash Match
So here is the most expensive operation in your whole execution, at 31% operator cost. I'm going to quote one of the pros on DBA.StackExchange:
The hash join is one of the more expensive join operations, as it requires the creation of a hash table to do the join. That said, it’s the join that’s best for large, unsorted inputs. It is the most memory-intensive of any of the joins.
Now, the thing is with hash joins: they are not necessarily slow, but they can be, depending on the memory load of the server at the time of execution. Performance can also vary wildly based on the size of the build input vs. the amount of memory available, as the SQL optimizer will attempt to hold the hash table in memory if it can. If it cannot, due to insufficient memory, it has to resort to more complex constructs such as the Grace Hash Join and Recursive Hash Join.
The type of hash join that is used is not easily discerned when optimizing, as this is done dynamically. Per TechNet page on "Understanding Hash Joins":
It is not always possible during optimization to determine which hash join is used. Therefore, SQL Server starts by using an in-memory hash join and gradually transitions to grace hash join, and recursive hash join, depending on the size of the build input.
This one being less predictable, you will have to benchmark different solutions and compare the results. Would using temp tables instead of CTEs help? Maybe. Maybe not. Only way to know for sure is trying it in your environment.
SELECT
items.ItemID
, items.FirstSequence
, internal.firstInternal
, internal.FirstInternalSequence
, internal.CountPerFirstInternal
, items.CountAllInternal
, items.CountAll
, items.LastSequence
FROM Src_Items AS items
JOIN Src_FirstInternal AS internal
ON internal.ItemID = items.ItemID
;
Formatting improvements: changed aliases as follows: s0 -> items and s1 -> internal.
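When benchmarking the different variants against each other, SQL Server's built-in statistics output gives comparable numbers instead of a wall-clock feel:

```sql
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- ... run the query variant under test here ...

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
-- Compare 'CPU time', 'elapsed time' and 'logical reads' in the Messages tab.
```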
Overall
I think overall your SQL code is quite well written. From the looks of it, this probably belongs in a stored procedure. If it doesn't, maybe you should make it so; that can also help performance by promoting reuse of the cached execution plan after the first execution.
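A minimal sketch of such a wrapper (the procedure name is illustrative; the body would be the full query from the question):

```sql
CREATE PROCEDURE dbo.GetItemInteractionSummary
AS
BEGIN
    SET NOCOUNT ON;

    WITH Src AS (
        -- ... the full WITH ... SELECT from the question goes here ...
        SELECT 1 AS Placeholder  -- placeholder so the sketch parses
    )
    SELECT * FROM Src;
END;
GO
```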
- WOW, thanks @Phrancis! I appreciate the time and thought you put into this. Just for clarification - you say use a temp table instead of a CTE; I assume you mean the Src CTE. I'll try that. - Lockszmith, Dec 17, 2015
- @Lockszmith I was thinking a #TempTable that you just drop at the end. I think this would help you offset memory cost to an extent by using some disk instead. But like I said, under different loads and on different machines, only testing can tell! - Phrancis, Dec 18, 2015
- I finally figured it out, but am still keeping this as "The Answer". I'll post a full answer in the coming days, when I have some more time. My solution involves doing the first query with GROUPING SETS; the main CTE feeds other CTEs which use the GROUPING values to create different sets of the data. This allows a single sort operation on the main dataset, and then just plucking what's relevant from within. I'm down from never completing (I've usually stopped it after 20 minutes) to 15 seconds. - Lockszmith, Jan 7, 2016
- That's really awesome, glad you got it improved that much! - Phrancis, Jan 7, 2016
I'd like to start by thanking Phrancis for his time, and detailed answer.
A while after I received the answer, I finally revisited the issue with a clear head and solved the problem to my satisfaction.
The trick was to reduce the number of SORT operations the query performs. As originally set up, it was sorting inside nested queries, which is very expensive, since this in effect repeats the sort at each level.
SQL Server helps us with GROUPING SETS, which can perform a single sort operation and provide the different aspects of the grouping, identified by the result of the GROUPING() function.
By utilizing this on the source query, I was able to reduce the cost to a single SORT operation performed on the original data set, and then just 'slice' different sub-sections of the resulting dataset (the main CTE) into secondary CTEs.
This proved to be very effective, turning a query that would not complete even after 20 minutes into one that runs for a few seconds on the entire data set.
I'm posting here in the hopes this proves helpful to others looking at this question.
Instead of sorting in multiple CTEs, I'm performing a GROUPING SETS operation on the first CTE and then reusing it on following CTEs.
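As a minimal illustration of the mechanism (sample data invented for the example), a single GROUP BY GROUPING SETS produces several grouping levels in one pass, and GROUPING() tells you which level each row belongs to:

```sql
SELECT
      t.ItemID
    , t.UserID
    , COUNT(*)           AS _Count
    , GROUPING(t.UserID) AS gUserID   -- 1 when UserID is rolled up, 0 otherwise
FROM (VALUES
    (10001, 'A'),
    (10001, 'A'),
    (10001, 'X')
) AS t(ItemID, UserID)
GROUP BY GROUPING SETS ((t.ItemID), (t.ItemID, t.UserID));
-- Produces (row order not guaranteed):
--   10001, NULL, 3, 1   <- the (ItemID) grouping level
--   10001, 'A',  2, 0   <- the (ItemID, UserID) level
--   10001, 'X',  1, 0
```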
Below is the final code. Let me know if you have any questions regarding this implementation.
Demo Data (Complete As Temp-Tables)
----- Demo source data BEGIN
IF OBJECT_ID('tempdb..#Users') IS NOT NULL DROP TABLE #Users;
IF OBJECT_ID('tempdb..#Entries') IS NOT NULL DROP TABLE #Entries;
GO
CREATE TABLE #Users (
UserID VARCHAR(100) NOT NULL,
CONSTRAINT PK_#Users PRIMARY KEY (UserID)
);
CREATE TABLE #Entries (
ItemID INT NOT NULL,
UserID VARCHAR(100) NULL,
Sequence DATETIME NOT NULL,
CONSTRAINT PK_#Entries PRIMARY KEY (ItemID, Sequence),
CONSTRAINT FK_#Users FOREIGN KEY (UserID) REFERENCES #Users(UserID)
);
GO
INSERT INTO #Users (UserID)
SELECT 'A'
UNION ALL SELECT 'B' ;
INSERT INTO #Entries (ItemID, UserID, Sequence)
SELECT 10001, 'X', DATEADD(MINUTE, (1 * 60) + 1, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10001, 'A', DATEADD(MINUTE, (1 * 60) + 2, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10001, 'X', DATEADD(MINUTE, (1 * 60) + 3, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10001, 'B', DATEADD(MINUTE, (1 * 60) + 4, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10001, 'X', DATEADD(MINUTE, (1 * 60) + 5, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10001, 'A', DATEADD(MINUTE, (1 * 60) + 6, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE()))
UNION ALL
SELECT 10020, 'Y', DATEADD(MINUTE, (2 * 60) + 1, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10020, 'Y', DATEADD(MINUTE, (2 * 60) + 2, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10020, 'B', DATEADD(MINUTE, (2 * 60) + 3, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10020, 'Y', DATEADD(MINUTE, (2 * 60) + 4, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10020, 'A', DATEADD(MINUTE, (2 * 60) + 5, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10020, 'Y', DATEADD(MINUTE, (2 * 60) + 6, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10020, 'B', DATEADD(MINUTE, (2 * 60) + 7, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10020, 'B', DATEADD(MINUTE, (2 * 60) + 8, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE()))
UNION ALL
SELECT 10300, 'A', DATEADD(MINUTE, (3 * 60) + 1, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10300, 'Z', DATEADD(MINUTE, (3 * 60) + 2, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10300, 'Z', DATEADD(MINUTE, (3 * 60) + 3, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10300, 'Z', DATEADD(MINUTE, (3 * 60) + 4, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10300, 'A', DATEADD(MINUTE, (3 * 60) + 5, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10300, 'Z', DATEADD(MINUTE, (3 * 60) + 6, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE())) UNION ALL
SELECT 10300, 'A', DATEADD(MINUTE, (3 * 60) + 7, DATEADD( MINUTE, -DATEPART(MINUTE, GETDATE()), GETDATE()))
GO
----- Demo source data END
CODE
WITH Src AS (
SELECT
e.ItemID
, e.UserID
, e.IsInternal
, MIN(e.Sequence) minSequence
, MAX(e.Sequence) maxSequence
, COUNT(e.Sequence) _Count
, GROUPING(e.UserID) gUserID
, GROUPING(e.IsInternal) gIsInternal
FROM (
SELECT
e.ItemID
, CASE WHEN u.UserID IS NULL THEN 0 ELSE 1 END AS IsInternal
, e.UserID
, e.Sequence
FROM #Entries AS e
LEFT JOIN #Users as u
ON u.UserID = e.UserID
) AS e
GROUP BY GROUPING SETS ((e.ItemID), (e.ItemID, e.IsInternal), (e.ItemID, e.IsInternal, e.UserID))
)
, Src_1stUsers AS (
SELECT
Src.ItemID
, Src.UserID FirstInternal
, Src.minSequence FirstInternalSequence
, Src._Count CountPerFirstInternal
FROM Src
INNER JOIN (
SELECT
_Src.ItemID
, MIN(minSequence) minSequence
FROM Src AS _Src
WHERE
_Src.gUserID = 0
AND _Src.IsInternal = 1
AND _Src.UserID is not NULL
GROUP BY _Src.ItemID
) AS Src1
ON Src.ItemID = Src1.ItemID
AND Src.minSequence = Src1.minSequence
WHERE
Src.UserID is not NULL
),
Src_Item AS (
SELECT
Src.ItemID
, Src.minSequence FirstSequence
, Src.maxSequence LastSequence
, Src._Count CountAll
FROM Src
WHERE Src.gUserID = 1 AND Src.gIsInternal = 1
),
Src_Internal AS (
SELECT
Src.ItemID
, Src.minSequence FirstInternalSequence
, Src._Count CountAllInternal
FROM Src
WHERE Src.gUserID = 1 AND Src.gIsInternal = 0 AND Src.IsInternal = 1
)
SELECT
--* FROM Src
i.ItemID
, sItm.FirstSequence
, s1st.FirstInternal
, s1st.FirstInternalSequence
, s1st.CountPerFirstInternal
, sInt.CountAllInternal
, sItm.CountAll
, sItm.LastSequence
FROM (
SELECT DISTINCT ItemID FROM Src
) AS i
LEFT JOIN Src_1stUsers AS s1st
ON i.ItemID = s1st.ItemID
LEFT JOIN Src_Item as sItm
ON i.ItemID = sItm.ItemID
LEFT JOIN Src_Internal AS sInt
ON i.ItemID = sInt.ItemID
GO
With the demo data the effect isn't noticeable, but with the real dataset, 99% of the query time has passed by the 3rd step.
Execution Plan of Demo Data
Execution Plan of REAL data set *
* The real query doesn't use the SELECT DISTINCT, but a list of items that already exists in the database.