I have 3 tables: Room
, Conference
, and Participant
. Room
has many Conference
s, and Conference
has many Participant
s. I need my query to display the fields from Room
, as well as the number of associated Conferences
it has, and the sum of the number of associated Participant
s each Conference
has. Here's a cut-down version of the SELECT
query I wrote to get this info; first, I just selected the room ID:
SELECT TOP(1000)
rm.[Id]
FROM
[Room] rm
LEFT JOIN (
SELECT
conf.[Id] AS [ConferenceId],
MIN(conf.[Name]) AS [ConferenceName],
MIN(conf.[RoomId]) AS [RoomId],
COUNT(part.[Id]) AS CalcConferenceParticipantCount
FROM
[Conference] conf
LEFT JOIN
[Participant] part on part.[ConferenceId] = conf.[Id]
GROUP BY
conf.[Id]
) confData ON confData.[RoomId] = rm.[Id]
GROUP BY
rm.[Id]
This was very fast as SQL Server was able to just pull the data from Room
and pretty much ignore the subquery (see Trial 1 - Trial 4 in image below). Then I added in the ConferenceName
field from the subquery, as well as a count of the number of conferences per room:
SELECT TOP(1000)
rm.[Id],
COUNT(confData.[ConferenceId]) AS CalcRoomConferenceCount,
MIN(confData.[ConferenceName])
FROM
[Room] rm
LEFT JOIN (
SELECT
conf.[Id] AS [ConferenceId],
MIN(conf.[Name]) AS [ConferenceName],
MIN(conf.[RoomId]) AS [RoomId],
COUNT(part.[Id]) AS CalcConferenceParticipantCount
FROM
[Conference] conf
LEFT JOIN
[Participant] part on part.[ConferenceId] = conf.[Id]
GROUP BY
conf.[Id]
) confData ON confData.[RoomId] = rm.[Id]
GROUP BY
rm.[Id]
This slowed down the query quite a bit, by a factor of about 100 (see Trial 5 - Trial 7 in image below). I then added in the participant count from the subquery, meaning there were 2 levels of aggregate functions being used:
SELECT TOP(1000)
rm.[Id],
COUNT(confData.[ConferenceId]) AS CalcRoomConferenceCount,
MIN(confData.[ConferenceName]),
SUM(confData.[CalcConferenceParticipantCount]) AS CalcRoomParticipantCount
FROM
[Room] rm
LEFT JOIN (
SELECT
conf.[Id] AS [ConferenceId],
MIN(conf.[Name]) AS [ConferenceName],
MIN(conf.[RoomId]) AS [RoomId],
COUNT(part.[Id]) AS CalcConferenceParticipantCount
FROM
[Conference] conf
LEFT JOIN
[Participant] part on part.[ConferenceId] = conf.[Id]
GROUP BY
conf.[Id]
) confData ON confData.[RoomId] = rm.[Id]
GROUP BY
rm.[Id]
This further slowed down the query by a factor of about 4 (see Trial 8 - Trial 10 in image below). Here's the client statistics with data on the 10 trials:
Here's the query plan of the slow query: https://www.brentozar.com/pastetheplan/?id=SJpyeec5Q
Is there a way I can make this kind of query - where I calculate an aggregate of a subquery's aggregate - more efficient?
1 Answer 1
I mocked up data by looking at the row counts in your tables, giving them an even data distribution, and making guesses about the schema:
DROP TABLE IF EXISTS [Room];
CREATE TABLE [Room] (
[Id] BIGINT NOT NULL,
FILLER VARCHAR(200) NOT NULL,
PRIMARY KEY ([Id])
);
INSERT INTO [Room] WITH (TABLOCK)
SELECT TOP (3088) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), REPLICATE('Z', 200)
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);
DROP TABLE IF EXISTS [Conference];
CREATE TABLE [Conference] (
[Id] BIGINT NOT NULL,
[Name] VARCHAR(30) NOT NULL,
[RoomId] BIGINT NOT NULL,
FILLER VARCHAR(200) NOT NULL,
PRIMARY KEY ([Id])
);
INSERT INTO [Conference] WITH (TABLOCK)
SELECT RN
, 'MY FAVORITE MEETING ROOM'
, 1 + RN % 3088
, REPLICATE('Z', 200)
FROM
(
SELECT TOP (97413) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
) q
OPTION (MAXDOP 1);
DROP TABLE IF EXISTS [Participant];
CREATE TABLE [Participant] (
[Id] BIGINT NOT NULL,
[ConferenceId] BIGINT NOT NULL,
FILLER VARCHAR(200) NOT NULL,
PRIMARY KEY ([Id])
);
INSERT INTO [Participant] WITH (TABLOCK)
SELECT RN
, 1 + RN % 97413
, REPLICATE('Z', 200)
FROM
(
SELECT TOP (235323) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) RN
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
) q
OPTION (MAXDOP 1);
CREATE INDEX NCI_Part ON [Participant] ([ConferenceId]) INCLUDE (Id);
The most important assumption that I made about the schema is that the Id
column is the primary key of the [Conference]
table. This seemed reasonable given the query plan and the index names involved.
On my machine I get the same query plan as you, but my starting query only takes 163 ms of CPU. I assume the differences come down to differences in hardware, data distribution, and the fact that I'm not returning data to the client.
The first thing that jumped out to me is the unnecessary GROUP BY
in your confData
derived table. Id
is the primary key of the table so you don't need all of the aggregates. With the right indexes (which you already have for this particular case), subqueries aren't necessarily a bad thing. Rewriting what you have to remove the GROUP BY
:
SELECT TOP(1000)
rm.[Id],
COUNT(confData.[ConferenceId]) AS CalcRoomConferenceCount,
MIN(confData.[ConferenceName]),
SUM(confData.[CalcConferenceParticipantCount]) AS CalcRoomParticipantCount
FROM
[Room] rm
LEFT JOIN (
SELECT
conf.[Id] AS [ConferenceId],
conf.[Name] AS [ConferenceName],
conf.[RoomId] AS [RoomId],
(
SELECT COUNT(part.[Id])
FROM [Participant] part
WHERE part.[ConferenceId] = conf.[Id]
) AS CalcConferenceParticipantCount
FROM
[Conference] conf
) confData ON confData.[RoomId] = rm.[Id]
GROUP BY
rm.[Id]
OPTION (USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'));
This results in the stream aggregate getting pushed down further into the plan:
The uploaded plan takes 113 ms of CPU. The same operators are present, but some of them process fewer rows which saves time. You may be able to make this query more efficient by defining a covering index on the [Conference]
with Id
as the index key. This may seem like an odd thing to do, but your clustered index scan takes 10% of the overall query time and likely includes columns that you don't need.
If you want to make the query faster you could also consider an indexed view. Why perform the aggregation every time when you can define a simple indexed view to do it for you?
CREATE VIEW IndexedViewOnParticipant WITH SCHEMABINDING
AS
SELECT [ConferenceId], COUNT_BIG([Id]) CntId, COUNT_BIG(*) Cnt
FROM dbo.[Participant]
GROUP BY [ConferenceId];
GO
CREATE UNIQUE CLUSTERED INDEX CI ON IndexedViewOnParticipant ([ConferenceId]);
This will result in a little more space and a little bit of overhead when doing DML on the table. Overall I'd say that it's a good use case for an indexed view. Rewriting the query again:
SELECT TOP(1000)
rm.[Id],
COUNT(confData.[ConferenceId]) AS CalcRoomConferenceCount,
MIN(confData.[ConferenceName]),
SUM(confData.[CalcConferenceParticipantCount]) AS CalcRoomParticipantCount
FROM
[Room] rm
LEFT JOIN (
SELECT
conf.[Id] AS [ConferenceId],
conf.[Name] AS [ConferenceName],
conf.[RoomId] AS [RoomId],
(
SELECT CntId
FROM IndexedViewOnParticipant part WITH (NOEXPAND)
WHERE part.[ConferenceId] = conf.[Id]
) AS CalcConferenceParticipantCount
FROM
[Conference] conf
) confData ON confData.[RoomId] = rm.[Id]
GROUP BY
rm.[Id]
OPTION (USE HINT('FORCE_LEGACY_CARDINALITY_ESTIMATION'));
SQL Server agrees with my assessment that it's a good idea and CPU time falls to 78 ms.
On my machine I was able to make the query even faster, but this is starting to get into optimizations that are somewhat risky because it may require a LOOP JOIN
hint. That hint may not be a good idea as your query or the data in the table changes. It also may not be a good fit for your hardware. The idea behind this approach is to create a suitable index on [Conference]
and to take full advantage of the TOP
with a plan that only does nested loops. Here is the index that I added:
CREATE INDEX NCI_Conf ON [Conference] ([RoomId]) INCLUDE ([Name]);
Running the same query as before with a LOOP JOIN
hint gave me the following plan:
That query only took 58 ms of CPU time. It's worth mentioning that I noticed requesting the actual plan adds quite a bit of relative overhead at this stage. All of the other possible optimizations that come to mind to me aren't safe for production so I'll stop here.
As a final thought, do you really want to return 1000 arbitrary rows and the minimum conference name? Is that information useful to your end users?
Explore related questions
See similar questions with these tags.