How to change aggregate function without duplicating SQL (by using SQL)

Question 1

In SQL Server 2016 I have a scenario where data will be processed according to different aggregation functions in a large GROUP BY ROLLUP. I would like to have a stored procedure that has a parameter that specifies which aggregation function to use to describe the groupings in a way that does not risk SQL injection and takes advantage of compilation (it is a heavy stored procedure).

My thought is to use a collection of queries that each summarize the data's groupings with a particular aggregate function (e.g. agg.DataMin, agg.DataMedian, agg.DataWeightedAverage, and so on), and then select among them with the parameter in a CTE:

WITH AggData AS
(
 SELECT * FROM agg.DataMin WHERE @AggFunction = 1 
 UNION ALL
 SELECT * FROM agg.DataMedian WHERE @AggFunction = 2 
 UNION ALL
 SELECT * FROM agg.DataWeightedAverage WHERE @AggFunction = 3
)
SELECT ...

My concerns are query performance and industry best practice. The data table is of a reasonable size (2+ GB). I will have to add many aggregate queries, some of them inline table-valued functions for leave-out aggregations.

In the above, will the queries/table-valued functions only execute when the @AggFunction matches the WHERE condition or will they all execute and filter after the results are returned? If the latter, is there a method to short-circuit the evaluation of the unneeded queries at run-time? Also, is there some standard method to perform this in SQL that I have overlooked?

Question 2

What benefits of compilation do you think a stored procedure has that ad hoc SQL doesn't? Also you can control your own risk of SQL injection. Just having dynamic SQL does not automatically mean you are at risk.

Question 3

@AaronBertrand Compilation for performance. The environment cannot contain scenarios that could result in SQL injection at any level.

Question 4

What does "Compilation for performance" mean? Just to be clear: the statement in a stored procedure is "compiled" exactly the same way as an ad hoc statement is "compiled" - this is nothing at all like compilation of OO code. Not that this should change your approach - whether you use a stored procedure or not, you can use dynamic SQL, or not. And your code that constructs a dynamic SQL statement - whether it is in a stored procedure or not - can certainly validate that whatever expression is in @AggFunction is an expression that your code expects.

Question 5

Also if your stored procedure simply makes a decision like

SET @sql += CASE @AggFunction WHEN 1 THEN N'agg.DataMin' WHEN 2 THEN N'agg.DataMedian' WHEN 3 THEN N'agg.DataWeightedAverage' ELSE NULL END

I'm not sure how this is vulnerable to SQL injection.
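To make the idea concrete, here is a minimal sketch of a procedure built around that CASE mapping. The procedure name dbo.GetAggData is hypothetical, and the agg.* objects are assumed to exist as described in the question. Note that it rejects an unmapped code explicitly instead of relying on ELSE NULL, because concatenating NULL with += would turn the whole statement string into NULL.

```sql
-- Sketch only: dbo.GetAggData is a hypothetical name; agg.* objects come from the question.
CREATE PROCEDURE dbo.GetAggData
    @AggFunction INT
AS
BEGIN
    SET NOCOUNT ON;

    -- Map the integer code to a fixed, known object name.
    -- No caller-supplied text is ever concatenated into the statement.
    DECLARE @source NVARCHAR(300) =
        CASE @AggFunction
            WHEN 1 THEN N'agg.DataMin'
            WHEN 2 THEN N'agg.DataMedian'
            WHEN 3 THEN N'agg.DataWeightedAverage'
        END;

    -- An unmapped code leaves @source NULL; stop before building any SQL.
    IF @source IS NULL
    BEGIN
        RAISERROR( 'Unknown value for parameter @AggFunction (%i).', 16, 1, @AggFunction );
        RETURN;
    END;

    DECLARE @sql NVARCHAR(MAX) = N'SELECT * FROM ' + @source + N';';

    -- Each distinct statement text is compiled and cached as its own plan.
    EXEC sys.sp_executesql @sql;
END;
```

Since the only thing the caller controls is an INT, the text that reaches sp_executesql can only ever be one of the three hard-coded statements.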

Question 6

This post doesn't talk about compilation of stored procedures vs. ad hoc SQL (which I think is largely misunderstood here), but it certainly addresses the benefits of paying for compilation slightly more often to deal with parameter and parameter value variance: blogs.sqlsentry.com/aaronbertrand/…

Question 7

Contradiction Detection could kick in to make sure only one of the statements is run, and in my simple test it did as long as there was a statement-level recompile hint, but why risk it? For example:

USE tempdb
GO
-- CREATE SCHEMA agg
--DROP TABLE agg.DataMin
--DROP TABLE agg.DataMedian
--DROP TABLE agg.DataWeightedAverage
--GO
CREATE TABLE agg.DataMin ( x INT PRIMARY KEY )
CREATE TABLE agg.DataMedian ( x INT PRIMARY KEY )
CREATE TABLE agg.DataWeightedAverage ( x INT PRIMARY KEY )
GO
INSERT INTO agg.DataMin ( x )
SELECT object_id FROM sys.all_objects
INSERT INTO agg.DataMedian ( x )
SELECT object_id FROM sys.all_objects WHERE type = 'P'
INSERT INTO agg.DataWeightedAverage ( x )
SELECT object_id FROM sys.all_objects WHERE type = 'X'
GO
-- Are there some situations when it wouldn't...
DECLARE @AggFunction INT = 1
;WITH AggData AS
(
 SELECT * FROM agg.DataMin WHERE @AggFunction = 1 
 UNION ALL
 SELECT * FROM agg.DataMedian WHERE @AggFunction = 2 
 UNION ALL
 SELECT * FROM agg.DataWeightedAverage WHERE @AggFunction = 3
)
SELECT *
FROM AggData
OPTION ( RECOMPILE )

My results: [figure: Recompile in action]

In this simple example, only one table is scanned on the left with the recompile, and 3 tables are scanned on the right, without the recompile. The recompile hint allows the optimizer to "see" the parameter value and act accordingly. In a stored procedure where parameter sniffing would be used, a recompile would also be needed to get the same behaviour, either at statement or stored-proc level.

However, I cannot prove that contradiction detection will always occur, even with a recompile; you can't prove a negative. There may be some unknown situations where it does not kick in despite the recompile; excessive complexity springs to mind.
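For the stored procedure case, a minimal sketch of the statement-level hint might look like the following. The procedure name dbo.GetAggDataUnion is hypothetical, and the tables are the demo tables created above.

```sql
-- Sketch only: dbo.GetAggDataUnion is a hypothetical name.
CREATE PROCEDURE dbo.GetAggDataUnion
    @AggFunction INT
AS
BEGIN
    SET NOCOUNT ON;

    WITH AggData AS
    (
        SELECT x FROM agg.DataMin WHERE @AggFunction = 1
        UNION ALL
        SELECT x FROM agg.DataMedian WHERE @AggFunction = 2
        UNION ALL
        SELECT x FROM agg.DataWeightedAverage WHERE @AggFunction = 3
    )
    SELECT x
    FROM AggData
    OPTION ( RECOMPILE );  -- statement-level: the plan is rebuilt on every call using the actual @AggFunction value
END;
GO
-- Procedure-level alternative: CREATE PROCEDURE ... WITH RECOMPILE AS ...
```

The trade-off is that this statement pays the compilation cost on every execution, which matters more the larger the surrounding query gets.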

Also, there is no real advantage to using the CTE in your example, so why not keep it simple? You could just write some simple procedural SQL with IF...ELSE, which would guarantee that only one of your statements fires, e.g.:

DECLARE @AggFunction INT = 99
IF @AggFunction = 1
 SELECT * FROM agg.DataMin
ELSE IF @AggFunction = 2
 SELECT * FROM agg.DataMedian
ELSE IF @AggFunction = 3
 SELECT * FROM agg.DataWeightedAverage
ELSE
 RAISERROR( 'Unknown value for parameter @AggFunction (%i).', 16, 1, @AggFunction )

Add some parameter checking while you're at it. Hopefully this meets your requirements: only one statement is executed for a given call, it is safe, and it is simple to implement.

HTH

Question 8

I suspect that those filters will have startup predicates and only one branch was in fact scanned.
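One way to check this, sketched against the demo tables from the answer: run the same query without the hint and look at the per-table I/O. If startup predicates are in play, I would expect the tables in the skipped branches to report no reads, and the actual execution plan to show Filter operators with a Startup Expression Predicate, but that is something to verify on your own system.

```sql
-- Sketch only: compare per-table reads when the hint is left out.
SET STATISTICS IO ON;

DECLARE @AggFunction INT = 1;

WITH AggData AS
(
    SELECT * FROM agg.DataMin WHERE @AggFunction = 1
    UNION ALL
    SELECT * FROM agg.DataMedian WHERE @AggFunction = 2
    UNION ALL
    SELECT * FROM agg.DataWeightedAverage WHERE @AggFunction = 3
)
SELECT *
FROM AggData;  -- deliberately no OPTION ( RECOMPILE )

SET STATISTICS IO OFF;
```

The read counts appear in the Messages output, one line per table touched by the plan.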

Question 9

If used in a larger query, though, this pattern might not always get optimised like that, and even if it does, the cardinality estimates will likely be less accurate, as it is the same plan irrespective of which branch will be executed.

Question 10

@Martin Yes, that's exactly why I'm a big fan of dynamic SQL for this. Compile a plan for each possible branch; that compilation overhead will be worth it in the long run. And if parameter variance or data skew leads to parameter sniffing issues, you can always compile every time, too.

Question 11

@wBob If I use a CREATE TYPE TABLE with the IFs and embed that into the GROUP BY ROLLUP will I get the benefit of a precompiled sproc? All the result sets from the aggregated views/table-valued functions will have the same schema.

Question 12

All procs must be compiled before execution. Some sections of the proc may compile separately (eg dynamic SQL) and some sections may recompile (eg triggered by schema change, forced recompile). Table types basically behave like table variables so watch out for those estimated rowcounts of 1. This may not matter if you're not using them in joins. However I can't see the value of using them here; if you're inserting into the type to present elsewhere, why not just present it? I worked up an example here, see if it helps.
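This is not the linked example, just a minimal sketch of one way to combine the IF branches with a single rollup. It swaps the table type for a temp table (my assumption, chosen because temp tables get statistics where table variables do not) and invents a Bucket column so the demo tables from the answer have something to roll up on.

```sql
-- Sketch only: runs against the demo tables; #AggData and Bucket are made up for illustration.
DECLARE @AggFunction INT = 1;

CREATE TABLE #AggData ( Bucket INT NOT NULL, x INT NOT NULL );

-- Only the matching branch executes and loads the staging table.
IF @AggFunction = 1
    INSERT INTO #AggData ( Bucket, x ) SELECT x % 10, x FROM agg.DataMin;
ELSE IF @AggFunction = 2
    INSERT INTO #AggData ( Bucket, x ) SELECT x % 10, x FROM agg.DataMedian;
ELSE IF @AggFunction = 3
    INSERT INTO #AggData ( Bucket, x ) SELECT x % 10, x FROM agg.DataWeightedAverage;
ELSE
    RAISERROR( 'Unknown value for parameter @AggFunction (%i).', 16, 1, @AggFunction );

-- The rollup is written once and does not care which branch filled the table.
SELECT Bucket, COUNT(*) AS RowCnt
FROM #AggData
GROUP BY ROLLUP ( Bucket );

DROP TABLE #AggData;
```

The cost is an extra write of the chosen result set into tempdb, so whether this beats presenting the branch result directly depends on how large that set is.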

wBob · Accepted Answer · 2016-06-06 16:28:24Z
