I need to set a floor on the rolling sum calculation. For example, with
PKID NumValue GroupID
----------------------------
1 -1 1
2 -2 1
3 5 1
4 -7 1
5 1 2
I would like to have:
PKID RollingSum GroupID
----------------------------- ## Explanation:
1 0 1 ## 0 - 1 < 0 => 0
2 0 1 ## 0 - 2 < 0 => 0
3 5 1 ## 0 + 5 > 0 => 5
4 0 1 ## 5 - 7 < 0 => 0
When adding a negative number will cause the sum to be negative, the limit will be activated to set the result as zero. Subsequent addition should be based on this adjusted value, instead of the original rolling sum.
The expected result should be achieved using addition. If the fourth number changes from -7 to -3, the fourth result should be 2 instead of 0
If a single sum can be provided rather than a few rolling numbers, it would also be acceptable. I can use stored procedures to implement a non-negative addition, but that would be too low-level.
The real life problem for this is that we record order placed as positive amount and cancelled as negative. Due to connectivity issues customers may click the cancel
button more than once, which will result in multiple negative values being recorded. When calculating our revenue, "zero" need to be a boundary for sales.
This business application is absolutely stupid, but there's nothing I can do about that. For this question, please only consider solution that can be used by a DBA.
I expect fifty rows per GroupID
at most.
-
For purely friendly comparison. I was able to accomplish this in PostgreSQL without using recursion. I also upvoted this question as it is an abnormally fun and clever SQL problem.Evan Carroll– Evan Carroll2017年07月08日 17:02:42 +00:00Commented Jul 8, 2017 at 17:02
2 Answers 2
Here's a recursive CTE example I came up with (which seems to work). Is uses Row_Number() OVER to create a sequence number with no gaps. I don't know how well it would perform with your data, but it's something to try.
--Set up demo data
IF OBJECT_ID('tempdb..#temp') IS NOT NULL
drop table #temp
go
create table #temp (PKID int, NumValue int, GroupID int)
insert into #temp values
(1,-1,1), (3,-2,1), (5,5,1), (7,-3,1), (9,1,2)
--here is the real code
;
with RowNumberAddedToTemp as
(
SELECT
ROW_NUMBER() OVER(ORDER BY PKID ASC) AS rn,
* from #temp
)
,x
AS (
SELECT PKID --Anchor row
,NumValue
,RunningTotal = CASE
WHEN NumValue < 0 --if initial value less than zero, make zero
THEN 0
ELSE NumValue
END
,GroupID
,rn
FROM RowNumberAddedToTemp
WHERE rn = 1
UNION ALL
SELECT y.PKID
,y.NumValue
,CASE
WHEN x.GroupID <> y.groupid --did GroupId change?
THEN CASE
WHEN y.NumValue < 0 --if value is less than zero, make zero
THEN 0
ELSE y.numvalue --start new groupid totals
END
WHEN x.RunningTotal + y.NumValue < 0 --If adding the current row makes the total < 0, make zero
THEN 0
ELSE x.RunningTotal + y.NumValue --Add to the running total for the current groupid
END
,y.Groupid
,y.rn
FROM x
INNER JOIN RowNumberAddedToTemp AS y ON y.rn = x.rn + 1
)
SELECT PKID
,Numvalue
,RunningTotal
,GroupID
FROM x
ORDER BY PKID
OPTION (MAXRECURSION 10000);
Below is a recursive solution similar to Scott Hodgin's answer, but it should be able to better take advantage of an index. That will improve performance depending on what your data looks like. I mocked up one million rows of data with 20000 groups. Each has 50 rows associated with it:
DROP TABLE IF EXISTS biz_application_problems;
CREATE TABLE dbo.biz_application_problems (
PKID BIGINT NOT NULL,
NumValue INT NOT NULL,
GroupID INT NOT NULL,
PRIMARY KEY (PKID)
);
INSERT INTO dbo.biz_application_problems WITH (TABLOCK)
SELECT TOP (1000000)
2 * ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
, CAST(CRYPT_GEN_RANDOM(1) AS INT) % 22 - 11
, 1 + (-1 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL))) / 50
FROM master..spt_values t1
CROSS JOIN master..spt_values t2
OPTION (MAXDOP 1);
CREATE INDEX gotta_go_fast ON dbo.biz_application_problems (GroupID, PKID)
INCLUDE (NumValue);
The nonclustered index that I added to the table should allow the recursive part of the CTE to get the next row with just one index seek. It also means that the order of data in the table according to the PK
doesn't matter. The other trick here is to use ROW_NUMBER()
to efficiently get the next row. This is necessary because TOP
cannot be used in the recursive part of a CTE. Here's the query:
WITH rec_cte AS (
SELECT TOP 1
GroupID
, PKID
, CASE WHEN NumValue < 0 THEN 0 ELSE NumValue END RunningSum
, NumValue -- for validation purposes only
FROM dbo.biz_application_problems
ORDER BY GroupId, PKID
UNION ALL
SELECT t.GroupID
, t.PKID
, t.RunningSum
, t.NumValue
FROM
(
SELECT
b.GroupID
, b.PKID
, CASE WHEN ca.RunningSum < 0 THEN 0 ELSE ca.RunningSum END RunningSum
, b.NumValue
, ROW_NUMBER() OVER (ORDER BY b.GroupId, b.PKID) rn
FROM dbo.biz_application_problems b
INNER JOIN rec_cte r ON
(b.GroupID = r.GroupID AND b.PKID > r.PKID) OR b.GroupID > r.GroupID
CROSS APPLY (
SELECT CASE WHEN b.GroupID <> r.GroupID THEN 0 ELSE r.RunningSum END + b.NumValue
) ca (RunningSum)
) t
WHERE t.rn = 1
)
SELECT *
FROM rec_cte
OPTION (MAXRECURSION 0);
Here's a sample of the results:
╔═════════╦══════╦════════════╦══════════╗
║ GroupID ║ PKID ║ RunningSum ║ NumValue ║
╠═════════╬══════╬════════════╬══════════╣
║ 7 ║ 700 ║ 13 ║ -4 ║
║ 8 ║ 702 ║ 0 ║ -2 ║
║ 8 ║ 704 ║ 7 ║ 7 ║
║ 8 ║ 706 ║ 8 ║ 1 ║
║ 8 ║ 708 ║ 3 ║ -5 ║
║ 8 ║ 710 ║ 0 ║ -4 ║
║ 8 ║ 712 ║ 0 ║ -7 ║
║ 8 ║ 714 ║ 7 ║ 7 ║
║ 8 ║ 716 ║ 2 ║ -5 ║
║ 8 ║ 718 ║ 10 ║ 8 ║
║ 8 ║ 720 ║ 0 ║ -11 ║
║ 8 ║ 722 ║ 0 ║ -7 ║
╚═════════╩══════╩════════════╩══════════╝
On my machine, the code takes around 10 seconds to process one million rows. We can see index seeks that return just one row for each iteration:
I uploaded the actual plan here as well.
Explore related questions
See similar questions with these tags.