I am trying to calculate a running total, but it should reset when the cumulative sum becomes greater than the value in another column.
create table #reset_runn_total
(
id int identity(1,1),
val int,
reset_val int,
grp int
)
insert into #reset_runn_total
values
(1,10,1), (8,12,1), (6,14,1), (5,10,1),
(6,13,1), (3,11,1), (9,8,1), (10,12,1)
SELECT ROW_NUMBER() OVER (PARTITION BY grp ORDER BY id) AS rn, *
INTO #test
FROM #reset_runn_total
Index details:
CREATE UNIQUE CLUSTERED INDEX ix_load_reset_runn_total
ON #test(rn, grp)
Sample data:
+----+-----+-----------+-----+
| id | val | reset_val | Grp |
+----+-----+-----------+-----+
| 1 | 1 | 10 | 1 |
| 2 | 8 | 12 | 1 |
| 3 | 6 | 14 | 1 |
| 4 | 5 | 10 | 1 |
| 5 | 6 | 13 | 1 |
| 6 | 3 | 11 | 1 |
| 7 | 9 | 8 | 1 |
| 8 | 10 | 12 | 1 |
+----+-----+-----------+-----+
Expected result:
+----+-----+-----------+-------------+
| id | val | reset_val | Running_tot |
+----+-----+-----------+-------------+
| 1  | 1   | 10        | 1           |
| 2  | 8   | 12        | 9           | -- 1+8
| 3  | 6   | 14        | 15          | -- 1+8+6, greater than reset_val
| 4  | 5   | 10        | 5           | -- reset
| 5  | 6   | 13        | 11          | -- 5+6
| 6  | 3   | 11        | 14          | -- 5+6+3, greater than reset_val
| 7  | 9   | 8         | 9           | -- reset, greater than reset_val
| 8  | 10  | 12        | 10          | -- reset
+----+-----+-----------+-------------+
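To make the requested behavior concrete, here is a small Python sketch of the same logic (illustrative only, not T-SQL), run against the sample data:

```python
# Illustrative Python sketch of the requested logic (not T-SQL).
# Accumulate val; once an emitted total exceeds reset_val, the NEXT
# row starts a fresh total (the too-large total itself is still emitted).
rows = [  # (id, val, reset_val) from the sample data
    (1, 1, 10), (2, 8, 12), (3, 6, 14), (4, 5, 10),
    (5, 6, 13), (6, 3, 11), (7, 9, 8), (8, 10, 12),
]

def reset_running_total(rows):
    out, total = [], 0
    for _id, val, reset_val in rows:
        total += val
        out.append((_id, total))
        if total > reset_val:   # reset takes effect on the next row
            total = 0
    return out

print(reset_running_total(rows))
# [(1, 1), (2, 9), (3, 15), (4, 5), (5, 11), (6, 14), (7, 9), (8, 10)]
```

These values match the Running_tot column in the expected result above.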
Query:
I got the result using a recursive CTE. The original question is here: https://stackoverflow.com/questions/42085404/reset-running-total-based-on-another-column
;WITH cte
AS (SELECT rn,
           id,
           val,
           reset_val,
           grp,
           val AS running_total,
           IIF(val > reset_val, 1, 0) AS flag
    FROM   #test
    WHERE  rn = 1
    UNION ALL
    SELECT r.*,
           IIF(c.flag = 1, r.val, c.running_total + r.val),
           IIF(IIF(c.flag = 1, r.val, c.running_total + r.val) > r.reset_val, 1, 0)
    FROM   cte c
           JOIN #test r
             ON r.grp = c.grp
                AND r.rn = c.rn + 1)
SELECT *
FROM   cte;
Is there any better alternative in T-SQL, without using CLR?
4 Answers
I've looked at similar problems and have never been able to find a window function solution that does a single pass over the data. I don't think it's possible. Window functions need to be applied to all of the values in a column, which makes reset calculations like this very difficult, because one reset changes the value for all of the following rows.
One way to think about the problem: you can get the end result you want by calculating a basic running total, as long as you subtract the running total from the correct previous row. For example, in your sample data the value for id 4 is the running total of row 4 minus the running total of row 3. The value for id 6 is the running total of row 6 minus the running total of row 3, because a reset hasn't happened yet. The value for id 7 is the running total of row 7 minus the running total of row 6, and so on.
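That subtraction idea can be checked with a short Python sketch (illustrative pseudocode, not the T-SQL implementation):

```python
# Verify the insight: the adjusted total equals the plain running total
# minus the running total of the most recent row that triggered a reset.
vals       = [1, 8, 6, 5, 6, 3, 9, 10]    # sample data
reset_vals = [10, 12, 14, 10, 13, 11, 8, 12]

# Standard running totals: 1, 9, 15, 20, 26, 29, 38, 48
running, s = [], 0
for v in vals:
    s += v
    running.append(s)

adjusted, subtract = [], 0
for rt, rv in zip(running, reset_vals):
    adjusted.append(rt - subtract)
    if rt - subtract > rv:   # this row triggers a reset
        subtract = rt        # later rows subtract this running total
print(adjusted)
# [1, 9, 15, 5, 11, 14, 9, 10]
```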
I would approach this with T-SQL in a loop. I got a little carried away and think I have a full solution. For 3 million rows and 500 groups the code finished in 24 seconds on my desktop. I'm testing with SQL Server 2016 Developer edition with 6 vCPU. I'm taking advantage of parallel inserts and parallel execution in general so you may need to change the code if you're on an older version or have DOP limitations.
Below is the code that I used to generate the data. The ranges on VAL and RESET_VAL should be similar to your sample data.
drop table if exists reset_runn_total;

create table reset_runn_total
(
    id int identity(1,1),
    val int,
    reset_val int,
    grp int
);

DECLARE @group_num INT,
        @row_num INT;

BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;

    SET @group_num = 1;
    WHILE @group_num <= 50000
    BEGIN
        SET @row_num = 1;
        WHILE @row_num <= 60
        BEGIN
            INSERT INTO reset_runn_total WITH (TABLOCK)
            SELECT 1 + ABS(CHECKSUM(NEWID())) % 10, 8 + ABS(CHECKSUM(NEWID())) % 8, @group_num;
            SET @row_num = @row_num + 1;
        END;
        SET @group_num = @group_num + 1;
    END;

    COMMIT TRANSACTION;
END;
The algorithm is as follows:
1) Start by inserting all rows, with a standard running total, into a temp table.
2) In a loop:
2a) For each group, find the first row remaining in the table whose adjusted running total exceeds reset_val, and store in a second temp table its id, that too-large running total (to subtract on the next loop), and the running total to subtract on this loop.
2b) Delete rows from the first temp table into a results temp table if they have an ID less than or equal to the ID in the second temp table, using the other columns to adjust the running total as needed.
3) Once the delete no longer processes any rows, run an additional DELETE ... OUTPUT into the results table. This handles rows at the end of a group that never exceed the reset value.
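As a sanity check, the steps above can be simulated for a single group with a small Python sketch (illustrative only; the real T-SQL implementation works set-based across all groups at once):

```python
# Python simulation of the batch algorithm for one group (illustrative).
# Each pass finds the first remaining row whose adjusted running total
# exceeds reset_val, then "moves" all rows up to and including it.
def batch_reset_totals(rows):
    # rows: list of (id, val, reset_val); precompute standard running totals
    running, s = [], 0
    for _id, v, rv in rows:
        s += v
        running.append((_id, s, rv))

    final, remaining, subtract = [], running, 0
    while remaining:
        # Step 2a: first row whose adjusted total exceeds its reset_val
        cut = next((i for i, (_id, rt, rv) in enumerate(remaining)
                    if rt - subtract > rv), None)
        if cut is None:          # Step 3: tail rows that never exceed the reset
            final += [(_id, rt - subtract) for _id, rt, rv in remaining]
            break
        # Step 2b: move rows up to the cut, adjusting by the stored amount
        final += [(_id, rt - subtract) for _id, rt, rv in remaining[:cut + 1]]
        subtract = remaining[cut][1]   # next pass subtracts this running total
        remaining = remaining[cut + 1:]
    return final

rows = [(1, 1, 10), (2, 8, 12), (3, 6, 14), (4, 5, 10),
        (5, 6, 13), (6, 3, 11), (7, 9, 8), (8, 10, 12)]
print(batch_reset_totals(rows))
# [(1, 1), (2, 9), (3, 15), (4, 5), (5, 11), (6, 14), (7, 9), (8, 10)]
```

The number of passes equals the maximum number of resets in any group, not the number of rows, which is why this scales much better than a row-by-row recursive CTE.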
I'll go through one implementation of the above algorithm in T-SQL, step by step. Start by creating a few temp tables: #initial_results holds the original data with the standard running total, #group_bookkeeping is updated each loop to figure out which rows can be moved, and #final_results contains the results with the running total adjusted for resets.
CREATE TABLE #initial_results (
id int,
val int,
reset_val int,
grp int,
initial_running_total int
);
CREATE TABLE #group_bookkeeping (
grp int,
max_id_to_move int,
running_total_to_subtract_this_loop int,
running_total_to_subtract_next_loop int,
grp_done bit,
PRIMARY KEY (grp)
);
CREATE TABLE #final_results (
id int,
val int,
reset_val int,
grp int,
running_total int
);
INSERT INTO #initial_results WITH (TABLOCK)
SELECT ID, VAL, RESET_VAL, GRP, SUM(VAL) OVER (PARTITION BY GRP ORDER BY ID) RUNNING_TOTAL
FROM reset_runn_total;
CREATE CLUSTERED INDEX i1 ON #initial_results (grp, id);
INSERT INTO #group_bookkeeping WITH (TABLOCK)
SELECT DISTINCT GRP, 0, 0, 0, 0
FROM reset_runn_total;
I create the clustered index on the temp table afterwards so the insert and the index build can be done in parallel. That made a big difference on my machine, but it may not on yours. Creating an index on the source table didn't seem to help here, but it could on your machine.
The code below runs in the loop and updates the bookkeeping table. For each group, we need to find the maximum ID that should be moved into the results table. We also need the running total from that row so we can subtract it from the initial running total. The grp_done column is set to 1 when there isn't any more work to do for a grp.
WITH UPD_CTE AS (
SELECT
#group_bookkeeping.GRP
, MIN(CASE WHEN initial_running_total - #group_bookkeeping.running_total_to_subtract_next_loop > RESET_VAL THEN ID ELSE NULL END) max_id_to_update
, MIN(#group_bookkeeping.running_total_to_subtract_next_loop) running_total_to_subtract_this_loop
, MIN(CASE WHEN initial_running_total - #group_bookkeeping.running_total_to_subtract_next_loop > RESET_VAL THEN initial_running_total ELSE NULL END) additional_value_next_loop
, CASE WHEN MIN(CASE WHEN initial_running_total - #group_bookkeeping.running_total_to_subtract_next_loop > RESET_VAL THEN ID ELSE NULL END) IS NULL THEN 1 ELSE 0 END grp_done
FROM #group_bookkeeping
INNER JOIN #initial_results IR ON #group_bookkeeping.grp = ir.grp
WHERE #group_bookkeeping.grp_done = 0
GROUP BY #group_bookkeeping.GRP
)
UPDATE #group_bookkeeping
SET #group_bookkeeping.max_id_to_move = uv.max_id_to_update
, #group_bookkeeping.running_total_to_subtract_this_loop = uv.running_total_to_subtract_this_loop
, #group_bookkeeping.running_total_to_subtract_next_loop = uv.additional_value_next_loop
, #group_bookkeeping.grp_done = uv.grp_done
FROM UPD_CTE uv
WHERE uv.GRP = #group_bookkeeping.grp
OPTION (LOOP JOIN);
I'm really not a fan of the LOOP JOIN hint in general, but this is a simple query and it was the quickest way to get what I wanted. To really optimize for response time, I wanted parallel nested loop joins instead of DOP 1 merge joins.
The code below runs in the loop and moves data from the initial table into the final results table. Notice the adjustment to the initial running total.
DELETE ir
OUTPUT DELETED.id,
DELETED.VAL,
DELETED.RESET_VAL,
DELETED.GRP ,
DELETED.initial_running_total - tb.running_total_to_subtract_this_loop
INTO #final_results
FROM #initial_results ir
INNER JOIN #group_bookkeeping tb ON ir.GRP = tb.GRP AND ir.ID <= tb.max_id_to_move
WHERE tb.grp_done = 0;
For your convenience, below is the full code:
DECLARE @RC INT;
BEGIN
SET NOCOUNT ON;
CREATE TABLE #initial_results (
id int,
val int,
reset_val int,
grp int,
initial_running_total int
);
CREATE TABLE #group_bookkeeping (
grp int,
max_id_to_move int,
running_total_to_subtract_this_loop int,
running_total_to_subtract_next_loop int,
grp_done bit,
PRIMARY KEY (grp)
);
CREATE TABLE #final_results (
id int,
val int,
reset_val int,
grp int,
running_total int
);
INSERT INTO #initial_results WITH (TABLOCK)
SELECT ID, VAL, RESET_VAL, GRP, SUM(VAL) OVER (PARTITION BY GRP ORDER BY ID) RUNNING_TOTAL
FROM reset_runn_total;
CREATE CLUSTERED INDEX i1 ON #initial_results (grp, id);
INSERT INTO #group_bookkeeping WITH (TABLOCK)
SELECT DISTINCT GRP, 0, 0, 0, 0
FROM reset_runn_total;
SET @RC = 1;
WHILE @RC > 0
BEGIN
WITH UPD_CTE AS (
SELECT
#group_bookkeeping.GRP
, MIN(CASE WHEN initial_running_total - #group_bookkeeping.running_total_to_subtract_next_loop > RESET_VAL THEN ID ELSE NULL END) max_id_to_move
, MIN(#group_bookkeeping.running_total_to_subtract_next_loop) running_total_to_subtract_this_loop
, MIN(CASE WHEN initial_running_total - #group_bookkeeping.running_total_to_subtract_next_loop > RESET_VAL THEN initial_running_total ELSE NULL END) additional_value_next_loop
, CASE WHEN MIN(CASE WHEN initial_running_total - #group_bookkeeping.running_total_to_subtract_next_loop > RESET_VAL THEN ID ELSE NULL END) IS NULL THEN 1 ELSE 0 END grp_done
FROM #group_bookkeeping
CROSS APPLY (SELECT ID, RESET_VAL, initial_running_total FROM #initial_results ir WHERE #group_bookkeeping.grp = ir.grp ) ir
WHERE #group_bookkeeping.grp_done = 0
GROUP BY #group_bookkeeping.GRP
)
UPDATE #group_bookkeeping
SET #group_bookkeeping.max_id_to_move = uv.max_id_to_move
, #group_bookkeeping.running_total_to_subtract_this_loop = uv.running_total_to_subtract_this_loop
, #group_bookkeeping.running_total_to_subtract_next_loop = uv.additional_value_next_loop
, #group_bookkeeping.grp_done = uv.grp_done
FROM UPD_CTE uv
WHERE uv.GRP = #group_bookkeeping.grp
OPTION (LOOP JOIN);
DELETE ir
OUTPUT DELETED.id,
DELETED.VAL,
DELETED.RESET_VAL,
DELETED.GRP ,
DELETED.initial_running_total - tb.running_total_to_subtract_this_loop
INTO #final_results
FROM #initial_results ir
INNER JOIN #group_bookkeeping tb ON ir.GRP = tb.GRP AND ir.ID <= tb.max_id_to_move
WHERE tb.grp_done = 0;
SET @RC = @@ROWCOUNT;
END;
DELETE ir
OUTPUT DELETED.id,
DELETED.VAL,
DELETED.RESET_VAL,
DELETED.GRP ,
DELETED.initial_running_total - tb.running_total_to_subtract_this_loop
INTO #final_results
FROM #initial_results ir
INNER JOIN #group_bookkeeping tb ON ir.GRP = tb.GRP;
CREATE CLUSTERED INDEX f1 ON #final_results (grp, id);
/* -- do something with the data
SELECT *
FROM #final_results
ORDER BY grp, id;
*/
DROP TABLE #final_results;
DROP TABLE #initial_results;
DROP TABLE #group_bookkeeping;
END;
- Simply awesome, I will award you the bounty. – Pரதீப் (Feb 9, 2017 at 10:17)
- On our server, for 50000 grps and 60 ids, yours took 1 minute and 10 seconds; the recursive CTE took 2 minutes and 15 seconds. – Pரதீப் (Feb 9, 2017 at 11:00)
- I tested both codes with the same data. Yours was awesome. Can it be improved further? – Pரதீப் (Feb 9, 2017 at 14:48)
- I meant that I ran your code on our real data and tested it. The calculation is processed in temp tables in my real procedure, so it is most likely tightly packed. It would be good if it could be reduced to around 30 seconds. – Pரதீப் (Feb 9, 2017 at 15:20)
- @Prdp I tried a quick approach that used an update, but it seemed to be worse. I won't be able to look into this more for a while. Try logging how long each operation takes so you can figure out which part runs slowest on your server. It's definitely possible that there's a way to speed up this code, or a better algorithm in general. – Joe Obbish (Feb 9, 2017 at 15:44)
Using a CURSOR:
ALTER TABLE #reset_runn_total ADD RunningTotal int;

DECLARE @id int, @val int, @reset int, @acm int, @grp int, @last_grp int;
SET @acm = 0;

DECLARE curRes CURSOR FAST_FORWARD FOR
    SELECT id, val, reset_val, grp
    FROM #reset_runn_total
    ORDER BY grp, id;

OPEN curRes;
FETCH NEXT FROM curRes INTO @id, @val, @reset, @grp;
SET @last_grp = @grp;

WHILE @@FETCH_STATUS = 0
BEGIN
    IF @grp <> @last_grp SET @acm = 0;
    SET @last_grp = @grp;
    SET @acm = @acm + @val;

    UPDATE #reset_runn_total
    SET RunningTotal = @acm
    WHERE id = @id;

    IF @acm > @reset SET @acm = 0;

    FETCH NEXT FROM curRes INTO @id, @val, @reset, @grp;
END

CLOSE curRes;
DEALLOCATE curRes;
+----+-----+-----------+--------------+
| id | val | reset_val | RunningTotal |
+----+-----+-----------+--------------+
| 1  | 1   | 10        | 1            |
| 2  | 8   | 12        | 9            |
| 3  | 6   | 14        | 15           |
| 4  | 5   | 10        | 5            |
| 5  | 6   | 13        | 11           |
| 6  | 3   | 11        | 14           |
| 7  | 9   | 8         | 9            |
| 8  | 10  | 12        | 10           |
+----+-----+-----------+--------------+
Check here: http://rextester.com/WSPLO95303
Not windowed, but a pure SQL version:
WITH x AS (
    SELECT a.id,
           a.val,
           a.reset_val,
           a.val AS running_total,
           1 AS level
    FROM (SELECT TOP 1 id, val, reset_val
          FROM reset_runn_total
          ORDER BY id) AS a   -- ORDER BY makes the TOP 1 deterministic
    UNION ALL
    SELECT r.id,
           r.val,
           r.reset_val,
           CASE WHEN x.running_total <= x.reset_val THEN x.running_total + r.val ELSE r.val END,
           x.level + 1
    FROM x JOIN reset_runn_total AS r ON (r.id > x.id)
)
SELECT *
FROM x
WHERE NOT EXISTS (
    SELECT 1
    FROM x AS x2
    WHERE x2.id = x.id
      AND x2.level > x.level
)
ORDER BY id, level DESC;
I am not a specialist in SQL Server's dialect. This is an initial version for PostgreSQL (if I understand correctly, I cannot use LIMIT 1 / TOP 1 in the recursive part in SQL Server):
WITH RECURSIVE x AS (
(SELECT id, val, reset_val, val AS running_total
FROM reset_runn_total
ORDER BY id
LIMIT 1)
UNION
(SELECT r.id, r.val, r.reset_val,
CASE WHEN x.running_total <= x.reset_val THEN x.running_total + r.val ELSE r.val END
FROM x JOIN reset_runn_total AS r ON (r.id > x.id)
ORDER BY id
LIMIT 1)
) SELECT * FROM x;
- @JoeObbish To be honest, that is not entirely clear from the question. The expected results, for example, show no grp column. – ypercubeᵀᴹ (Feb 11, 2017 at 20:08)
- @JoeObbish That's what I understood as well, yet the question could benefit from an explicit statement about it. The code in the question (with the CTE) doesn't use it either (and it even has differently named columns). It should be obvious to anyone who reads the question; they shouldn't have to read the other answers or comments. – ypercubeᵀᴹ (Feb 11, 2017 at 20:28)
- @ypercubeᵀᴹ Added the required information to the question. – Pரதீப் (Feb 14, 2017 at 11:02)
It seems you have several queries / methods to attack the problem, but you haven't provided us with - or even considered? - the indexes on the table.
What indexes are there in the table? Is it a heap or does it have a clustered index?
I would try the various solutions suggested after adding this index: (grp, id) INCLUDE (val, reset_val). Or just change (or create) the clustered index to be (grp, id).
Having an index that targets the specific query should improve the efficiency of most, if not all, of the methods.
- Added the required information to the question. – Pரதீப் (Feb 13, 2017 at 12:40)
- 50000 groups with 60 ids, so the total count of records will be around 3000000. I am sure the recursive CTE will not scale well on 3000000 rows. I will update the metrics when I get back to the office. Can we achieve this using SUM() OVER (ORDER BY ...) like you have used in this article? sqlperformance.com/2012/07/t-sql-queries/running-totals