Slow running aggregate of aggregate query

Question 1

I am looking for the total balance for all accounts that had a negative balance as of a specific point in time. I've already got the code below, and it works, but works horribly.

The transactions table contains 20 million rows. The query below takes about 30 seconds which isn't bad, but there's a second part in that I need to repeat @report_date for the first of every month since the inception of the database, which balloons the execution time to ~30 minutes.

This is on MSSQL2008 and I'm not getting any missing indexes warnings on the execution plan, but I suspect my problem is still with the indexes so I'm intentionally leaving them off of here.

CREATE TABLE transactions(
 transaction_id int,
 account_id int,
 department_id int,
 location_id int,
 post_date date,
 amount money
 );
SELECT
 t2.department_id,
 t2.location_id,
 SUM(t2.credit_balances)
FROM
 (
 SELECT
 t1.department_id,
 t1.location_id,
 t1.account_id,
 SUM(t1.amount) as credit_balances
 FROM
 transactions t1
 WHERE
 t1.post_date < @report_date
 GROUP BY
 t1.department_id,
 t1.location_id,
 t1.account_id
 HAVING
 SUM(t1.amount) < 0
 ) t2
GROUP BY
 t2.department_id,
 t2.location_id;

The table contains 38 distinct department_id,location_id combinations, and 4.5 million distinct department_id,location_id,account_id.

Question 2

How many rows in the table? How long does the query take?

Question 3

~20 million rows. The query takes about 30 seconds which isn't bad, but there's a second part in that I need to repeat @report_date for the first of every month since the inception of the database, which balloons the execution time to ~30 minutes.

Question 4

How many distinct dept,location combinations and dept,location,account combinations?

Question 5

38 records to the first, 4.5 million records for the second.

Question 6

Assuming the actual query you want optimising is the 30 minute one I'd probably consider working through it in an iterative way a month at a time from the start and making sure you have an index with post_date,department_id,location_id,account_id INCLUDE (amount) You'd need to maintain a temp table with clustered PK on department_id,location_id,account_id and a balance_to_date column. Then extract a month at a time and merge into that table. And calculate your monthly total for department_id, location_id

Question 7

As you are repeating this query for multiple months then you will be continually re-aggregating the same rows.

For example the rows in the first month will always be brought back by the t1.post_date < @report_date criteria so will be re-processed for every month.

To avoid this I'd probably consider working through it in an iterative way a month at a time from the start. Dependent on the volatility of historic data I might also consider storing the pre-calculated results in the database rather than re-calculating these each month.

To calculate this at run time you could create a temporary table with the following structure.

CREATE TABLE #balance
 (
 department_id INT NOT NULL,
 location_id INT NOT NULL,
 account_id INT NOT NULL,
 balance_to_date MONEY NOT NULL,
 PRIMARY KEY (department_id, location_id, account_id)
 );

You could also consider adding the following index on your transactions table

ALTER TABLE transactions
 ADD post_date_year_month AS (10000 * YEAR(post_date) + MONTH(post_date))
CREATE INDEX ix
 ON transactions(post_date_year_month, department_id, location_id, account_id)
 INCLUDE (amount)

Then extract a month at a time from transactions and merge into #balance (with a when matched then increment, when not matched insert).

The leading post_date_year_month column means that as long as you write the query sargably the extraction of each month can be done efficiently and the extracted rows for a month will be ordered by department_id, location_id, account_id making a merge join against #balance possible without a sort.

Whilst that could benefit this particular query you'd need to assess the utility of this index against your overall workload.

Then calculate the department_id, location_id totals from #balance (can leverage the PK order to avoid a sort) and store those somewhere and move onto the next month.

(Or possibly instead of #balance you could use a "temporary" permanent table balance and create an indexed view on that to avoid the separate explicit aggregation step and just copy the values straight from that before moving on)

Martin Smith Martin Smith 88.4k15 gold badges258 silver badges357 bronze badges · Accepted Answer · 2014-09-28 12:18:48Z

As you are repeating this query for multiple months then you will be continually re-aggregating the same rows.

For example the rows in the first month will always be brought back by the t1.post_date < @report_date criteria so will be re-processed for every month.

To avoid this I'd probably consider working through it in an iterative way a month at a time from the start. Dependent on the volatility of historic data I might also consider storing the pre-calculated results in the database rather than re-calculating these each month.

To calculate this at run time you could create a temporary table with the following structure.

CREATE TABLE #balance
 (
 department_id INT NOT NULL,
 location_id INT NOT NULL,
 account_id INT NOT NULL,
 balance_to_date MONEY NOT NULL,
 PRIMARY KEY (department_id, location_id, account_id)
 );

You could also consider adding the following index on your transactions table

ALTER TABLE transactions
 ADD post_date_year_month AS (10000 * YEAR(post_date) + MONTH(post_date))
CREATE INDEX ix
 ON transactions(post_date_year_month, department_id, location_id, account_id)
 INCLUDE (amount)

Then extract a month at a time from transactions and merge into #balance (with a when matched then increment, when not matched insert).

The leading post_date_year_month column means that as long as you write the query sargably the extraction of each month can be done efficiently and the extracted rows for a month will be ordered by department_id, location_id, account_id making a merge join against #balance possible without a sort.

Whilst that could benefit this particular query you'd need to assess the utility of this index against your overall workload.

Then calculate the department_id, location_id totals from #balance (can leverage the PK order to avoid a sort) and store those somewhere and move onto the next month.

(Or possibly instead of #balance you could use a "temporary" permanent table balance and create an indexed view on that to avoid the separate explicit aggregation step and just copy the values straight from that before moving on)

Stack Exchange Network

Slow running aggregate of aggregate query

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Slow running aggregate of aggregate query

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions