0

I have a large SQL table with 100+ million records in SQL Server as follows:

+-----------+----------+----------+----------+----------+
|CustomerID |TransDate |Category | Num_Trans |Sum_Trans |
+-----------+----------+----------+----------+----------+
|457136432 |2022年12月31日|TAXI |18 |220.34 |
|863326783 |2022年12月31日|FOOD |76 |980.71 |
+-----------+----------+----------+----------+----------+

this table contains number of transactions and sum of transaction values for each category in a 6-month window prior to "TransDate" per Customer.

I need to calculate some percentile calculations for a reporting task (query below)

SELECT DISTINCT TransDate, Category, 
 , PERCENTILE_CONT (0.05) WITHIN GROUP (ORDER BY Num_Trans) OVER (PARTITION BY TransDate, Category)
 AS P_LOWER_NUM
 , PERCENTILE_CONT (0.5) WITHIN GROUP (ORDER BY Num_Trans) OVER (PARTITION BY TransDate, Category)
 AS P_MEDIAN_NUM
 , PERCENTILE_CONT (0.95) WITHIN GROUP (ORDER BY Num_Trans) OVER (PARTITION BY TransDate, Category)
 AS P_UPPER_NUM
 , PERCENTILE_CONT (0.05) WITHIN GROUP (ORDER BY Sum_Trans) OVER (PARTITION BY TransDate, Category)
 AS P_LOWER_SUM
 , PERCENTILE_CONT (0.5) WITHIN GROUP (ORDER BY Sum_Trans) OVER (PARTITION BY TransDate, Category)
 AS P_MEDIAN_SUM
 , PERCENTILE_CONT (0.95) WITHIN GROUP (ORDER BY Sum_Trans) OVER (PARTITION BY TransDate, Category)
 AS P_UPPER_SUM
FROM large_tbale

with the following non-clustered index on the table

CREATE NONCLUSTERED INDEX MyIndex 
ON large_table (TransDate, Category) 
INCLUDE (Num_Trans, Sum_Trans);

unfortunately, the table creation takes over 30 minutes which is not acceptable for my application.

to boost the performance, I even partitioned the table based on "TransDate" and created index on the partitions. but the execution time does not change significantly by table partitioning either.

Is there anyway to speed up these calculations?

PS- looking at the execution plan, it seems the problem is the fact that for each percentile_cont calculation, the query needs to sort the records. Is it possible to somehow sort "Num_Trans" and "Sum_trans" columns once and then do percentile calculations?

asked Jan 25, 2023 at 14:13
3
  • The query you have has no WHERE, so it's going to result in a scan of the entire table/index. If that table has 100m+ rows, then the above is going to need scan all 100m rows in the index. After that, it's going to have to perform a (very) expensive DISTINCT operation. Commented Jan 25, 2023 at 14:17
  • so there's no possibility to improve performance? even with partitioning? Commented Jan 25, 2023 at 14:21
  • If you're looking to do a big analytical query like this on a few columns, take a look at using a columnstore index. Commented Jan 25, 2023 at 14:40

0

Know someone who can answer? Share a link to this question via email, Twitter, or Facebook.

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.