How do I calculate the percentile rank of the nth element of a field relative to the elements that precede it in a time series, for each subset (partition) of my data?

For example, this is exactly what I want to do, except it is a cumulative sum, not a percentile:

SELECT *
FROM  (
   SELECT trade_date, daily_val, id, company_id
        , sum(daily_val) OVER (PARTITION BY company_id ORDER BY trade_date
                               ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS sum
   FROM   daily_data
   WHERE  (company_id = 858 OR company_id = 726)
   AND    wh_calc_id = 156
   ) AS x
ORDER BY company_id, trade_date;

This is what I want to do:

SELECT *
FROM  (
   SELECT trade_date, daily_val, id, company_id
        , percent_rank(daily_val) OVER (PARTITION BY company_id ORDER BY trade_date
                                        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS pctl
   FROM   daily_data
   WHERE  (company_id = 858 OR company_id = 726)
   AND    wh_calc_id = 156
   ) AS x
ORDER BY company_id, trade_date;

(Note: the only difference between the two is sum() vs percent_rank())

It seems I need a percentile rank function that takes a sorted expression argument. I know it exists in more recent versions, such as PostgreSQL 9.5:
https://www.postgresql.org/docs/9.5/static/functions-aggregate.html

And the 9.3 version doesn't seem to take an argument:
https://www.postgresql.org/docs/9.3/static/functions-window.html
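
To illustrate (my own sketch against the same table): the 9.3 window function ranks each row against its whole partition, not just the rows before it, so it doesn't do what I need:

SELECT trade_date, daily_val, company_id
     , percent_rank() OVER (PARTITION BY company_id ORDER BY daily_val) AS pctl  -- no argument allowed
FROM   daily_data
WHERE  (company_id = 858 OR company_id = 726)
AND    wh_calc_id = 156;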

I really don't want to upgrade PostgreSQL as I already have a lot going on right now.
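
For reference, a minimal table definition consistent with the queries above (the column types are my guesses, not the actual schema):

CREATE TABLE daily_data (
  id         serial PRIMARY KEY  -- assumed PK
, company_id integer NOT NULL
, wh_calc_id integer NOT NULL
, trade_date date    NOT NULL
, daily_val  numeric             -- may allow NULL (see the answer below)
);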

Here is some simplified example output of what I want (created in Excel):

trade_date   daily_val     pctl      company_id
1/1/1900     1484.881621   1.000000  1
1/2/1900     1465.465315   0.000000  1
1/3/1900     1471.244094   0.500000  1
1/4/1900     1475.109578   0.666000  1
1/5/1900     1484.881621   0.750000  1
1/6/1900     1508.863507   1.000000  1
1/7/1900     1519.089431   1.000000  1
1/8/1900     1504.819339   0.714000  1
1/9/1900     1523.210769   1.000000  1
1/10/1900    1514.990335   0.777000  1
1/11/1900    1550.595515   1.000000  1
1/12/1900    1529.454907   0.909000  1
1/13/1900    1546.324863   0.916000  1
1/14/1900    1542.077672   0.846000  1
1/15/1900    1561.396941   1.000000  1
1/16/1900    1519.089431   0.533000  1
1/17/1900    1498.774337   0.312000  1
1/18/1900    1486.847801   0.294000  1
1/19/1900    1488.819195   0.333000  1
1/20/1900    1506.83871    0.473000  1
1/21/1900    1500.777505   0.400000  1
1/1/1900     149.113683    1.000000  5
1/2/1900     155.505015    1.000000  5
1/3/1900     154.15655     0.500000  5
1/4/1900     160.425942    1.000000  5
1/5/1900     156.29799     0.750000  5
1/6/1900     164.058078    1.000000  5
1/7/1900     159.935225    0.666000  5
1/8/1900     158.537455    0.571000  5
1/9/1900     154.147639    0.125000  5
1/10/1900    155.722962    0.444000  5
1/11/1900    155.505015    0.300000  5
1/12/1900    162.766282    0.909000  5
1/13/1900    169.264949    1.000000  5
1/14/1900    167.73179     0.923000  5
1/15/1900    162.280915    0.714000  5
1/16/1900    159.906452    0.533000  5
1/17/1900    161.04602     0.687000  5
1/18/1900    157.656379    0.411000  5
1/19/1900    154.478044    0.166000  5
1/20/1900    154.576546    0.210000  5
1/21/1900    154.147639    0.050000  5

ntile() doesn't work either.

asked Feb 10, 2017 at 21:37
  • I assume that by available in PostgreSQL 9.3.5 you really meant PostgreSQL 9.5 (as the manual link indicates)? Either way, hypothetical-set aggregate functions were introduced with 9.4. Commented Feb 11, 2017 at 8:57
  • 9.3 is what I have. I provided the link to 9.5 because I thought it would be useful to people who have 9.5, so they don't have to do what I am doing. I didn't know it was available in 9.4. Commented Feb 11, 2017 at 13:24
  • Not yet. I am working on some infrastructure around this, so I can test this. If you look at my most recent question you will see how it relates. I will credit the answer as soon as I can test it. I look at this every day. Commented Feb 16, 2017 at 13:49

1 Answer

In Postgres 9.4+, you can use the hypothetical-set aggregate function percent_rank() in a LATERAL join like this:

SELECT *
FROM   daily_data d, LATERAL (
   SELECT round(percent_rank(d.daily_val) WITHIN GROUP (ORDER BY daily_val)::numeric, 6) AS pctl_calc
   FROM   daily_data
   WHERE  company_id = d.company_id
   AND    trade_date < d.trade_date
   ) x
ORDER BY company_id, trade_date;

The CROSS JOIN LATERAL (abbreviated to just , LATERAL above) is safe, because a subquery with an aggregate function always returns a row. (Otherwise you might want LEFT JOIN LATERAL (...) x ON true.)

Minor difference: The first row in each group returns 0, not 1 like in your example. Easily fixed, if you need it.
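
One way (a sketch, not from the original answer - it reuses the same look-back subquery and alias names as above): return the look-back row count alongside the rank and override the empty case:

SELECT d.*
     , CASE WHEN x.ct = 0 THEN numeric '1'  -- 1 to match the example output
            ELSE x.pctl_calc END AS pctl_calc
FROM   daily_data d, LATERAL (
   SELECT count(daily_val) AS ct  -- how many prior rows exist
        , round(percent_rank(d.daily_val) WITHIN GROUP (ORDER BY daily_val)::numeric, 6) AS pctl_calc
   FROM   daily_data
   WHERE  company_id = d.company_id
   AND    trade_date < d.trade_date
   ) x
ORDER BY company_id, trade_date;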

This would do the same job in Postgres 9.3:

SELECT d.*
     , round(CASE WHEN x.ct = 0 THEN numeric '1'  -- or 0 ?
             ELSE x.ct_lt / x.ct END, 6) AS pctl_calc
FROM   daily_data d, LATERAL (
   SELECT count(daily_val) AS ct
        , count(daily_val < d.daily_val OR NULL)::numeric AS ct_lt
   FROM   daily_data
   WHERE  company_id = d.company_id
   AND    trade_date < d.trade_date
   ) x
ORDER BY company_id, trade_date;

I use count(daily_val) for the full count, not count(*), to ignore NULL values - necessary if daily_val can be NULL.

SQL Fiddle.
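
As an aside: from Postgres 9.4 (so not an option for the 9.3 case at hand), the count(condition OR NULL) trick can be spelled with the aggregate FILTER clause, which reads more plainly. The LATERAL subquery above would become:

   SELECT count(daily_val) AS ct
        , count(*) FILTER (WHERE daily_val < d.daily_val)::numeric AS ct_lt  -- 9.4+ FILTER clause
   FROM   daily_data
   WHERE  company_id = d.company_id
   AND    trade_date < d.trade_date

Rows with a NULL daily_val fail the filter condition, so NULL handling matches the OR NULL variant.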

Or, without a LATERAL join, for even older versions. It may be faster, too:

SELECT company_id, trade_date, daily_val, pctl
     , round(CASE WHEN ct = 0 THEN numeric '1'  -- or 0 ?
             ELSE ct_lt / ct END, 6) AS pctl_calc
FROM  (
   SELECT d.*
        , count(x.daily_val) AS ct
        , count(x.daily_val < d.daily_val OR NULL)::numeric AS ct_lt
   FROM   daily_data d
   LEFT   JOIN daily_data x ON x.company_id = d.company_id
                           AND x.trade_date < d.trade_date
   GROUP  BY d.id  -- pg 9.1+
   ) sub
ORDER BY company_id, trade_date;

This assumes id is the PK, which allows the simplified GROUP BY d.id. That shortcut requires pg 9.1, though.
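
Before 9.1, you would spell out every ungrouped column instead of relying on the PK - a sketch, assuming the columns shown in the question:

   GROUP BY d.id, d.company_id, d.trade_date, d.daily_val  -- pre-9.1: list all columns of d you select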

answered Feb 11, 2017 at 8:53 by Erwin Brandstetter
  • @mountainclimber: The condition on company_id = d.company_id was missing. Also, check whether your values can be NULL ... Commented Feb 16, 2017 at 18:34
  • 1
    I am sorry, my update was in a hurry and wrong. The answer was correct to begin with. Reverted changes and made code null-safe instead of just warning against possible null values now. See updated fiddle (if the site isn't overloaded). The only error I see are rounding errors on your side. Commented Feb 16, 2017 at 22:10
  • 1
    @mountainclimber: You might add a TABLESPACE on another disk and move some objects there. See: dba.stackexchange.com/q/62969/3684 Commented Feb 17, 2017 at 20:43
  • 1
    for the 9.4+ version, to limit lookback (make it rolling) put something like and trade_date >= (d.trade_date - interval '20 years') in the where clause Commented May 24, 2017 at 18:16
  • Erwin - Do you think this query could be sped up via SET max_parallel_workers_per_gather = 2, assuming I have enough resources to accommodate it? Commented Aug 10, 2017 at 16:37
