How do I calculate the percentile rank of the nth element in a field, relative to the elements preceding it in a time series, for each subset (partition) of my data?
For example, this is exactly what I want to do, except it is a cumulative sum, not a percentile:
select *
from (
 select trade_date, daily_val, id, company_id,
 sum(daily_val) over (partition by company_id
 order by trade_date
 rows between unbounded preceding and current row) as sum
 from daily_data
 where (company_id = 858 or company_id = 726)
 and wh_calc_id = 156
) as x
order by company_id, trade_date
This is what I want to do:
select *
from (
 select trade_date, daily_val, id, company_id,
 percent_rank(daily_val) over (partition by company_id
 order by trade_date
 rows between unbounded preceding and current row) as pctl
 from daily_data
 where (company_id = 858 or company_id = 726)
 and wh_calc_id = 156
) as x
order by company_id, trade_date
(Note: the only difference between the two is sum() vs percent_rank())
It seems I need a percentile-rank function that takes a sorted expression argument. I know it exists in more recent versions, such as PostgreSQL 9.5:
https://www.postgresql.org/docs/9.5/static/functions-aggregate.html
And the 9.3 version doesn't seem to take an argument:
https://www.postgresql.org/docs/9.3/static/functions-window.html
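As far as I can tell, the difference in syntax between the two forms is this (just to illustrate what I mean):

-- window function form (9.3): takes no argument, ranks the current row within the partition
percent_rank() over (partition by company_id order by daily_val)
-- hypothetical-set aggregate form (newer versions): takes the value to rank as an argument
percent_rank(daily_val) within group (order by daily_val)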
I really don't want to upgrade PostgreSQL as I already have a lot going on right now.
Here is some simplified example output of what I want (created in Excel):
trade_date daily_val pctl company_id
1/1/1900 1484.881621 1.000000 1
1/2/1900 1465.465315 0.000000 1
1/3/1900 1471.244094 0.500000 1
1/4/1900 1475.109578 0.666000 1
1/5/1900 1484.881621 0.750000 1
1/6/1900 1508.863507 1.000000 1
1/7/1900 1519.089431 1.000000 1
1/8/1900 1504.819339 0.714000 1
1/9/1900 1523.210769 1.000000 1
1/10/1900 1514.990335 0.777000 1
1/11/1900 1550.595515 1.000000 1
1/12/1900 1529.454907 0.909000 1
1/13/1900 1546.324863 0.916000 1
1/14/1900 1542.077672 0.846000 1
1/15/1900 1561.396941 1.000000 1
1/16/1900 1519.089431 0.533000 1
1/17/1900 1498.774337 0.312000 1
1/18/1900 1486.847801 0.294000 1
1/19/1900 1488.819195 0.333000 1
1/20/1900 1506.83871 0.473000 1
1/21/1900 1500.777505 0.400000 1
1/1/1900 149.113683 1.000000 5
1/2/1900 155.505015 1.000000 5
1/3/1900 154.15655 0.500000 5
1/4/1900 160.425942 1.000000 5
1/5/1900 156.29799 0.750000 5
1/6/1900 164.058078 1.000000 5
1/7/1900 159.935225 0.666000 5
1/8/1900 158.537455 0.571000 5
1/9/1900 154.147639 0.125000 5
1/10/1900 155.722962 0.444000 5
1/11/1900 155.505015 0.300000 5
1/12/1900 162.766282 0.909000 5
1/13/1900 169.264949 1.000000 5
1/14/1900 167.73179 0.923000 5
1/15/1900 162.280915 0.714000 5
1/16/1900 159.906452 0.533000 5
1/17/1900 161.04602 0.687000 5
1/18/1900 157.656379 0.411000 5
1/19/1900 154.478044 0.166000 5
1/20/1900 154.576546 0.210000 5
1/21/1900 154.147639 0.050000 5
ntile() doesn't work either.
- I assume by "available in PostgreSQL 9.3.5" you really meant PostgreSQL 9.5 (as the manual link indicates)? Either way, hypothetical-set aggregate functions were introduced with 9.4. (Erwin Brandstetter, Feb 11, 2017)
- 9.3 is what I have. I provided the link to 9.5 because I thought it would be useful to people who have 9.5, so that they don't have to do what I am doing. Didn't know it was available in 9.4. (mountainclimber11, Feb 11, 2017)
- Not yet. I am working on some infrastructure around this, so I can test this. If you look at my most recent question you will see how it relates. I will credit the answer as soon as I can test it. I look at this every day. (mountainclimber11, Feb 16, 2017)
1 Answer
In Postgres 9.4+, you can use the hypothetical-set aggregate function percent_rank() in a LATERAL join like this:
SELECT *
FROM daily_data d, LATERAL (
SELECT round(percent_rank(d.daily_val) WITHIN GROUP (ORDER BY daily_val)::numeric
, 6) AS pctl_calc
FROM daily_data
WHERE company_id = d.company_id
AND trade_date < d.trade_date
) x
ORDER BY company_id, trade_date;
The CROSS JOIN LATERAL (, LATERAL for short) is safe, because a subquery with an aggregate function always returns a row. (Else you might want LEFT JOIN LATERAL (...) x ON true.)
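Spelled out, that defensive variant would look like this (same result for this query, just explicit about preserving rows):

SELECT *
FROM daily_data d
LEFT JOIN LATERAL (
 SELECT round(percent_rank(d.daily_val) WITHIN GROUP (ORDER BY daily_val)::numeric
 , 6) AS pctl_calc
 FROM daily_data
 WHERE company_id = d.company_id
 AND trade_date < d.trade_date
 ) x ON true
ORDER BY company_id, trade_date;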
Minor difference: The first row in each group returns 0, not 1 like in your example. Easily fixed, if you need it.
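One way to do that (a sketch, assuming "first row" means no earlier trade_date exists for the company): also count the prior rows in the subquery and special-case the empty set:

SELECT d.*
 , CASE WHEN x.ct = 0 THEN numeric '1' -- no prior rows: report 1 as in the example
 ELSE x.pctl END AS pctl_calc
FROM daily_data d, LATERAL (
 SELECT count(*) AS ct
 , round(percent_rank(d.daily_val) WITHIN GROUP (ORDER BY daily_val)::numeric
 , 6) AS pctl
 FROM daily_data
 WHERE company_id = d.company_id
 AND trade_date < d.trade_date
 ) x
ORDER BY company_id, trade_date;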
This would do the same job in Postgres 9.3:
SELECT d.*
, round(CASE WHEN x.ct = 0 THEN numeric '1' -- or 0 ?
ELSE x.ct_lt / x.ct END, 6) AS pctl_calc
FROM daily_data d, LATERAL (
SELECT count(daily_val) AS ct
, count(daily_val < d.daily_val OR NULL)::numeric AS ct_lt
FROM daily_data
WHERE company_id = d.company_id
AND trade_date < d.trade_date
) x
ORDER BY company_id, trade_date;
I use count(daily_val) for the full count, not count(*), to ignore NULL values - necessary if daily_val can be NULL.
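The count(condition OR NULL) idiom works because false OR NULL yields NULL, true OR NULL yields true, and count() only counts non-NULL input. A quick demonstration:

SELECT count(v < 3 OR NULL) AS ct_lt -- counts 1 and 2, returns 2
FROM (VALUES (1), (2), (3), (NULL::int)) t(v);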
Or, without a LATERAL join, for even older versions. May be faster, too:
SELECT company_id, trade_date, daily_val, id
, round(CASE WHEN ct = 0 THEN numeric '1' -- 0 ?
ELSE ct_lt / ct END, 6) AS pctl_calc
FROM (
SELECT d.*
, count(x.daily_val) AS ct
, count(x.daily_val < d.daily_val OR NULL)::numeric AS ct_lt
FROM daily_data d
LEFT JOIN daily_data x ON x.company_id = d.company_id
AND x.trade_date < d.trade_date
GROUP BY d.id -- pg 9.1+
) sub
ORDER BY company_id, trade_date;
Assuming id is the PK for the simplified GROUP BY; that requires pg 9.1, though.
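On versions before 9.1, or if id is not the PK, the GROUP BY has to list every output column of d instead - assuming the columns from the question, something like:

GROUP BY d.id, d.company_id, d.trade_date, d.daily_val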
- @mountainclimber: The condition on company_id = company_id was missing. Also, check if your values can be NULL ... (Erwin Brandstetter, Feb 16, 2017)
- I am sorry, my update was in a hurry and wrong. The answer was correct to begin with. Reverted the changes and made the code null-safe instead of just warning against possible NULL values. See the updated fiddle (if the site isn't overloaded). The only errors I see are rounding errors on your side. (Erwin Brandstetter, Feb 16, 2017)
- @mountainclimber: You might add a TABLESPACE on another disk and move some objects there. See: dba.stackexchange.com/q/62969/3684 (Erwin Brandstetter, Feb 17, 2017)
- For the 9.4+ version, to limit lookback (make it rolling), put something like and trade_date >= (d.trade_date - interval '20 years') in the WHERE clause; see the sketch after these comments. (mountainclimber11, May 24, 2017)
- Erwin - Do you think this query could be sped up via set max_parallel_workers_per_gather = 2, assuming I have enough resources to accommodate it? (mountainclimber11, Aug 10, 2017)
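For illustration, folding that rolling-lookback suggestion into the 9.4+ query would look like this (a sketch; the 20-year interval is just the commenter's example):

SELECT *
FROM daily_data d, LATERAL (
 SELECT round(percent_rank(d.daily_val) WITHIN GROUP (ORDER BY daily_val)::numeric
 , 6) AS pctl_calc
 FROM daily_data
 WHERE company_id = d.company_id
 AND trade_date < d.trade_date
 AND trade_date >= d.trade_date - interval '20 years'
 ) x
ORDER BY company_id, trade_date;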