I'm trying to find the best way to calculate percentiles for a dataset in bulk on Postgres, returning both the percentile and the value.
generate_series()
plus percentile_cont()
and lateral join will work, though this can be very slow with many percentiles, as percentile_cont() is called for each item in the series:
select percentile, value
from generate_series(0, 1, 0.2) as percentile,
lateral (select percentile_cont(percentile) WITHIN GROUP (ORDER BY SIZE)
as value from mytable
) values
This is much faster but doesn't return the percentile corresponding to each value:
select unnest(
percentile_cont(
(select array_agg(s) from generate_series(0, 1, 0.2) as s)
) WITHIN GROUP (ORDER BY SIZE))
from mytable
And trying to join that one with a second generate_series()
seems to require an ugly hack to get a column to join on:
select percentile, value
from (select row_number() OVER () as rownum, value
from (select unnest(
percentile_cont(
(select array_agg(s) from generate_series(0, 1, 0.2) as s)
) WITHIN GROUP (ORDER BY SIZE)) as value
from mytable) s2
) s3
join (select row_number() OVER () as rownum, percentile
from (select generate_series(0, 1, 0.2) as percentile) s) p
on p.rownum = s3.rownum
Or taking another approach, ntiles()
makes for a cleaner query, but values don't quite match with percentile_cont()
. (Maybe they'd match with percentile_disc()
?) And this is about half as fast as 2nd/3rd queries above.
select ((nt-1)/5::float) as percentile, min(size) as value
from (select size, ntile(6) over (order by size) as nt
from mytable
) as dt
group by nt
order by nt asc
I've gotten this far, but it's still not quite right, and something tells me there's a clean, performant approach out there. Any thoughts on how to improve on these?
EDIT: modifying Sven's answer slightly to return named columns:
SELECT unnest(
(select array_agg(fraction) from generate_series(0, 1, 0.2) AS fraction)
) as percentile,
unnest(
(select percentile_cont((select array_agg(s) from generate_series(0, 1, 0.2) as s)) WITHIN GROUP (ORDER BY SIZE) FROM mytable)
) as value;
2 Answers 2
The following query should get you both the percentile and the value
SELECT * FROM unnest(
(select array_agg(fraction) from generate_series(0, 1, 0.2) AS fraction),
(select percentile_cont((select array_agg(s) from generate_series(0, 1, 0.2) as s)) WITHIN GROUP (ORDER BY SIZE) FROM mytable)
);
-
This is great, I didn't think to try that. Thanks!evan_b– evan_b2018年08月13日 17:45:40 +00:00Commented Aug 13, 2018 at 17:45
Same answer as @Sven Klemm, but broken down using with
blocks:
WITH serie as (
SELECT array_agg(gs) as v
FROM generate_series(0, 1, 0.1) as gs
),
percentiles as (
SELECT
percentile_cont((select v from serie))
WITHIN GROUP (ORDER BY "SIZE") as p
FROM mytable
)
SELECT
unnest(s.v) * 100 as percentile,
unnest(p.p) as value
FROM percentiles p, serie s
Which gives something like this:
percentile | value
------------+-------
0 | 8.9
10.0 | 18.27
20.0 | 22.16
30.0 | 24.45
40.0 | 30.36
50.0 | 38.05
60.0 | 43.06
70.0 | 50.33
80.0 | 60.06
90.0 | 61.7
100.0 | 76.1