Bulk percentile calculation in Postgres

Question 1

I'm trying to find the best way to calculate percentiles for a dataset in bulk on Postgres, returning both the percentile and the value.

generate_series() plus percentile_cont() and lateral join will work, though this can be very slow with many percentiles, as percentile_cont() is called for each item in the series:

select percentile, value
from generate_series(0, 1, 0.2) as percentile,
lateral (select percentile_cont(percentile) WITHIN GROUP (ORDER BY SIZE) 
 as value from mytable
) values

This is much faster but doesn't return the percentile corresponding to each value:

select unnest(
percentile_cont(
 (select array_agg(s) from generate_series(0, 1, 0.2) as s)
) WITHIN GROUP (ORDER BY SIZE))
from mytable

And trying to join that one with a second generate_series() seems to require an ugly hack to get a column to join on:

select percentile, value
from (select row_number() OVER () as rownum, value
 from (select unnest(
 percentile_cont(
 (select array_agg(s) from generate_series(0, 1, 0.2) as s)
 ) WITHIN GROUP (ORDER BY SIZE)) as value
 from mytable) s2
 ) s3
join (select row_number() OVER () as rownum, percentile
 from (select generate_series(0, 1, 0.2) as percentile) s) p
 on p.rownum = s3.rownum

Or taking another approach, ntiles() makes for a cleaner query, but values don't quite match with percentile_cont(). (Maybe they'd match with percentile_disc()?) And this is about half as fast as 2nd/3rd queries above.

select ((nt-1)/5::float) as percentile, min(size) as value
from (select size, ntile(6) over (order by size) as nt
 from mytable
 ) as dt
group by nt
order by nt asc

I've gotten this far, but it's still not quite right, and something tells me there's a clean, performant approach out there. Any thoughts on how to improve on these?

EDIT: modifying Sven's answer slightly to return named columns:

SELECT unnest(
 (select array_agg(fraction) from generate_series(0, 1, 0.2) AS fraction)
) as percentile,
unnest(
 (select percentile_cont((select array_agg(s) from generate_series(0, 1, 0.2) as s)) WITHIN GROUP (ORDER BY SIZE) FROM mytable)
) as value;

Question 2

The following query should get you both the percentile and the value

SELECT * FROM unnest(
 (select array_agg(fraction) from generate_series(0, 1, 0.2) AS fraction),
 (select percentile_cont((select array_agg(s) from generate_series(0, 1, 0.2) as s)) WITHIN GROUP (ORDER BY SIZE) FROM mytable)
);

Question 3

This is great, I didn't think to try that. Thanks!

Question 4

Same answer as @Sven Klemm, but broken down using with blocks:

WITH serie as (
 SELECT array_agg(gs) as v 
 FROM generate_series(0, 1, 0.1) as gs
),
percentiles as (
 SELECT
 percentile_cont((select v from serie)) 
 WITHIN GROUP (ORDER BY "SIZE") as p
 FROM mytable
)
SELECT 
 unnest(s.v) * 100 as percentile, 
 unnest(p.p) as value
FROM percentiles p, serie s

Which gives something like this:

 percentile | value
------------+-------
 0 | 8.9
 10.0 | 18.27
 20.0 | 22.16
 30.0 | 24.45
 40.0 | 30.36
 50.0 | 38.05
 60.0 | 43.06
 70.0 | 50.33
 80.0 | 60.06
 90.0 | 61.7
 100.0 | 76.1

Sven Klemm Sven Klemm 1764 bronze badges · Accepted Answer · 2018-08-13 05:31:33Z

The following query should get you both the percentile and the value

SELECT * FROM unnest(
 (select array_agg(fraction) from generate_series(0, 1, 0.2) AS fraction),
 (select percentile_cont((select array_agg(s) from generate_series(0, 1, 0.2) as s)) WITHIN GROUP (ORDER BY SIZE) FROM mytable)
);

This is great, I didn't think to try that. Thanks!

evan_b
– evan_b

2018年08月13日 17:45:40 +00:00
Commented Aug 13, 2018 at 17:45

Stack Exchange Network

Bulk percentile calculation in Postgres

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Bulk percentile calculation in Postgres

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions