1

I'm trying to find the best way to calculate percentiles for a dataset in bulk on Postgres, returning both the percentile and the value.

generate_series() plus percentile_cont() and lateral join will work, though this can be very slow with many percentiles, as percentile_cont() is called for each item in the series:

select percentile, value
from generate_series(0, 1, 0.2) as percentile,
lateral (select percentile_cont(percentile) WITHIN GROUP (ORDER BY SIZE) 
 as value from mytable
) values

This is much faster but doesn't return the percentile corresponding to each value:

select unnest(
percentile_cont(
 (select array_agg(s) from generate_series(0, 1, 0.2) as s)
) WITHIN GROUP (ORDER BY SIZE))
from mytable

And trying to join that one with a second generate_series() seems to require an ugly hack to get a column to join on:

select percentile, value
from (select row_number() OVER () as rownum, value
 from (select unnest(
 percentile_cont(
 (select array_agg(s) from generate_series(0, 1, 0.2) as s)
 ) WITHIN GROUP (ORDER BY SIZE)) as value
 from mytable) s2
 ) s3
join (select row_number() OVER () as rownum, percentile
 from (select generate_series(0, 1, 0.2) as percentile) s) p
 on p.rownum = s3.rownum

Or taking another approach, ntiles() makes for a cleaner query, but values don't quite match with percentile_cont(). (Maybe they'd match with percentile_disc()?) And this is about half as fast as 2nd/3rd queries above.

select ((nt-1)/5::float) as percentile, min(size) as value
from (select size, ntile(6) over (order by size) as nt
 from mytable
 ) as dt
group by nt
order by nt asc

I've gotten this far, but it's still not quite right, and something tells me there's a clean, performant approach out there. Any thoughts on how to improve on these?

EDIT: modifying Sven's answer slightly to return named columns:

SELECT unnest(
 (select array_agg(fraction) from generate_series(0, 1, 0.2) AS fraction)
) as percentile,
unnest(
 (select percentile_cont((select array_agg(s) from generate_series(0, 1, 0.2) as s)) WITHIN GROUP (ORDER BY SIZE) FROM mytable)
) as value;
asked Aug 13, 2018 at 2:26

2 Answers 2

1

The following query should get you both the percentile and the value

SELECT * FROM unnest(
 (select array_agg(fraction) from generate_series(0, 1, 0.2) AS fraction),
 (select percentile_cont((select array_agg(s) from generate_series(0, 1, 0.2) as s)) WITHIN GROUP (ORDER BY SIZE) FROM mytable)
);
answered Aug 13, 2018 at 5:31
1
  • This is great, I didn't think to try that. Thanks! Commented Aug 13, 2018 at 17:45
0

Same answer as @Sven Klemm, but broken down using with blocks:

WITH serie as (
 SELECT array_agg(gs) as v 
 FROM generate_series(0, 1, 0.1) as gs
),
percentiles as (
 SELECT
 percentile_cont((select v from serie)) 
 WITHIN GROUP (ORDER BY "SIZE") as p
 FROM mytable
)
SELECT 
 unnest(s.v) * 100 as percentile, 
 unnest(p.p) as value
FROM percentiles p, serie s

Which gives something like this:

 percentile | value
------------+-------
 0 | 8.9
 10.0 | 18.27
 20.0 | 22.16
 30.0 | 24.45
 40.0 | 30.36
 50.0 | 38.05
 60.0 | 43.06
 70.0 | 50.33
 80.0 | 60.06
 90.0 | 61.7
 100.0 | 76.1
 
answered Aug 4, 2020 at 13:26

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.