
I have a big table with these fields:

CREATE UNLOGGED TABLE mytable (
    ts     NUMERIC(16,6) NOT NULL,
    value  INTEGER NOT NULL,
    fieldA INTEGER NOT NULL,
    fieldB INTEGER NOT NULL,
    ...
);

I am trying to find the values of fieldA which have the highest sum of value over a given time interval (1 hour, 6 hours or 1 day). In other words, for a given interval I would like to get the top 10 sums of value per fieldA, plus the matching value of this field.

A query says more than 1000 words:

SELECT
    fieldA, sum(value) "aggr"
FROM
    mytable
WHERE
    ts >= 1234567890 AND
    ts < 1234567890 + 24 * 3600
GROUP BY
    fieldA
ORDER BY
    aggr DESC
LIMIT 10;

Result:

 fieldA |    aggr
--------+------------
    140 | 3147666070
    127 | 2647653771
    182 | 1247401380
      3 | 1247372688
     68 | 1246742329
    227 | 1246433376
     54 | 1246100364
    243 | 1245971364
    191 | 1245912876
     62 | 1245818815
(10 rows)

This query returns correct results, but I am having trouble optimizing it. Note that I am running a similar query for fieldA, fieldB,... (around 5 fields) once per hour, once every 6 hours and once per day, with time intervals changed appropriately. When running it for 24h, I see spikes in server load which cause other queries to become unacceptably slow (it is true however that I'm running multiple such queries in parallel - will fix that).

There are currently around 30 million new records per day and I would like to make it capable of more, running on a single server (this is the reason for UNLOGGED - in this case I don't mind losing data on a crash). Not sure if it matters: while I am currently simply dropping and re-creating the table every 5-10 days, in the future I will change it to use partitions so that I can remove data older than ~3 days.
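
Something along these lines is what I have in mind for the partitioned layout (just a sketch; the partition names and bound values are illustrative, other fields are omitted, and the UNLOGGED flag goes on the partitions rather than on the parent):

CREATE TABLE mytable (
    ts     NUMERIC(16,6) NOT NULL,
    value  INTEGER NOT NULL,
    fieldA INTEGER NOT NULL,
    fieldB INTEGER NOT NULL
) PARTITION BY RANGE (ts);

-- One partition per day; bounds are epoch timestamps (illustrative values).
CREATE UNLOGGED TABLE mytable_d1 PARTITION OF mytable
    FOR VALUES FROM (1234483200) TO (1234569600);
CREATE UNLOGGED TABLE mytable_d2 PARTITION OF mytable
    FOR VALUES FROM (1234569600) TO (1234656000);

-- Removing data older than ~3 days then becomes a cheap DROP instead of a DELETE:
DROP TABLE mytable_d1;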

This is what the query plan looks like for 24h when there are around 31 hours' worth of data (~40M rows) in the table:

# explain analyze SELECT fieldA, sum(value) "aggr" FROM mytable WHERE ts >= 1000000000 AND ts < 1000086400 GROUP BY fieldA ORDER BY aggr DESC LIMIT 10;
 QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=739891.81..739891.84 rows=10 width=15) (actual time=16343.876..16348.483 rows=10 loops=1)
   ->  Sort  (cost=739891.81..739892.45 rows=253 width=15) (actual time=15876.302..15876.304 rows=10 loops=1)
         Sort Key: (sum(value)) DESC
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Finalize GroupAggregate  (cost=739822.25..739886.35 rows=253 width=15) (actual time=15875.512..15876.200 rows=253 loops=1)
               Group Key: fieldA
               ->  Gather Merge  (cost=739822.25..739881.29 rows=506 width=15) (actual time=15875.494..15880.493 rows=759 loops=1)
                     Workers Planned: 2
                     Workers Launched: 2
                     ->  Sort  (cost=738822.23..738822.86 rows=253 width=15) (actual time=15836.782..15836.810 rows=253 loops=3)
                           Sort Key: fieldA
                           Sort Method: quicksort  Memory: 36kB
                           Worker 0:  Sort Method: quicksort  Memory: 36kB
                           Worker 1:  Sort Method: quicksort  Memory: 36kB
                           ->  Partial HashAggregate  (cost=738809.60..738812.13 rows=253 width=15) (actual time=15836.524..15836.599 rows=253 loops=3)
                                 Group Key: fieldA
                                 ->  Parallel Seq Scan on mytable  (cost=0.00..669906.49 rows=13780622 width=11) (actual time=271.628..12076.394 rows=10439990 loops=3)
                                       Filter: ((ts >= '1000000000'::numeric) AND (ts < '1000086400'::numeric))
                                       Rows Removed by Filter: 3045010
 Planning Time: 0.227 ms
 JIT:
   Functions: 31
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 18.778 ms, Inlining 251.228 ms, Optimization 665.295 ms, Emission 365.024 ms, Total 1300.324 ms
 Execution Time: 16357.350 ms
(25 rows)

Is there some way I can optimize these kinds of queries? Even creating an index on ts didn't seem to help with this query (the index wasn't used - I assume because most of the rows matched the filter anyway).

Alternatively, can I organize data differently? I thought about aggregating it manually as it comes and just updating aggregated values, but it seems to me this would mean even more work for the database. I would appreciate some ideas.

asked Apr 13, 2020 at 17:04

1 Answer

When running it for 24h, I see spikes in server load which cause other queries to become unacceptably slow (it is true however that I'm running multiple such queries in parallel - will fix that).

Yeah, fix that and see if you still have a problem. Also disable parallelization within individual queries, by setting max_parallel_workers_per_gather = 0 for these queries. And consider looking at the other queries that are being slowed down, to see if they can be improved; that might be easier than improving these queries.
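
For example, something like this (a per-session sketch; max_parallel_workers_per_gather is a standard PostgreSQL setting, and the query is just the one from the question):

-- Turn off parallel workers for this session only, run the report, then restore.
SET max_parallel_workers_per_gather = 0;

SELECT fieldA, sum(value) "aggr"
FROM mytable
WHERE ts >= 1234567890 AND ts < 1234567890 + 24 * 3600
GROUP BY fieldA
ORDER BY aggr DESC
LIMIT 10;

RESET max_parallel_workers_per_gather;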

Alternatively, can I organize data differently? I thought about aggregating it manually as it comes and just updating aggregated values, but it seems to me this would mean even more work for the database.

It seems to me that the problem is not with the total amount of work, but rather that the distribution of the work over time is awkward. So I think this idea has a lot of merit.

Also, you could store the aggregates hourly (without the LIMIT), then just reaggregate those aggregates to higher time levels. This depends on fieldA, fieldB, etc. each having many fewer distinct values than the total dataset has rows.
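
A rough sketch of that idea (the table and column names are made up, and the hour boundary 1234566000 is just an example value):

-- Hypothetical hourly rollup, one row per (hour, fieldA); BIGINT because the
-- summed values already exceed the INTEGER range.
CREATE TABLE mytable_hourly_fieldA (
    hour_start BIGINT  NOT NULL,
    fieldA     INTEGER NOT NULL,
    value_sum  BIGINT  NOT NULL,
    PRIMARY KEY (hour_start, fieldA)
);

-- Once per hour, aggregate the hour that just finished (no LIMIT here).
INSERT INTO mytable_hourly_fieldA (hour_start, fieldA, value_sum)
SELECT 1234566000, fieldA, sum(value)
FROM mytable
WHERE ts >= 1234566000 AND ts < 1234566000 + 3600
GROUP BY fieldA;

-- The 6-hour and 24-hour top-10 reports then reaggregate the small rollup rows
-- instead of scanning the raw table.
SELECT fieldA, sum(value_sum) AS aggr
FROM mytable_hourly_fieldA
WHERE hour_start >= 1234483200 AND hour_start < 1234483200 + 24 * 3600
GROUP BY fieldA
ORDER BY aggr DESC
LIMIT 10;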

answered Apr 13, 2020 at 18:47
  • Yes, I was afraid of that... Unfortunately, at least with some of the fields the number of possible values is huge, but I might be able to do that with some of them if needed. Thank you for the hint about max_parallel_workers_per_gather, I wasn't aware of that - I have now set it to 0 (disabling parallelization). I have also ensured that the queries are serialized and that there is some sleep() between them, so - fingers crossed, we'll see tomorrow how it goes. Thank you, appreciate it! Commented Apr 13, 2020 at 19:48
