I have the following table:
Column | Type | Modifiers
id | bigint | not null default nextval('shares_id_seq'::regclass)
poolid | text | not null
blockheight | bigint | not null
networkdifficulty | double precision | not null
miner | text | not null
worker | text |
ipaddress | text | not null
created | timestamp without time zone | not null
useragent | text |
payoutinfo | text |
difficulty | double precision | not null default 0
Indexes:
"shares_pkey" PRIMARY KEY, btree (id)
"idx_shares_pool_block" btree (poolid, blockheight)
"idx_shares_pool_created" btree (poolid, created)
"idx_shares_pool_miner" btree (poolid, miner)
"idx_shares_pool_miner_diff" btree (poolid, miner, difficulty)
I cannot wrap my head around why the following query takes so long:
explain analyze SELECT SUM(difficulty) FROM shares WHERE poolid = 'xmr1' AND miner = '4BCeEPhodgPMbPWFN1dPwhWXdRX8q4mhhdZdA1dtSMLTLCEYvAj9QXjXAfF7CugEbmfBhgkqHbdgK9b2wKA6nqRZQCgvCDm';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=4150712.54..4150712.55 rows=1 width=8) (actual time=25490.101..25490.101 rows=1 loops=1)
-> Bitmap Heap Scan on shares (cost=389414.64..4143195.97 rows=3006629 width=8) (actual time=2499.409..24815.011 rows=7445802 loops=1)
Recheck Cond: ((poolid = 'xmr1'::text) AND (miner = '4BCeEPhodgPMbPWFN1dPwhWXdRX8q4mhhdZdA1dtSMLTLCEYvAj9QXjXAfF7CugEbmfBhgkqHbdgK9b2wKA6nqRZQCgvCDm'::text))
Rows Removed by Index Recheck: 55232916
Heap Blocks: exact=149273 lossy=2641988
-> Bitmap Index Scan on idx_shares_pool_miner (cost=0.00..388662.98 rows=3006629 width=0) (actual time=2449.977..2449.977 rows=7445802 loops=1)
Index Cond: ((poolid = 'xmr1'::text) AND (miner = '4BCeEPhodgPMbPWFN1dPwhWXdRX8q4mhhdZdA1dtSMLTLCEYvAj9QXjXAfF7CugEbmfBhgkqHbdgK9b2wKA6nqRZQCgvCDm'::text))
Planning time: 0.256 ms
Execution time: 25490.137 ms
(9 rows)
I specifically created the index idx_shares_pool_miner_diff for this scenario, yet it isn't even used. Is PostgreSQL unable to use the index because of MVCC visibility rules?
Update:
Running vacuum analyze shares
as suggested made PostgreSQL use the aforementioned index:
Aggregate (cost=546165.94..546165.95 rows=1 width=8) (actual time=2489.446..2489.447 rows=1 loops=1)
-> Index Only Scan using idx_shares_pool_miner_diff on shares (cost=0.69..537874.79 rows=3316458 width=8) (actual time=0.041..1924.396 rows=7461785 loops=1)
Index Cond: ((poolid = 'xmr1'::text) AND (miner = '4BCeEPhodgPMbPWFN1dPwhWXdRX8q4mhhdZdA1dtSMLTLCEYvAj9QXjXAfF7CugEbmfBhgkqHbdgK9b2wKA6nqRZQCgvCDm'::text))
Heap Fetches: 16575
Planning time: 0.122 ms
Execution time: 2489.477 ms
(6 rows)
I assume that's as fast as it gets without using parallel workers.
After reading the docs about index-only scans, it looks like I'm taking a performance hit from the constant heavy write activity on the shares table (> 200 inserts/sec), which causes the many heap fetches. Interestingly, the number of heap fetches apparently only goes up, never down.
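The heap fetches happen because an index-only scan must visit the heap for any page not marked all-visible in the visibility map, and only VACUUM sets those bits. On an insert-only table, autovacuum (before Postgres 13) rarely kicks in, since it is triggered by dead tuples. A minimal sketch of a workaround, assuming you can run it on a schedule (e.g. from cron):

```sql
-- Run periodically so the visibility map stays current and
-- index-only scans need fewer heap fetches. On an insert-only
-- table, autovacuum alone won't do this for you (pre-Postgres-13).
VACUUM (ANALYZE) shares;
```

The schedule is a judgment call: often enough to keep heap fetches low, but not so often that it competes with your 200 inserts/sec.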
1 Answer
If your table is INSERT-only, there are ways to get your sums (much) faster.
Assuming there is a column with monotonically increasing values (like id or created in your example), create a MATERIALIZED VIEW to pre-compute sums older than a (recent) given threshold, and then just add the sum of recent additions to it:
CREATE MATERIALIZED VIEW shares_summed AS
SELECT poolid, miner, SUM(difficulty) AS sum_diff
FROM shares
WHERE created < '2018-01-01 0:0'
GROUP BY poolid, miner
ORDER BY poolid, miner; -- optional, but to optimize some more
If there are many combinations of (poolid, miner), add a UNIQUE index on those columns. It is also required in order to REFRESH the view CONCURRENTLY, see below.
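That unique index could look like this (the index name is illustrative):

```sql
-- REFRESH MATERIALIZED VIEW ... CONCURRENTLY requires a unique
-- index covering all rows of the materialized view.
CREATE UNIQUE INDEX shares_summed_uq ON shares_summed (poolid, miner);
```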
Fetching a row from that table costs next to nothing. Then you only add recent additions to it:
SELECT sum(sum_diff) AS total_sum -- takes care of possible missing rows
FROM (
SELECT sum_diff
FROM shares_summed
WHERE poolid = 'xmr1'
AND miner = '4BCeEPhod...'
UNION ALL
SELECT SUM(difficulty)
FROM shares
WHERE poolid = 'xmr1'
AND miner = '4BCeEPhod...'
AND created >= '2018-01-01 0:0'
) sub;
You only have to adapt the threshold value from time to time and REFRESH the MV, best with the CONCURRENTLY option to avoid concurrency issues with your many INSERTs. Remember the threshold value and adapt your queries accordingly; you might store it in another table.
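A sketch of both steps; the helper table's name and layout are assumptions, not part of the original answer:

```sql
-- One-row table remembering the threshold the MV was built with,
-- so queries can use it instead of a hard-coded literal.
CREATE TABLE shares_summed_threshold (cutoff timestamp NOT NULL);
INSERT INTO shares_summed_threshold VALUES ('2018-01-01 0:0');

-- Rebuild the MV without blocking concurrent reads
-- (needs the unique index mentioned above).
REFRESH MATERIALIZED VIEW CONCURRENTLY shares_summed;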
To make it even faster, create a partitioned table and add another partition with pre-computed sums after every month (or whatever interval suits you). You can easily schedule a monthly cron job to automate this. The threshold is then always the start of the current month:
SELECT sum(sum_diff) AS total_sum
FROM (
SELECT SUM(difficulty) AS sum_diff
FROM shares_summed_master -- includes all partitions
WHERE poolid = 'xmr1'
AND miner = '4BCeEPhod...'
UNION ALL
SELECT SUM(difficulty)
FROM shares
WHERE poolid = 'xmr1'
AND miner = '4BCeEPhod...'
AND created >= date_trunc('month', now()) -- careful, current time zone affects it
) sub;
Make sure there is no overlap, and pre-computed sums are current.
Table partitioning was largely improved and simplified with Postgres 10. Be sure to run the latest point release; some corner-case bugs were fixed in Postgres 10.2.
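A minimal sketch of the partitioned summary table using Postgres 10 declarative partitioning; all table, partition, and column names here are illustrative, not prescribed by the answer:

```sql
-- Parent table holding pre-computed monthly sums, one partition per month.
CREATE TABLE shares_summed_master (
    poolid   text             NOT NULL,
    miner    text             NOT NULL,
    month    date             NOT NULL,
    sum_diff double precision NOT NULL
) PARTITION BY RANGE (month);

CREATE TABLE shares_summed_2018_01 PARTITION OF shares_summed_master
    FOR VALUES FROM ('2018-01-01') TO ('2018-02-01');

-- Monthly cron job: create the next partition, then fill the
-- just-finished month from the base table.
INSERT INTO shares_summed_2018_01
SELECT poolid, miner, '2018-01-01'::date, SUM(difficulty)
FROM   shares
WHERE  created >= '2018-01-01' AND created < '2018-02-01'
GROUP  BY poolid, miner;
```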
Tried both proposed solutions and then, out of curiosity, experimented with plain old triggers for insert and delete, which update corresponding rows in a pre-aggregation table. This worked out nicely so far and is a bit easier to manage. – Oliver Weichhold, Feb 17, 2018 at 14:26
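The trigger-based approach the comment describes could look roughly like this; the table, function, and trigger names are assumptions, not the commenter's actual code:

```sql
-- Pre-aggregation table maintained incrementally by triggers.
CREATE TABLE shares_agg (
    poolid   text NOT NULL,
    miner    text NOT NULL,
    sum_diff double precision NOT NULL DEFAULT 0,
    PRIMARY KEY (poolid, miner)
);

CREATE FUNCTION shares_agg_trg() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        -- Upsert: create the (poolid, miner) row or add to its sum.
        INSERT INTO shares_agg (poolid, miner, sum_diff)
        VALUES (NEW.poolid, NEW.miner, NEW.difficulty)
        ON CONFLICT (poolid, miner)
        DO UPDATE SET sum_diff = shares_agg.sum_diff + EXCLUDED.sum_diff;
        RETURN NEW;
    ELSE  -- DELETE
        UPDATE shares_agg
        SET    sum_diff = sum_diff - OLD.difficulty
        WHERE  poolid = OLD.poolid AND miner = OLD.miner;
        RETURN OLD;
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER shares_agg_upd
AFTER INSERT OR DELETE ON shares
FOR EACH ROW EXECUTE PROCEDURE shares_agg_trg();
```

One design caveat: at > 200 inserts/sec, concurrent transactions hitting the same (poolid, miner) row will serialize on the row-level lock, so this trades read speed for some write contention.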