I have the following table:
Column | Type | Modifiers
id | bigint | not null default nextval('shares_id_seq'::regclass)
poolid | text | not null
blockheight | bigint | not null
networkdifficulty | double precision | not null
miner | text | not null
worker | text |
ipaddress | text | not null
created | timestamp without time zone | not null
useragent | text |
payoutinfo | text |
difficulty | double precision | not null default 0
Indexes:
"shares_pkey" PRIMARY KEY, btree (id)
"idx_shares_pool_block" btree (poolid, blockheight)
"idx_shares_pool_created" btree (poolid, created)
"idx_shares_pool_miner" btree (poolid, miner)
"idx_shares_pool_miner_diff" btree (poolid, miner, difficulty)
I cannot wrap my head around why the following query takes so long:
explain analyze SELECT SUM(difficulty) FROM shares WHERE poolid = 'xmr1' AND miner = '4BCeEPhodgPMbPWFN1dPwhWXdRX8q4mhhdZdA1dtSMLTLCEYvAj9QXjXAfF7CugEbmfBhgkqHbdgK9b2wKA6nqRZQCgvCDm';
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=4150712.54..4150712.55 rows=1 width=8) (actual time=25490.101..25490.101 rows=1 loops=1)
-> Bitmap Heap Scan on shares (cost=389414.64..4143195.97 rows=3006629 width=8) (actual time=2499.409..24815.011 rows=7445802 loops=1)
Recheck Cond: ((poolid = 'xmr1'::text) AND (miner = '4BCeEPhodgPMbPWFN1dPwhWXdRX8q4mhhdZdA1dtSMLTLCEYvAj9QXjXAfF7CugEbmfBhgkqHbdgK9b2wKA6nqRZQCgvCDm'::text))
Rows Removed by Index Recheck: 55232916
Heap Blocks: exact=149273 lossy=2641988
-> Bitmap Index Scan on idx_shares_pool_miner (cost=0.00..388662.98 rows=3006629 width=0) (actual time=2449.977..2449.977 rows=7445802 loops=1)
Index Cond: ((poolid = 'xmr1'::text) AND (miner = '4BCeEPhodgPMbPWFN1dPwhWXdRX8q4mhhdZdA1dtSMLTLCEYvAj9QXjXAfF7CugEbmfBhgkqHbdgK9b2wKA6nqRZQCgvCDm'::text))
Planning time: 0.256 ms
Execution time: 25490.137 ms
(9 rows)
I specifically created the index idx_shares_pool_miner_diff for this scenario, yet it isn't even used. Is PostgreSQL unable to use the index because of MVCC visibility rules?
Update:
Running vacuum analyze shares
as suggested made PostgreSQL use the aforementioned index:
Aggregate (cost=546165.94..546165.95 rows=1 width=8) (actual time=2489.446..2489.447 rows=1 loops=1)
-> Index Only Scan using idx_shares_pool_miner_diff on shares (cost=0.69..537874.79 rows=3316458 width=8) (actual time=0.041..1924.396 rows=7461785 loops=1)
Index Cond: ((poolid = 'xmr1'::text) AND (miner = '4BCeEPhodgPMbPWFN1dPwhWXdRX8q4mhhdZdA1dtSMLTLCEYvAj9QXjXAfF7CugEbmfBhgkqHbdgK9b2wKA6nqRZQCgvCDm'::text))
Heap Fetches: 16575
Planning time: 0.122 ms
Execution time: 2489.477 ms
(6 rows)
I assume that's as fast as it gets without using parallel workers.
After reading the docs about index-only scans, it looks like I'm taking a performance hit from the constant heavy write activity on the shares table (> 200 inserts/sec), which causes the many heap fetches. Interestingly, the number of heap fetches apparently only goes up, never down.
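The heap fetches happen because an index-only scan must visit the heap for any page not marked all-visible in the visibility map, and only VACUUM sets those bits. On an insert-only table, autovacuum (before Postgres 13) rarely kicks in, since it is triggered by dead tuples. A minimal sketch of a workaround, assuming you can run it on a schedule (e.g. from cron):

```sql
-- Run periodically so the visibility map stays current and
-- index-only scans need fewer heap fetches. On an insert-only
-- table, autovacuum alone won't do this for you (pre-Postgres-13).
VACUUM (ANALYZE) shares;
```

The schedule is a judgment call: often enough to keep heap fetches low, but not so often that it competes with your 200 inserts/sec.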
1 Answer
If your table is INSERT-only, there are ways to get your sums (much) faster.
Assuming there is a column with monotonically increasing values (like id or created in your example), create a MATERIALIZED VIEW to pre-compute sums older than a (recent) given threshold, and then just add the sum of recent additions to it:
CREATE MATERIALIZED VIEW shares_summed AS
SELECT poolid, miner, SUM(difficulty) AS sum_diff
FROM shares
WHERE created < '2018-01-01 0:0'
GROUP BY poolid, miner
ORDER BY poolid, miner; -- optional, but to optimize some more
If there are many combinations of (poolid, miner), add a UNIQUE index on those columns. It is also required in order to REFRESH the view CONCURRENTLY, see below.
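That unique index could look like this (the index name is illustrative):

```sql
-- REFRESH MATERIALIZED VIEW ... CONCURRENTLY requires a unique
-- index covering all rows of the materialized view.
CREATE UNIQUE INDEX shares_summed_uq ON shares_summed (poolid, miner);
```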
Fetching a row from that table costs next to nothing. Then you only add recent additions to it:
SELECT sum(sum_diff) AS total_sum -- takes care of possible missing rows
FROM (
SELECT sum_diff
FROM shares_summed
WHERE poolid = 'xmr1'
AND miner = '4BCeEPhod...'
UNION ALL
SELECT SUM(difficulty)
FROM shares
WHERE poolid = 'xmr1'
AND miner = '4BCeEPhod...'
AND created >= '2018-01-01 0:0'
) sub;
You only have to adapt the threshold value from time to time and REFRESH the MV, best with the CONCURRENTLY option to avoid concurrency issues with your many INSERTs. Remember the threshold value and adapt your queries accordingly; you might store it in another table.
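A sketch of both steps; the helper table's name and layout are assumptions, not part of the original answer:

```sql
-- One-row table remembering the threshold the MV was built with,
-- so queries can use it instead of a hard-coded literal.
CREATE TABLE shares_summed_threshold (cutoff timestamp NOT NULL);
INSERT INTO shares_summed_threshold VALUES ('2018-01-01 0:0');

-- Rebuild the MV without blocking concurrent reads
-- (needs the unique index mentioned above).
REFRESH MATERIALIZED VIEW CONCURRENTLY shares_summed;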
To make it even faster, create a partitioned table and add another partition with pre-computed sums after every month (or whatever interval suits you). You can easily schedule a monthly cron job to automate this. The threshold is then always the start of the current month:
SELECT sum(sum_diff) AS total_sum
FROM (
SELECT SUM(difficulty) AS sum_diff
FROM shares_summed_master -- includes all partitions
WHERE poolid = 'xmr1'
AND miner = '4BCeEPhod...'
UNION ALL
SELECT SUM(difficulty)
FROM shares
WHERE poolid = 'xmr1'
AND miner = '4BCeEPhod...'
AND created >= date_trunc('month', now()) -- careful, current time zone affects it
) sub;
Make sure there is no overlap, and pre-computed sums are current.
Table partitioning was largely improved and simplified with Postgres 10. Be sure to run the latest point release; some corner-case bugs were fixed in Postgres 10.2.
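A minimal sketch of the partitioned summary table using Postgres 10 declarative partitioning; all table, partition, and column names here are illustrative, not prescribed by the answer:

```sql
-- Parent table holding pre-computed monthly sums, one partition per month.
CREATE TABLE shares_summed_master (
    poolid   text             NOT NULL,
    miner    text             NOT NULL,
    month    date             NOT NULL,
    sum_diff double precision NOT NULL
) PARTITION BY RANGE (month);

CREATE TABLE shares_summed_2018_01 PARTITION OF shares_summed_master
    FOR VALUES FROM ('2018-01-01') TO ('2018-02-01');

-- Monthly cron job: create the next partition, then fill the
-- just-finished month from the base table.
INSERT INTO shares_summed_2018_01
SELECT poolid, miner, '2018-01-01'::date, SUM(difficulty)
FROM   shares
WHERE  created >= '2018-01-01' AND created < '2018-02-01'
GROUP  BY poolid, miner;
```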
Tried both proposed solutions and then, out of curiosity, experimented with plain old triggers for insert and delete, which update corresponding rows in a pre-aggregation table. This worked out nicely so far and is a bit easier to manage. – Oliver Weichhold, Feb 17, 2018 at 14:26
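The trigger-based approach the comment describes could look roughly like this; the table, function, and trigger names are assumptions, not the commenter's actual code:

```sql
-- Pre-aggregation table maintained incrementally by triggers.
CREATE TABLE shares_agg (
    poolid   text NOT NULL,
    miner    text NOT NULL,
    sum_diff double precision NOT NULL DEFAULT 0,
    PRIMARY KEY (poolid, miner)
);

CREATE FUNCTION shares_agg_trg() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        -- Upsert: create the (poolid, miner) row or add to its sum.
        INSERT INTO shares_agg (poolid, miner, sum_diff)
        VALUES (NEW.poolid, NEW.miner, NEW.difficulty)
        ON CONFLICT (poolid, miner)
        DO UPDATE SET sum_diff = shares_agg.sum_diff + EXCLUDED.sum_diff;
        RETURN NEW;
    ELSE  -- DELETE
        UPDATE shares_agg
        SET    sum_diff = sum_diff - OLD.difficulty
        WHERE  poolid = OLD.poolid AND miner = OLD.miner;
        RETURN OLD;
    END IF;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER shares_agg_upd
AFTER INSERT OR DELETE ON shares
FOR EACH ROW EXECUTE PROCEDURE shares_agg_trg();
```

One design caveat: at > 200 inserts/sec, concurrent transactions hitting the same (poolid, miner) row will serialize on the row-level lock, so this trades read speed for some write contention.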