Wednesday, 24 September 2025
Parallel Apply of Large Transactions
Enabling Parallel Apply
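Parallel apply of streamed in-progress transactions is requested when the subscription is created, via the streaming option. A minimal sketch (the connection string and object names are illustrative):
CREATE SUBSCRIPTION sub
    CONNECTION 'host=publisher port=5432 dbname=postgres'
    PUBLICATION pub
    WITH (streaming = parallel);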
NOTICE: created replication slot "sub" on publisher
CREATE SUBSCRIPTION
SELECT subname, substream FROM pg_subscription;
 subname | substream
---------+-----------
 sub     | p
(1 row)
You can control the degree of parallelism using the max_parallel_apply_workers_per_subscription parameter. By default, this is set to 2, allowing two large transactions to be applied in parallel per subscription.
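For example, to allow up to four parallel apply workers per subscription (an illustrative value), set the parameter on the subscriber and reload the configuration:
ALTER SYSTEM SET max_parallel_apply_workers_per_subscription = 4;
SELECT pg_reload_conf();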
How Parallel Apply Works
Parallel apply significantly improves the efficiency of large transaction handling by not waiting for the COMMIT message from the publisher before applying changes. Previously, the subscriber would write all streamed data to temporary files and apply them only after receiving the commit record.
Architecture Overview
When the first stream of a large transaction arrives, the leader apply worker (LA) assigns a parallel apply worker (PA)—if available. The LA sends changes to the PA via a dedicated shared memory queue (shm_mq). The PA applies the changes and waits for the next stream. This worker remains assigned until the transaction is committed.
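Both kinds of workers are visible in pg_stat_subscription; a parallel apply worker reports the PID of its leader in the leader_pid column (a sketch, assuming PostgreSQL 16 or later):
SELECT subname, pid, leader_pid
FROM pg_stat_subscription;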
At commit time, the LA waits for the PA to finish to ensure commit order is preserved, which is critical to avoid:
- Transaction dependencies: e.g., inserting a row in one transaction and updating it in another.
- Deadlocks: e.g., transactions updating the same rows/tables in reverse order.
Locking Considerations
Parallel apply introduces potential deadlocks, especially when transactions are independent on the publisher but become interdependent on the subscriber due to schema differences (e.g., constraints, indexes).
To detect such deadlocks between LA and PA workers, PostgreSQL uses heavy-weight locks during:
- PA waiting for the next stream
- LA waiting for PA to finish
Example 1: Deadlock Due to Conflicting Changes
Consider a scenario where a table lacks a unique key on the publisher but has one on the subscriber. If:
- PA is applying TX-1
- LA is applying TX-2
...and both transactions conflict due to the unique constraint, a deadlock can occur:
- LA waits for a lock on the unique index
- PA waits for the next stream from LA
- → Deadlock cycle: LA → PA → LA
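A schema difference that can produce this is as simple as the following sketch (table and column names are illustrative):
-- On the publisher: no unique constraint on the table
CREATE TABLE t (a int);
-- On the subscriber: the same table, but with a unique index
CREATE TABLE t (a int);
CREATE UNIQUE INDEX t_a_unique ON t (a);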
PostgreSQL detects and reports such deadlocks. Users can skip the problematic transaction using the ALTER SUBSCRIPTION ... SKIP command.
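For example, using the finish LSN of the failed transaction as reported in the subscriber's server log (the LSN below is illustrative):
ALTER SUBSCRIPTION sub SKIP (lsn = '0/14C0378');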
Example 2: Deadlock Due to shm_mq Buffer Full
In a variant of the previous example:
- PA-1 and PA-2 apply TX-1 and TX-2 respectively.
- PA-2 waits for PA-1 to finish.
- PA-1 waits for input from LA.
- If the shm_mq buffer between LA and PA-2 is full, LA cannot send messages.
To avoid this:
- LA uses non-blocking writes with a timeout.
- If the timeout is exceeded, LA serializes pending messages to a file.
- PA-2 reads from the file, freeing LA to continue processing PA-1.
This mechanism prevents deadlocks caused by full shm_mq buffers.
Performance
What’s Next?
Sunday, 10 May 2020
Improved (auto)vacuum in PostgreSQL 13
Wednesday, 19 February 2020
Parallelism, what next?
Friday, 25 May 2018
Parallel Index Scans In PostgreSQL
There is a lot to say about parallelism in PostgreSQL. We have come a long way since I wrote my first post on this topic (Parallel Sequential Scans). Each of the past three releases (including PG-11, currently in beta) has had parallel query as a major feature, which in itself shows how useful the feature is and how much work is being done on it. You can read more about parallel query in the PostgreSQL docs or in a blog post on this topic by my colleague Robert Haas. The intent of this blog post is to talk about parallel index scans, which were released in PostgreSQL 10. Currently, parallel scans are supported only for btree indexes.
To demonstrate how the feature works, here is an example of TPC-H Q-6 at scale factor 20 (which means approximately a 20GB database). Q6 is a forecasting revenue change query. This query quantifies the amount of revenue increase that would have resulted from eliminating certain company-wide discounts in a given percentage range in a given year. This type of "what if" query can be used to look for ways to increase revenue.
explain analyze
select sum(l_extendedprice * l_discount) as revenue
from lineitem
where l_shipdate >= date '1994-01-01'
  and l_shipdate < date '1994-01-01' + interval '1' year
  and l_discount between 0.06 - 0.01 and 0.06 + 0.01
  and l_quantity < 24
LIMIT 1;
Non-parallel version of plan
-------------------------------------
Limit
-> Aggregate
-> Index Scan using idx_lineitem_shipdate on lineitem
Index Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01'::date))
Execution Time: 35073.886 ms
Parallel version of plan
-------------------------------
Limit
-> Finalize Aggregate
-> Gather
-> Partial Aggregate
-> Parallel Index Scan using idx_lineitem_shipdate on lineitem
Index Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01'::date))
Execution Time: 15545.794 ms
We can see that the execution time is reduced by more than half with a parallel plan using two parallel workers. This query filters out many rows, and the work (CPU time) to perform that is divided among the workers (and the leader), leading to reduced time.
To further see the impact with the number of workers, we used a somewhat bigger dataset (scale_factor = 50). The setup was done using a TPC-H-like benchmark for PostgreSQL. We also created a few additional indexes on columns (l_shipmode, l_shipdate, o_orderdate, o_comment).
Non-default parameter settings:
random_page_cost = seq_page_cost = 0.1
effective_cache_size = 10GB
shared_buffers = 8GB
work_mem = 1GB
The time is reduced almost linearly up to 8 workers and then it reduces more slowly. A further increase in workers won't help unless the data to scan increases.
We have further evaluated the parallel index scan feature for all the queries in the TPC-H benchmark and found that it is used in a number of queries, and the impact is positive (it reduced the execution time significantly). Below are the results for TPC-H at scale factor 20 with the number of parallel workers set to 2. The X-axis indicates the query (1: Q-6, 2: Q-14, 3: Q-18).
Under the Hood
A new GUC min_parallel_index_scan_size has been introduced, which indicates the minimum amount of index data that must be scanned for a parallel scan to be considered. Users can try changing the value of this parameter to see whether a parallel index plan is effective for their queries. The number of parallel workers is decided based on the number of index pages to be scanned. The final cost of the parallel plan accounts for the fact that the CPU cost to process the rows is divided equally among the workers.
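For example, you can lower it to make parallel index scans eligible for smaller indexes (the value below is illustrative; the default is 512kB):
SET min_parallel_index_scan_size = '64kB';
explain select sum(l_extendedprice * l_discount) from lineitem
where l_shipdate >= date '1994-01-01';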
In the end, I would like to thank the people (Rahila Syed and Robert Haas) who were involved in this work (along with me) and my employer EnterpriseDB, which has supported this work. I would also like to thank Rafia Sabih, who helped me with the performance testing for this blog.
Monday, 5 March 2018
zheap: a storage engine to provide better control over bloat
Friday, 17 March 2017
Hash indexes are faster than Btree indexes?
PostgreSQL has supported hash indexes for a long time, but they are not much used in production, mainly because they are not durable. Now, with the next version of PostgreSQL, they will be durable. The immediate question is how they perform compared to btree indexes. A lot of work has been done in the coming version to make them faster. There are multiple ways to compare the performance of hash and btree indexes, such as the time taken for index creation, search, or insertion into the index. This blog will mainly focus on the search operation. By definition, hash indexes are O(1) and btree indexes are O(log n); however, with duplicates that is not exactly true.
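As a quick refresher, a hash index is created with USING hash and serves equality lookups (names below are illustrative):
CREATE TABLE accounts (aid int, abalance int);
CREATE INDEX accounts_aid_hash ON accounts USING hash (aid);
-- equality searches like this one are what hash indexes are built for
SELECT abalance FROM accounts WHERE aid = 42;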
To start with, let us see the impact of the work done to improve the performance of hash indexes. Below is performance data for the pgbench read-only workload, comparing the performance of hash indexes between 9.6 and HEAD on IBM POWER-8 with 24 cores, 192 hardware threads, and 492GB of RAM.
The workload is such that all the data fits in shared buffers (the scale factor is 300 (~4.5GB) and shared_buffers is 8GB). As we can see from the above graph, performance has increased at all client counts, in the range of 7% to 81%, and the impact is more pronounced at higher client counts. The main work that led to this improvement is 6d46f478 (Improve hash index bucket split behavior.) and 293e24e5 (Cache hash index's metapage in rel->rd_amcache.).
The first commit, 6d46f478, changed the heavyweight locks (locks that are used for logical database objects to ensure the database ACID properties) to lightweight locks (locks that protect shared data structures) for scanning the bucket pages. In general, acquiring a heavyweight lock is costlier than acquiring a lightweight lock. In addition to reducing the locking cost, this also avoids locking out scans and inserts for the lifetime of a split.
The second commit, 293e24e5, avoids a significant amount of contention on the metapage. Each search operation needs to access the metapage to find the bucket that contains the tuple being searched, which leads to high contention around the metapage, and each access to the metapage in turn goes through the buffer manager. This work avoids that contention by caching the metapage information in a backend-local cache, which bypasses all the buffer-manager-related work and hence the major contention in accessing the metapage.
The next graph shows how the hash index performs compared to the btree index. In this run, we changed the hash index to a btree index in the pgbench read-only tests.
We can see here that the hash index performs better than the btree index, with a performance difference in the range of 10% to 22%. In some other workloads we have seen even better performance, such as with hash indexes on varchar columns, and in the community it has been reported that there is a performance improvement in the range of 40-60% when hash indexes are used for unique index columns.
The important thing to note about the above data is that it covers only some specific workloads, and it mainly covers selects, as that is the main area where performance improvement work has been done for PostgreSQL 10. The other interesting parameters to compare are the size of the index and updates on the index, which need more study and experiments.
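For instance, index sizes can be compared with pg_relation_size (a sketch that reuses the illustrative table from above and adds a hypothetical btree counterpart):
CREATE INDEX accounts_aid_btree ON accounts USING btree (aid);
SELECT relname, pg_size_pretty(pg_relation_size(oid)) AS size
FROM pg_class
WHERE relname IN ('accounts_aid_hash', 'accounts_aid_btree');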
In the end, I would like to thank my colleagues who were directly involved in this work and my employer EnterpriseDB, which has supported this work. Firstly, I would like to thank Robert Haas, who envisioned all this work and is the committer of it, and Mithun C Y, who was the author of commit 293e24e5. I would also like to extend sincere thanks to all the community members who were involved in this work, especially Jeff Janes and Jesper Pedersen, who reviewed and tested it.
Sunday, 29 November 2015
Parallel Sequential Scans in play
Parallelism is now a reality in PostgreSQL. With 9.6, I hope we will see many different forms of queries that can use parallelism to execute. For now, I will limit this discussion to what we can already do, which is Parallel Sequential Scans.
Parallel Sequential Scans are used to scan a relation in parallel with the help of background workers, which in turn improves the performance of such scans. I will discuss the scenarios where users can expect a performance boost from this feature later in this blog, but first let us understand the basic feature and how it works. Three new GUC parameters have been added to tune its usage.
max_parallel_degree - This is used to set the maximum number of workers that can be used for an individual parallel operation. It is quite possible that the requested number of workers is not available at execution time. Parallel workers are taken from the pool of processes established by max_worker_processes, which means the value of max_parallel_degree should be less than max_worker_processes. It is usually not useful to set this parameter higher than the number of CPUs on your system.
parallel_tuple_cost - This is used by the planner to estimate the cost of transferring a tuple from a parallel worker process to the master backend. The default is 0.1. The more tuples that need to be passed from the worker processes to the master backend, the higher this cost becomes, and with it the overall cost of the parallel sequential scan plan.
parallel_setup_cost - This is used by the planner to estimate the cost of launching parallel worker processes and setting up the dynamic shared memory used for communication. The default is 1000.
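The current values of these parameters can be inspected with a query like the following (a sketch against pg_settings):
SELECT name, setting, short_desc
FROM pg_settings
WHERE name IN ('max_parallel_degree', 'parallel_tuple_cost', 'parallel_setup_cost');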
Now let us see a simple example to demonstrate how parallel sequential scan works:
create table tbl_parallel_test(c1 int, c2 char(1000));
insert into tbl_parallel_test values(generate_series(1,1000000),'aaaaa');
Analyze tbl_parallel_test;
Explain analyze select * from tbl_parallel_test where c1 < 10000 and
c2 like '%bb%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Seq Scan on tbl_parallel_test
(cost=0.00..157858.09 rows=1 width=1008)
(actual time=378.414..378.414 rows=0 loops=1)
Filter: ((c1 < 10000) AND (c2 ~~ '%bb%'::text))
Rows Removed by Filter: 1000000
Planning time: 0.075 ms
Execution time: 378.431 ms
(5 rows)
Set the max parallel degree to enable the use of parallelism in queries.
set max_parallel_degree = 6;
Explain analyze select * from tbl_parallel_test where c1 < 10000
and c2 like '%bb%';
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Gather (cost=1000.00..29701.57 rows=1 width=1008)
(actual time=182.708..182.708 rows=0 loops=1)
Number of Workers: 5
-> Parallel Seq Scan on tbl_parallel_test
(cost=0.00..28701.47 rows=1 width=1008)
(actual time=179.496..1081.120 rows=0 loops=1)
Filter: ((c1 < 10000) AND (c2 ~~ '%bb%'::text))
Rows Removed by Filter: 1000000
Planning time: 0.078 ms
Execution time: 200.610 ms
(7 rows)
Here, we can see how changing max_parallel_degree allows the use of parallel workers to perform parallel sequential scans. Notice that even though we set max_parallel_degree to 6, only 5 workers are used; the reason is that currently the number of parallel workers is chosen based on the size of the relation.
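The relation size that drives the worker count can be checked with, for example:
SELECT pg_size_pretty(pg_relation_size('tbl_parallel_test'));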
Next, let us discuss the usage of functions in parallel queries. A new PARALLEL clause has been added to the CREATE FUNCTION statement. There are three valid values that can be used with this clause.
1. PARALLEL Unsafe - This indicates that the function can't be executed in parallel mode, and the presence of such a function in a SQL statement forces a serial execution plan.
2. PARALLEL Restricted - This indicates that the function can be executed in parallel mode, but the execution is restricted to the parallel group leader. As of now, if the qualification for any particular relation has anything that is parallel restricted, that relation won't be chosen for parallelism.
3. PARALLEL Safe - This indicates that the function is safe to run in parallel mode without restriction.
The default value for a function is PARALLEL Unsafe.
Now let us see the impact of using Parallel Safe and Unsafe functions in queries. I will continue using the query from the previous example to explain the concept.
Create a Parallel Safe function:
create or replace function calc_factorial(a integer, fact_val integer)
returns integer
as $$
begin
    perform (fact_val)!;
    return a;
end;
$$ language plpgsql PARALLEL Safe;
Use it in a query:
Explain analyze select * from tbl_parallel_test where
c1 < calc_factorial(10000, 10)
and c2 like '%bb%';
QUERY PLAN
--------------------------------------------------------------------------------
Gather (cost=1000.00..75154.99 rows=1 width=1008)
(actual time=120566.456..120566.456 rows=0 loops=1)
Number of Workers: 5
-> Parallel Seq Scan on tbl_parallel_test
(cost=0.00..74154.89 rows=1 width=1008)
(actual time=119635.421..359721.498 rows=0 loops=1)
Filter: ((c2 ~~ '%bb%'::text) AND (c1 < calc_factorial(10000, 10)))
Rows Removed by Filter: 1000000
Planning time: 54.904 ms
Execution time: 120622.631 ms
(7 rows)
Here we can see that a parallel plan is chosen and that the parallel-safe function is pushed to the workers for evaluation of the quals.
Now let's mark the function as Parallel Unsafe and see how the above query behaves.
Alter Function calc_factorial(integer, integer) PARALLEL Unsafe;
Explain analyze select * from tbl_parallel_test where
c1 < calc_factorial(10000, 10)
and c2 like '%bb%';
QUERY PLAN
--------------------------------------------------------------------------------
Seq Scan on tbl_parallel_test
(cost=0.00..407851.91 rows=1 width=1008)
(actual time=33166.138..33166.138 rows=0 loops=1)
Filter: ((c2 ~~ '%bb%'::text) AND (c1 < calc_factorial(10000, 10)))
Rows Removed by Filter: 1000000
Planning time: 0.162 ms
Execution time: 33166.208 ms
(5 rows)
So using parallel unsafe functions in queries leads to serial plans.
Next, let us see the performance characteristics of parallelism:
Non-default settings used to collect performance data:
shared_buffers=32GB; min_wal_size=5GB; max_wal_size=10GB
checkpoint_timeout =30min; max_connections=300;
max_worker_processes=100;
Test setup
create table tbl_perf(c1 int, c2 char(1000));
insert into tbl_perf values(generate_series(1,30000000),'aaaaa');
Explain analyze select c1 from tbl_perf where
c1 > calc_factorial(<value>, 10) and
c2 like '%aa%';
The function calc_factorial is the same as used in the previous example, and the values passed to it are such that the desired percentage of rows can be selected. For example:
-- to select 1% of rows, the below query can be used
Explain analyze select c1 from tbl_perf where
c1 > calc_factorial(29700000,10) and
c2 like '%aa%';
-- to select 10% of rows, the below query can be used
Explain analyze select c1 from tbl_perf where
c1 > calc_factorial(27000000,10) and
c2 like '%aa%';
-- to select 25% of rows, the below query can be used
Explain analyze select c1 from tbl_perf where
c1 > calc_factorial(22500000,10) and
c2 like '%aa%';
Performance Data:
1. With an increase in the degree of parallelism (more parallel workers), the time to complete the execution reduces.
2. Along with the workers, the master backend also participates in the execution, due to which you can see a greater time reduction in some cases.
3. After a certain point, increasing the max parallel degree won't help.
The cases we have seen in this blog are mostly cases where parallel query helps by using the workers; however, there exist some cases, such as when the qualification is very cheap, where it hurts or doesn't help even by employing more workers. More investigation is needed to make sure that the planner won't choose such plans for parallelism.
Wednesday, 13 May 2015
Extend Tar Format in pg_basebackup
Friday, 17 April 2015
Write Scalability in PostgreSQL
so writing this blog post.