
Wednesday, 24 September 2025

Parallel Apply of Large Transactions

Logical replication in PostgreSQL has steadily evolved since its introduction in version 10. In a previous blog post, I discussed how PostgreSQL 14 introduced streaming of large transactions. PostgreSQL 16 took this further by enabling parallel apply of large transactions via a non-default subscription option. Now, with PostgreSQL 18, parallel apply is the default behavior—marking a significant milestone in replication performance and scalability.

In this post, we’ll explore how parallel apply works for large transactions and touch upon future plans for small transactions.

Enabling Parallel Apply

From PostgreSQL 18 onwards, parallel apply is enabled by default. To explicitly configure it, users can set the streaming parameter to parallel when creating a subscription:

postgres=# CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres' PUBLICATION pub WITH (streaming = parallel);
NOTICE: created replication slot "sub" on publisher
CREATE SUBSCRIPTION
postgres=# SELECT subname, substream FROM pg_subscription;
subname | substream
---------+-----------
sub | p
(1 row)

You can control the degree of parallelism using the max_parallel_apply_workers_per_subscription parameter. By default, this is set to 2, allowing two large transactions to be applied in parallel per subscription.
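
For example, one could raise this limit by setting the parameter in postgresql.conf or, as sketched below, via ALTER SYSTEM followed by a configuration reload (the value 4 is only illustrative):

postgres=# ALTER SYSTEM SET max_parallel_apply_workers_per_subscription = 4;
ALTER SYSTEM
postgres=# SELECT pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)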

How Parallel Apply Works

Parallel apply significantly improves the efficiency of large transaction handling by not waiting for the COMMIT message from the publisher before applying changes. Previously, the subscriber would write all streamed data to temporary files and apply them only after receiving the commit record.




Architecture Overview

When the first stream of a large transaction arrives, the leader apply worker (LA) assigns a parallel apply worker (PA)—if available. The LA sends changes to the PA via a dedicated shared memory queue (shm_mq). The PA applies the changes and waits for the next stream. This worker remains assigned until the transaction is committed.

At commit time, the LA waits for the PA to finish to ensure commit order is preserved, which is critical to avoid:

  • Transaction dependencies: e.g., inserting a row in one transaction and updating it in another.
  • Deadlocks: e.g., transactions updating the same rows/tables in reverse order.

Locking Considerations

Parallel apply introduces potential deadlocks, especially when transactions are independent on the publisher but become interdependent on the subscriber due to schema differences (e.g., constraints, indexes).

To detect such deadlocks between LA and PA workers, PostgreSQL uses heavy-weight locks during:

  • PA waiting for the next stream
  • LA waiting for PA to finish

Example 1: Deadlock Due to Conflicting Changes

Consider a scenario where a table lacks a unique key on the publisher but has one on the subscriber. If:

  • PA is applying TX-1
  • LA is applying TX-2

...and both transactions conflict due to the unique constraint, a deadlock can occur:

  • LA waits for a lock on the unique index
  • PA waits for the next stream from LA
  • Deadlock cycle: LA → PA → LA

PostgreSQL detects and reports such deadlocks. Users can skip a problematic transaction using the ALTER SUBSCRIPTION ... SKIP command.
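
For example, assuming the server log reports the failed transaction's finish LSN, the skip could look like this (the LSN below is hypothetical):

postgres=# ALTER SUBSCRIPTION sub SKIP (lsn = '0/14C0378');
ALTER SUBSCRIPTION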

Example 2: Deadlock Due to shm_mq Buffer Full

In a variant of the previous example:

  • PA-1 and PA-2 apply TX-1 and TX-2 respectively.
  • PA-2 waits for PA-1 to finish.
  • PA-1 waits for input from LA.
  • If the shm_mq buffer between LA and PA-2 is full, LA cannot send messages.

To avoid this:

  • LA uses non-blocking writes with a timeout.
  • If the timeout is exceeded, LA serializes pending messages to a file.
  • PA-2 reads from the file, freeing LA to continue processing PA-1.

This mechanism prevents deadlocks caused by full shm_mq buffers.

Performance

Setup: A synchronous logical replication system, with publisher → subscriber on the same machine.

Non-default GUC settings:
shared_buffers = 100GB
checkpoint_timeout = 30min
max_wal_size = 20GB
min_wal_size = 10GB
autovacuum = off
synchronous_commit = remote_apply
logical_decoding_work_mem = 30MB

Workload: 1 million to 10 million inserts on a single table
Time measured: Time taken from insert to commit
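
A rough sketch of this kind of measurement on the publisher (the table name is just for illustration; for synchronous replication, the subscription must also be listed in synchronous_standby_names):

\timing on
BEGIN;
INSERT INTO large_tab SELECT i FROM generate_series(1, 1000000) AS i;
COMMIT;

With synchronous_commit = remote_apply, the COMMIT returns only after the subscriber has applied the changes, so the reported times include the replication path.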


Results:
Elapsed time improved by ~2X with parallel apply.
As the number of tuples increases, parallel apply provides greater benefit since transactions are applied immediately, significantly reducing replication lag between publisher and subscriber.

What’s Next?

With parallel apply now the default for large transactions, the next step is to extend this capability to small transactions. This will further enhance logical replication performance and scalability in PostgreSQL. The community has already started discussing it.

Sunday, 10 May 2020

Improved (auto)vacuum in PostgreSQL 13

Vacuum is one of the sub-systems in PostgreSQL that gets improved with each release. I have checked the past five releases and each has quite a few improvements for vacuum. Following the trend, there are a number of vacuum improvements in the upcoming PostgreSQL release (v13), which are covered in this blog.

Improvement-1
---------------------
Vacuum will be allowed to process indexes in parallel. This means it can leverage multiple CPUs to perform index cleanup. Internally, this uses PostgreSQL's background worker infrastructure to accomplish the work. There is a new option PARALLEL which controls the parallelism used by vacuum. Users can use it to specify the number of workers for the VACUUM command, which is limited by the minimum of (a) the number of indexes on the table and (b) max_parallel_maintenance_workers. Parallelism for the VACUUM command is enabled by default, which means it will be used even if the user didn't specify the PARALLEL option, and in that case the number of workers equals the number of indexes on the table being vacuumed. We can disable parallelism for vacuum by specifying zero as the number of workers with the PARALLEL option. An index can participate in parallel vacuum only if its size is greater than min_parallel_index_scan_size.

The PARALLEL option can't be used with the FULL option in the VACUUM command. This feature is not available via autovacuum; users need to run the VACUUM command to get its benefit.
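
For completeness, parallelism can be disabled for a single run by specifying zero workers; the table name below is just a placeholder:

postgres=# VACUUM (PARALLEL 0) mytab;
VACUUM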

There are a number of other blogs [1][2] written on this topic which show the benefit of this feature. A recent blog published by EnterpriseDB shows that vacuum can be approximately 4 times faster when using 7 workers, especially when the relation is in dire need of vacuum; read that blog for more information about the test.

Here is a simple test to show the usage of this feature.
postgres=# create table pvac(c1 int, c2 text DEFAULT md5(random()::text), c3 text DEFAULT md5(random()::text));
CREATE TABLE
postgres=# create index pvac_1 on pvac(c1);
CREATE INDEX
postgres=# create index pvac_2 on pvac(c2);
CREATE INDEX
postgres=# create index pvac_3 on pvac(c3);
CREATE INDEX
postgres=# insert into pvac select i FROM generate_series(1,100000) as i;
INSERT 0 100000
postgres=# update pvac set c1=c1;
UPDATE 100000

postgres=# vacuum (parallel 4, verbose) pvac;
INFO: vacuuming "public.pvac"
INFO: launched 2 parallel vacuum workers for index vacuuming (planned: 2)
INFO: scanned index "pvac_1" to remove 100000 row versions
..
..

In the above test, 2 parallel workers were used even though we specified 4. The reason is that the table has 3 indexes and the number of workers can't be more than the number of indexes, which is further limited by max_parallel_maintenance_workers, as you can see with the command below.

postgres=# show max_parallel_maintenance_workers;
max_parallel_maintenance_workers
----------------------------------
2
(1 row)

The other important thing we have ensured in this feature is that parallelism won't use more memory or I/O bandwidth than a non-parallel vacuum.

Improvement-2
---------------------
Allow inserts to trigger autovacuum activity. This feature will be really helpful for insert-only tables, where an anti-wraparound vacuum could otherwise be the first vacuum that the table ever receives, and such a run would take a really long time. This allows heap pages to be set as all-visible, which then allows index-only scans to skip heap fetches, and reduces the work necessary when the table needs to be frozen. This is controlled by two new GUCs and reloptions: autovacuum_vacuum_insert_threshold and autovacuum_vacuum_insert_scale_factor.

Let me demonstrate the benefit of this feature with the help of an example. Start the server with autovacuum_vacuum_insert_threshold = -1 (one can edit the postgresql.conf file to change the value of this parameter or use the ALTER SYSTEM command, as sketched below). By connecting with psql, we can execute the below commands to see the behavior.
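
One way to change the parameter without editing postgresql.conf is ALTER SYSTEM followed by a configuration reload (a sketch):

postgres=# ALTER SYSTEM SET autovacuum_vacuum_insert_threshold = -1;
ALTER SYSTEM
postgres=# SELECT pg_reload_conf();
 pg_reload_conf
----------------
 t
(1 row)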

postgres=# show autovacuum_vacuum_insert_threshold;
autovacuum_vacuum_insert_threshold
------------------------------------
-1
(1 row)

postgres=# create table vac_ins(c1 int, c2 char(500));
CREATE TABLE
postgres=# create index idx_vac_ins on vac_ins(c1);
CREATE INDEX
postgres=# insert into vac_ins values(generate_series(1,20000),'aaaaaa');
INSERT 0 20000

After a few seconds, you can notice the below message in the server log which shows that autovacuum has performed analyze on the table.

LOG: automatic analyze of table "postgres.public.vac_ins" system usage: CPU: user: 0.03 s, system: 0.11 s, elapsed: 0.15 s

After that, run below command:

postgres=# explain (analyze) select c1 from vac_ins where c1 < 100;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Index Only Scan using idx_vac_ins on vac_ins (cost=0.29..16.02 rows=99 width=4) (actual time=0.019..0.092 rows=99 loops=1)
Index Cond: (c1 < 100)
Heap Fetches: 99
Planning Time: 0.269 ms
Execution Time: 0.129 ms
(5 rows)

Here, "Heap Fetches: 99" shows that the above query needs to visit the heap to fetch the required information even though it is present in the index and the scan type used is Index Only Scan.

postgres=# truncate vac_ins;
TRUNCATE TABLE

Now, restart the server with a default value of autovacuum_vacuum_insert_threshold and execute below commands from psql to see how the new feature helps:

postgres=# show autovacuum_vacuum_insert_threshold;
autovacuum_vacuum_insert_threshold
------------------------------------
1000
(1 row)

postgres=# insert into vac_ins values(generate_series(1,20000),'aaaaaa');
INSERT 0 20000

After a few seconds, you can notice the below message in the server log which indicates that autovacuum has performed both vacuum and analyze on the table.
LOG: automatic vacuum of table "postgres.public.vac_ins": index scans: 0
pages: 0 removed, 1334 remain, 0 skipped due to pins, 0 skipped frozen
...
LOG: automatic analyze of table "postgres.public.vac_ins" system usage: CPU: user: 0.01 s, system: 0.15 s, elapsed: 0.21 s

After that, run below command:

postgres=# explain (analyze) select c1 from vac_ins where c1 < 100;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Index Only Scan using idx_vac_ins on vac_ins (cost=0.29..6.02 rows=99 width=4) (actual time=0.016..0.039 rows=99 loops=1)
Index Cond: (c1 < 100)
Heap Fetches: 0
Planning Time: 0.166 ms
Execution Time: 0.074 ms
(5 rows)

Here, "Heap Fetches: 0" shows that the above query doesn't need to visit heap to fetch the required information. We can see that the time to complete the execution is reduced significantly in this case.

Improvement-3
---------------------
Allow an (auto)vacuum to display additional information about the heap or index in case of an error. This feature can help users when the database has some corruption. For example, if one of the indexes on a relation has some corrupted data (due to bad hardware or a bug), it will let the user know the index information, and the user can take appropriate action, like reindexing or dropping and recreating that particular index, to overcome the problem. In the case of the heap, it displays the block number for which the error occurred, which makes it much easier for users and developers to detect the problem. In the worst case, if a particular block in a table is corrupted, users can remove all the rows of that block and the table can be used again.
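
For the index case, the recovery mentioned above could look like the sketch below (reusing the pvac_1 index from the earlier example); the heap case is demonstrated next with an induced error.

postgres=# REINDEX INDEX pvac_1;
REINDEX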

vacuum pvac;
ERROR: error induced to demonstrate use of information
CONTEXT: while scanning block 2469 of relation "public.pvac"

Next, I used "SELECT * FROM heap_page_items(get_raw_page('pvac', 2469));" to find the information of all line pointers in the page and then removed them using the below query.

postgres=# delete from pvac where ctid Between '(2469,1)' and '(2469,11)';
DELETE 11

Then, I again ran vacuum

postgres=# vacuum pvac;
ERROR: error induced to demonstrate use of information
CONTEXT: while scanning block 2468 of relation "public.pvac"

Now, you can see the block number has changed from 2469 to 2468, which means the vacuum could proceed. In this case, I manually induced the error by changing the code, so it occurs for every block, but in reality it would occur for some particular block(s), and once the user gets rid of those block(s), the table can be used again. I don't want to say that this is an ideal situation, but at least it allows users to proceed, and it can help developers narrow down the bug if there is one in the code.

Improvement-4
---------------------
Autovacuum will now log WAL usage statistics along with other information. The WAL usage contains the total number of records, the number of full-page images, and the total number of bytes. The buffer usage and WAL usage stats combined give us the approximate I/O usage of a particular autovacuum run. Below is the information from one example run; refer to "WAL usage" for the newly added stats.

LOG: automatic vacuum of table "postgres.public.pvac": index scans: 1
pages: 0 removed, 2470 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 100000 removed, 100000 remain, 0 are dead but not yet removable, oldest xmin: 529
buffer usage: 9276 hits, 4 misses, 3 dirtied
avg read rate: 0.062 MB/s, avg write rate: 0.047 MB/s
system usage: CPU: user: 0.40 s, system: 0.00 s, elapsed: 0.50 s
WAL usage: 7909 records, 2 full page images, 2276323 bytes

Improvement-5
---------------------
Make vacuum buffer counters 64 bits wide to avoid overflow of buffer usage stats. Without this feature, in extreme cases, if there exist tables large enough for 4 billion buffer accesses to be a possibility, the stats displayed are meaningless. See the below example of 'buffer usage' stats from pgsql-hackers:

LOG: automatic vacuum of table "somtab.sf.foobar": index scans: 17
pages: 0 removed, 207650641 remain, 0 skipped due to pins, 13419403 skipped frozen
tuples: 141265419 removed, 3186614627 remain, 87783760 are dead but not yet removable
buffer usage: -2022059267 hits, -17141881 misses, 1252507767 dirtied
avg read rate: -0.043 MB/s, avg write rate: 3.146 MB/s
system usage: CPU 107819.92s/2932957.75u sec elapsed 3110498.10 sec

Improvement-6
---------------------
Add a wait event 'VacuumDelay' to report on cost-based vacuum delay. This will help us monitor [auto]vacuum throttling. It is a small but quite important feature, as right now we have no way to monitor [auto]vacuum throttling, and it is not uncommon to see it in user environments.
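
A quick way to check whether a vacuum worker is currently being throttled is to look for this wait event in pg_stat_activity, for example:

SELECT pid, backend_type, wait_event_type, wait_event
FROM pg_stat_activity
WHERE wait_event = 'VacuumDelay';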

Wednesday, 19 February 2020

Parallelism, what next?

This blog post is about the journey of parallelism in PostgreSQL till now and what is in store for the future. Since PostgreSQL 9.6, where the first parallel query features arrived, each release has improved it. Below is a brief overview of the parallel query features added in each release.

PG9.6 has added Parallel execution of sequential scans, joins, and aggregates.

PG10 has added (a) Support parallel B-tree index scans, (b) Support parallel bitmap heap scans, (c) Allow merge joins to be performed in parallel, (d) Allow non-correlated subqueries to be run in parallel, (e) Improve ability of parallel workers to return pre-sorted data and (f) Increase parallel query usage in procedural language functions.

PG11 has added (a) Allow parallel building of a btree index, (b) Allow hash joins to be performed in parallel using a shared hash table, (c) Allow parallelization of commands CREATE TABLE ... AS, SELECT INTO, and CREATE MATERIALIZED VIEW, (d) Allow UNION to run each SELECT in parallel if the individual SELECTs cannot be parallelized, (e) Allow partition scans to more efficiently use parallel workers, (f) Allow LIMIT to be passed to parallel workers, this allows workers to reduce returned results and use targeted index scans, (g) Allow single-evaluation queries, e.g. WHERE clause aggregate queries, and functions in the target list to be parallelized.

PG12 has added Allow parallelized queries when in SERIALIZABLE isolation mode.

PG13 development continues this progress with respect to parallelism. Some of the important advancements are:
(a) Parallel vacuum - This feature allows the vacuum to leverage multiple CPUs in order to process indexes. This enables us to perform index vacuuming and index cleanup with background workers. This adds a PARALLEL option to VACUUM command where the user can specify the number of workers that can be used to perform the command which is limited by the number of indexes on a table. Specifying zero as a number of workers will disable parallelism. For more information, see commit.

(b) Improve EXPLAIN's handling of per-worker details. This allows displaying the worker information in a much better way. A few visible side-effects, as mentioned in the commit:

* In text format, instead of something like

Sort Method: external merge Disk: 4920kB
Worker 0: Sort Method: external merge Disk: 5880kB
Worker 1: Sort Method: external merge Disk: 5920kB
Buffers: shared hit=682 read=10188, temp read=1415 written=2101
Worker 0: actual time=130.058..130.324 rows=1324 loops=1
Buffers: shared hit=337 read=3489, temp read=505 written=739
Worker 1: actual time=130.273..130.512 rows=1297 loops=1
Buffers: shared hit=345 read=3507, temp read=505 written=744

you get

Sort Method: external merge Disk: 4920kB
Buffers: shared hit=682 read=10188, temp read=1415 written=2101
Worker 0: actual time=130.058..130.324 rows=1324 loops=1
Sort Method: external merge Disk: 5880kB
Buffers: shared hit=337 read=3489, temp read=505 written=739
Worker 1: actual time=130.273..130.512 rows=1297 loops=1
Sort Method: external merge Disk: 5920kB
Buffers: shared hit=345 read=3507, temp read=505 written=744

(c) Avoid unnecessary shm writes in Parallel Hash Join. This improves the performance of Parallel Hash Join by a significant amount on large systems running many-core joins. Though this work has been back-patched to v11 where Parallel Hash Join was introduced, I mentioned it here as it is done during PG13 development. For more information, see commit.

What is being discussed for the future:
(a) Parallel grouping sets - PostgreSQL already supports parallel aggregation by aggregating in two stages. First, each process participating in the parallel portion of the query performs an aggregation step, producing a partial result for each group of which that process is aware. Second, the partial results are transferred to the leader via the Gather node. Finally, the leader re-aggregates the results across all workers in order to produce the final result.

Next, there has been a discussion in the community to parallelize queries containing grouping sets in much the same way as we do parallel aggregation.
Basically, the aim is to parallelize queries like SELECT brand, size, sum(sales) FROM items_sold GROUP BY GROUPING SETS ((brand), (size), ());
This feature has been proposed for PG13, but is not yet committed.

(b) Parallel copy - We also want to parallelize the COPY command, in particular the "COPY <table_name> FROM .. ;" form. This will help improve bulk load operations in PostgreSQL. Currently, we do a lot of work during the COPY command. We read the file in 64KB chunks, then find the line endings and process that data line by line, where each line corresponds to one tuple. We first form the tuple (in the form of value/null arrays) from that line, check whether it qualifies the WHERE condition and, if it qualifies, perform constraint checks and a few other checks and finally store it in a local tuple array. Once we reach 1000 tuples or have consumed 64KB (whichever occurs first), we insert them together, and then for each tuple we insert into the index(es) and execute after-row triggers. The aim of this work is to parallelize as much as possible of the work done during the copy. There is an ongoing discussion in the community on this topic.
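
For reference, the command form being targeted looks like the sketch below (table name and file path are hypothetical); the optional WHERE clause corresponds to the per-line qualification mentioned above.

COPY lineitem FROM '/path/to/lineitem.csv' WITH (FORMAT csv) WHERE l_quantity < 24;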

There is a small caveat here: to achieve parallel copy, we need to work on the relation extension lock, where parallel workers would block each other while extending the relation, which is not a problem currently. There is already a discussion on this topic in the community.

(c) Parallel file_fdw - The proposed work in this area allows file_fdw to divide its scan up for parallel workers, much like a parallel seq scan.

There are more areas where parallelism can be used, like parallel DMLs (inserts, updates, deletes). During a discussion with Thomas Munro, it came up that it would be beneficial if we could parallelize index creation and index scans for indexes other than btree, especially gin and gist. Note that we already support parallel index scans and parallel index creation for btree. I won't go into the details of these operations, as we haven't seen any proposal for them so far. Similarly, we can improve a few things in our current parallel infrastructure: (a) as of now, for each query the parallel workers are created and destroyed; instead, we could have a pool of parallel query workers to avoid the overhead of starting them for each query, and (b) as of now, each worker can use up to work_mem of memory, which might increase the overall memory usage of a query. We might want to improve this, but currently there is no proposal for it.

Friday, 25 May 2018

Parallel Index Scans In PostgreSQL


There is a lot to say about parallelism in PostgreSQL. We have come a long way since I wrote my first post on this topic (Parallel Sequential Scans). Each of the past three releases (including PG-11, which is in beta) has had parallel query as a major feature, which in itself says how useful the feature is and how much work is being done on it. You can read more about parallel query in the PostgreSQL docs or in a blog post on this topic by my colleague Robert Haas. The intent of this blog post is to talk about parallel index scans, which were released in PostgreSQL 10. Currently, parallel scans are supported only for btree indexes.

To demonstrate how the feature works, here is an example of TPC-H Q-6 at scale factor - 20 (which means approximately 20GB database). Q6 is a forecasting revenue change query. This query quantifies the amount of revenue increase that would have resulted from eliminating certain company-wide discounts in a given percentage range in a given year. Asking this type of "what if" query can be used to look for ways to increase revenues.

explain analyze
select sum(l_extendedprice * l_discount) as revenue
from lineitem
where l_shipdate >= date '1994-01-01' and
l_shipdate < date '1994-01-01' + interval '1' year and
l_discount between 0.02 - 0.01 and 0.02 + 0.01 and
l_quantity < 24
LIMIT 1;

Non-parallel version of plan
-------------------------------------
Limit
-> Aggregate
-> Index Scan using idx_lineitem_shipdate on lineitem
Index Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01
00:00:00'::timestamp without time zone) AND (l_discount >= 0.01) AND
(l_discount <= 0.03) AND (l_quantity < '24'::numeric))
Planning Time: 0.406 ms
Execution Time: 35073.886 ms

Parallel version of plan
-------------------------------
Limit
-> Finalize Aggregate
-> Gather
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate
-> Parallel Index Scan using idx_lineitem_shipdate on lineitem
Index Cond: ((l_shipdate >= '1994-01-01'::date) AND (l_shipdate < '1995-01-01
00:00:00'::timestamp without time zone) AND (l_discount >= 0.01) AND
(l_discount <= 0.03) AND (l_quantity < '24'::numeric))
Planning Time: 0.420 ms
Execution Time: 15545.794 ms

We can see that the execution time is reduced by more than half for a parallel plan with two parallel workers. This query filters many rows and the work (CPU time) to perform that is divided among workers (and leader), leading to reduced time.

To further see the impact with the number of workers, we have used a somewhat bigger dataset (scale_factor = 50). The setup has been done using a TPC-H like benchmark for PostgreSQL. We have also created a few additional indexes on columns (l_shipmode, l_shipdate, o_orderdate, o_comment).

Non-default parameter settings:
random_page_cost = seq_page_cost = 0.1
effective_cache_size = 10GB
shared_buffers = 8GB
work_mem = 1GB




The time reduces almost linearly up to 8 workers and then reduces more slowly. A further increase in workers won't help unless the amount of data to scan increases.

We have further evaluated the parallel index scan feature for all the queries in the TPC-H benchmark and found that it is used in a number of queries with a positive impact (it reduced the execution time significantly). Below are results for TPC-H, scale factor 20, with the number of parallel workers set to 2. The X-axis indicates (1: Q-6, 2: Q14, 3: Q18).


Under the Hood
The basic idea is quite similar to parallel heap scans, where each worker (including the leader whenever possible) will scan a block (all the tuples in a block) and then get the next block that is required to be scanned. The parallelism is implemented at the leaf level of a btree. The first worker to start a btree scan will scan till it reaches the leaf, and the others will wait till the first worker has reached the leaf. Once the first worker has read the leaf block, it sets the next block to be read and wakes one of the workers waiting to scan blocks; it then proceeds to scan tuples from the block it has read. Henceforth, each worker, after reading a block, sets the next block to be read and wakes up the next waiting worker. This continues till no more pages are left to scan, at which point we end the parallel scan and notify all the workers.

A new GUC, min_parallel_index_scan_size, has been introduced, which indicates the minimum amount of index data that must be scanned for a parallel scan to be considered. Users can try changing the value of this parameter to see if a parallel index plan is effective for their queries. The number of parallel workers is decided based on the number of index pages to be scanned. The final cost of the parallel plan accounts for the fact that the CPU cost to process the rows will be divided equally among workers.
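
To experiment, one can check the current threshold and lower it for a session; this is just a sketch, and 512kB is the default value:

postgres=# SHOW min_parallel_index_scan_size;
postgres=# SET min_parallel_index_scan_size = '128kB';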

In the end, I would like to thank the people (Rahila Syed and Robert Haas) who were involved in this work (along with me) and my employer EnterpriseDB who has supported this work. I would also like to thank Rafia Sabih who helped me in doing performance testing for this blog.

Monday, 5 March 2018

zheap: a storage engine to provide better control over bloat

In the past few years, PostgreSQL has advanced a lot in terms of features, performance, and scalability for many-core systems. However, one of the problems that many enterprises still complain about is that its size increases over time, which is commonly referred to as bloat. PostgreSQL has a mechanism known as autovacuum wherein a dedicated process (or set of processes) tries to remove the dead rows from the relation in an attempt to reclaim the space, but it can't completely reclaim the space in many cases. In particular, it always creates a new version of a tuple on an update, which must eventually be removed by periodic vacuuming or by HOT-pruning, but in many cases the space is never reclaimed completely. A similar problem occurs for tuples that are deleted. This leads to bloat in the database. My colleague Robert Haas has discussed some such cases in his blog DO or UNDO - there is no VACUUM, where the PostgreSQL heap tends to bloat, and has also mentioned the solution (zheap: a new storage format for PostgreSQL) on which EnterpriseDB is working to avoid the bloat whenever possible. The intent of this blog post is to elaborate on that work in some more detail and show some results.

This project has three major objectives:

1. Provide better control over bloat. zheap will prevent bloat (a) by allowing in-place updates in common cases and (b) by reusing space as soon as a transaction that has performed a delete or non-in-place update has committed. In short, with this new storage, whenever possible, we’ll avoid creating bloat in the first place.

2. Reduce write amplification both by avoiding rewrites of heap pages and by making it possible to do an update that touches indexed columns without updating every index.

3. Reduce the tuple size by (a) shrinking the tuple header and (b) eliminating most alignment padding.

In this blog post, I will mainly focus on the first objective (Provide better control over bloat) and leave other things for future blog posts on this topic.

In-place updates will be supported except when (a) the new tuple is larger than the old tuple and the increase in size makes it impossible to fit the larger tuple onto the same page, or (b) some column is modified which is covered by an index that has not been modified to support "delete-marking". Note that the work to support delete-marking in indexes is yet to start, and we intend to support it at least for btree indexes. For in-place updates, we have to write the old tuple to the undo log and the new tuple to the zheap, which helps concurrent readers read the old tuple from undo if the latest tuple is not yet visible to them.

Deletes write the complete tuple in the undo record, even though we could get away with just writing the TID as we do for an insert operation. This allows us to reuse the space occupied by the deleted record as soon as the transaction that performed the operation commits. Basically, if the delete is not yet visible to some concurrent transaction, that transaction can read the tuple from undo, while in the zheap we can reclaim the space occupied by the record immediately (as soon as the transaction commits).

Below are some graphs that compare the size of heap and zheap tables when the table is constantly updated and there is a concurrent long-running transaction. To perform these tests, we have used pgbench to initialize the data (at scale factor 1000) and then used the simple-update test (which comprises one update, one select, and one insert) to perform updates. You can refer to the PostgreSQL manual for more about how to use pgbench. These tests have been performed on a machine with an x86_64 architecture, 2 sockets, 14 cores per socket, 2 threads per core, and 64GB of RAM. The non-default configuration for the tests is shared_buffers=32GB, min_wal_size=15GB, max_wal_size=20GB, checkpoint_timeout=1200, maintenance_work_mem=1GB, checkpoint_completion_target=0.9, synchronous_commit=off. The below graphs show the size of the table on which this test has performed updates.
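
For reference, a rough sketch of the pgbench invocations for this kind of test (the client count and duration below are illustrative, and the long-running transaction is simply kept open in a separate session for the stated duration):

pgbench -i -s 1000 postgres
pgbench -N -c 8 -j 8 -T 1500 postgres

Here -N runs the built-in simple-update script.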





In the above test, we can see that the initial size of the table was 13GB in heap and 11GB in zheap. After running the test for 25 minutes (out of which there was an open transaction for the first 15 minutes), the size in heap grows to 16GB at the 8-client count and to 20GB at the 64-client count, whereas for zheap the size remains at 11GB for both client counts at the end of the test. The initial size of zheap is smaller because the tuple header is smaller in zheap. Now, certainly for the first 15 minutes autovacuum can't reclaim any space due to the open transaction, but it can't reclaim it even after the open transaction has ended. On the other hand, the size of zheap remains constant and all the undo data generated is removed within seconds of the transaction ending.

Below are some more tests where the transaction has been kept open for a much longer duration.

After running the test for 40 minutes (out of which there was an open transaction for the first 30 minutes), the size in heap grows to 19GB at the 8-client count and to 26GB at the 64-client count, whereas for zheap the size remains at 11GB for both client counts at the end of the test, and all the undo generated during the test gets discarded within a few seconds after the open transaction is ended.

After running the test for 55 minutes (out of which there was an open transaction for the first 45 minutes), the size in heap grows to 22GB at the 8-client count and to 28GB at the 64-client count, whereas for zheap the size remains at 11GB for both client counts at the end of the test, and all the undo generated during the test gets discarded within a few seconds after the open transaction is ended.

So from all the above three tests, it is clear that the size of heap keeps growing as the duration of the concurrent long-running transaction increases. It was 13GB at the start of the test, and grew to 20GB, then to 26GB, then to 28GB at the 64-client count as the duration of the open transaction increased from 15 minutes to 30 minutes and then to 45 minutes. We have done a few more tests along the above lines and found that as the duration of the open transaction increases, the size of heap keeps increasing whereas zheap remains constant. For example, similar to the above, if we keep the transaction open for 60 minutes in a 70-minute test, the size of heap increases to 30GB. The increase in size also depends on the number of updates happening as part of the test.

The above results show not only the impact on size; we also noticed that the TPS (transactions per second) with zheap is always better (up to ~45%) for the above tests. In similar tests on another high-end machine, we saw much better results with zheap with respect to performance. I would like to defer the details about the raw performance of zheap vs. heap to another blog post, as this one has already become big. I would like to mention that the above results don't mean that zheap will be better than heap in all cases. For example, rollbacks will be costlier in zheap. Just to be clear, this storage format is proposed as another format alongside the current heap, so that users can decide which storage they want to use for their use case.

The code for this project has been published and is proposed as a feature for PG-12 to PostgreSQL community. Thanks to Kuntal Ghosh for doing the performance tests mentioned in this blog post.

Friday, 17 March 2017

Hash indexes are faster than Btree indexes?


PostgreSQL has supported hash indexes for a long time, but they are not much used in production, mainly because they are not durable. Now, with the next version of PostgreSQL, they will be durable. The immediate question is how they perform as compared to btree indexes. A lot of work has been done in the coming version to make them faster. There are multiple ways in which we can compare the performance of hash and btree indexes, like the time taken for creation of the index, or for search or insertion in the index. This blog will mainly focus on the search operation. By definition, hash indexes are O(1) and btree indexes are O(log n); however, with duplicates that is not exactly true.

To start with, let us see the impact of the work being done to improve the performance of hash indexes. Below is the performance data of the pgbench read-only workload comparing the performance of hash indexes between 9.6 and HEAD on IBM POWER-8 having 24 cores, 192 hardware threads, and 492GB RAM.
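
For context, one plausible way to set up such a run (the index name and pgbench flags below are my assumptions, not the exact test harness) is to replace the default btree primary key on pgbench_accounts with a hash index and then run the select-only script:

ALTER TABLE pgbench_accounts DROP CONSTRAINT pgbench_accounts_pkey;
CREATE INDEX pgbench_accounts_aid_hash ON pgbench_accounts USING hash (aid);

followed by something like pgbench -S -c 64 -j 64 -T 300 postgres.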




The workload is such that all the data fits in shared buffers (the scale factor is 300 (~4.5GB) and shared_buffers is 8GB). As we can see from the above graph, the performance has increased at all client counts, in the range of 7% to 81%, and the impact is more pronounced at higher client counts. The main work which has led to this improvement is 6d46f478 (Improve hash index bucket split behavior.) and 293e24e5 (Cache hash index's metapage in rel->rd_amcache.).

The first commit, 6d46f478, changed the heavyweight locks (locks that are used for logical database objects to ensure the database's ACID properties) to lightweight locks (locks that protect shared data structures) for scanning the bucket pages. In general, acquiring a heavyweight lock is costlier as compared to a lightweight lock. In addition to reducing the locking cost, this also avoids locking out scans and inserts for the lifetime of the split.

The second commit, 293e24e5, avoids a significant amount of contention for accessing the metapage. Each search operation needs to access the metapage to find the bucket that contains the tuple being searched, which leads to high contention around the metapage. Each access to the metapage also needs to go through the buffer manager. This work avoids that contention by caching the metapage information in a backend-local cache, which helps bypass all the buffer-manager-related work and hence the major contention in accessing the metapage.


The next graph shows how the hash index performs as compared to the btree index. In this run, we have changed the hash index to a btree index in the pgbench read-only tests.



We can see here that the hash index performs better than the btree index, with the performance difference in the range of 10 to 22%. In some other workloads we have seen better performance, like with hash indexes on varchar columns, and even in the community it has been reported that there is a performance improvement in the range of 40-60% when hash indexes are used for unique index columns.


The important thing to note about the above data is that it covers only some specific workloads, and it mainly covers selects, as that is the main area where performance improvement work has been done for PostgreSQL 10. Other interesting parameters to compare are the size of the index and updates on the index, which need more study and experiments.

In the end, I would like to thank my colleagues who were directly involved in this work and my employer EnterpriseDB, which has supported this work. Firstly, I would like to thank Robert Haas, who envisioned all this work and is the committer of this work, and Mithun C Y, who was the author of commit 293e24e5. I would also like to extend sincere thanks to all the community members who were involved in this work, especially Jeff Janes and Jesper Pedersen, who reviewed and tested it.

Sunday, 29 November 2015

Parallel Sequential Scans in play


Parallelism is now a reality in PostgreSQL. With 9.6, I hope we will see many
different forms of queries that can use parallelism to execute. For now, I will
limit this discussion to what we can already do, which is Parallel Sequential
Scans.

Parallel Sequential Scans are used to scan a relation in parallel with the help of
background workers, which in turn improves the performance of such scans. I
will discuss the scenarios where users can expect a performance boost due to
this feature later in this blog, but first let us understand the basic feature
and how it works. Three new GUC parameters have been added to tune the
usage of this feature.

max_parallel_degree - This is used to set the maximum number of workers that
can be used for an individual parallel operation. It is quite possible that the
requested number of workers is not available at execution time. Parallel workers
are taken from the pool of processes established by max_worker_processes, which
means that the value of max_parallel_degree should be less than max_worker_processes.
It might not be useful to set the value of this parameter to more than the number
of CPUs on your system.

parallel_tuple_cost - This is used by the planner to estimate the cost of transferring a
tuple from a parallel worker process to the master backend. The default is 0.1. The more
tuples that need to be passed from worker backend processes to the master backend
process, the higher this cost will be, and the higher the overall cost of the
parallel sequential scan plan.

parallel_setup_cost - This is used by the planner to estimate the cost of launching parallel
worker processes and setting up dynamic shared memory for communication.
The default is 1000.

Now let us see the simple example to demonstrate how parallel sequential scan works:
 create table tbl_parallel_test(c1 int, c2 char(1000)); 
 insert into tbl_parallel_test values(generate_series(1,1000000),'aaaaa'); 
 Analyze tbl_parallel_test; 
 Explain analyze select * from tbl_parallel_test where c1 < 10000 and 
 c2 like '%bb%'; 
 QUERY PLAN 
 ------------------------------------------------------------------------------------------------------------- 
 Seq Scan on tbl_parallel_test 
 (cost=0.00..157858.09 rows=1 width=1008) 
 (actual time=378.414..378.414 rows=0 loops=1) 
 Filter: ((c1 < 10000) AND (c2 ~~ '%bb%'::text)) 
 Rows Removed by Filter: 1000000 
 Planning time: 0.075 ms 
 Execution time: 378.431 ms 
 (5 rows) 

Set the max parallel degree to enable the use of parallelism in queries.
 set max_parallel_degree = 6; 
 Explain analyze select * from tbl_parallel_test where c1 < 10000 
 and c2 like '%bb%'; 
 QUERY PLAN 
 ------------------------------------------------------------------------------------------------------------- 
 Gather (cost=1000.00..29701.57 rows=1 width=1008) 
 (actual time=182.708..182.708 rows=0 loops=1) 
 Number of Workers: 5 
 -> Parallel Seq Scan on tbl_parallel_test 
 (cost=0.00..28701.47 rows=1 width=1008) 
 (actual time=179.496..1081.120 rows=0 loops=1) 
 Filter: ((c1 < 10000) AND (c2 ~~ '%bb%'::text)) 
 Rows Removed by Filter: 1000000 
 Planning time: 0.078 ms 
 Execution time: 200.610 ms 
 (7 rows) 

Here, we can see how changing max_parallel_degree allows the usage of parallel workers
to perform parallel sequential scans. We can notice in the above example that even though
we have set max_parallel_degree to 6, it still uses 5 workers; the reason is that
currently the number of parallel workers is chosen based on the size of the relation.

Next, let us discuss the usage of functions in parallel query. A new clause, PARALLEL,
has been added to the CREATE FUNCTION statement. There are three valid values that
can be used with this clause.

1. PARALLEL Unsafe - This indicates that the function can't be executed in parallel mode
and the presence of such a function in a SQL statement forces a serial execution plan.
2. PARALLEL Restricted - This indicates that the function can be executed in parallel mode,
but the execution is restricted to parallel group leader. As of now, if the qualification for any
particular relation has anything that is parallel restricted, that relation won't be chosen for
parallelism.
3. Parallel Safe - This indicates that the function is safe to run in parallel mode without
restriction.

The default value for function is PARALLEL Unsafe.

Now let us see the impact of using Parallel Safe and Unsafe functions in queries. I will
continue using the query from the previous example to explain the concept.

Create a Parallel Safe function
 create or replace function calc_factorial(a integer, fact_val integer) 
 returns integer 
 as $$ 
 begin 
 perform (fact_val)!; 
 return a; 
 end; 
 $$ language plpgsql PARALLEL Safe; 
Use it in query
 Explain analyze select * from tbl_parallel_test where 
 c1 < calc_factorial(10000, 10) 
 and c2 like '%bb%'; 
 QUERY PLAN 
 -------------------------------------------------------------------------------- 
 Gather (cost=1000.00..75154.99 rows=1 width=1008) 
 (actual time=120566.456..120566.456 rows=0 loops=1) 
 Number of Workers: 5 
 -> Parallel Seq Scan on tbl_parallel_test 
 (cost=0.00..74154.89 rows=1 width=1008) 
 (actual time=119635.421..359721.498 rows=0 loops=1) 
 Filter: ((c2 ~~ '%bb%'::text) AND (c1 < calc_factorial(10000, 10))) 
 Rows Removed by Filter: 1000000 
 Planning time: 54.904 ms 
 Execution time: 120622.631 ms 
 (7 rows) 

Here we can see that Parallel Plan is chosen and the parallel safe function
is pushed to workers for evaluation of quals.

Now let's change that function to PARALLEL Unsafe and see how the above
query behaves.

 Alter Function calc_factorial(integer, integer) PARALLEL Unsafe; 
 Explain analyze select * from tbl_parallel_test where 
 c1 < calc_factorial(10000, 10) 
 and c2 like '%bb%'; 
 QUERY PLAN 
 -------------------------------------------------------------------------------- 
 Seq Scan on tbl_parallel_test 
 (cost=0.00..407851.91 rows=1 width=1008) 
 (actual time=33166.138..33166.138 rows=0 loops=1) 
 Filter: ((c2 ~~ '%bb%'::text) AND (c1 < calc_factorial(10000, 10))) 
 Rows Removed by Filter: 1000000 
 Planning time: 0.162 ms 
 Execution time: 33166.208 ms 
 (5 rows) 

So using parallel unsafe functions in queries would lead to serial plans.

Next, let us see the Performance characteristics of Parallelism:

Non-default settings used to collect performance data:
 shared_buffers=32GB; min_wal_size=5GB; max_wal_size=10GB 
 checkpoint_timeout =30min; max_connections=300; 
 max_worker_processes=100; 

Test setup
 create table tbl_perf(c1 int, c2 char(1000)); 
 insert into tbl_perf values(generate_series(1,30000000),'aaaaa'); 
 Explain analyze select c1 from tbl_perf where 
 c1 > calc_factorial($1, 10) and 
 c2 like '%aa%'; 
The function calc_factorial is the same as used in the previous example, and the values passed
to it are such that the desired percentage of rows can be selected. For example:
 --to select 1% of rows, the below query can be used 
 Explain analyze select c1 from tbl_perf where 
 c1 > calc_factorial(29700000,10) and 
 c2 like '%aa%'; 
 --to select 10% of rows, the below query can be used 
 Explain analyze select c1 from tbl_perf where 
 c1 > calc_factorial(27000000,10) and 
 c2 like '%aa%'; 
 --to select 25% of rows, the below query can be used 
 Explain analyze select c1 from tbl_perf where 
 c1 > calc_factorial(22500000,10) and 
 c2 like '%aa%'; 
Performance Data -

1. With an increase in the degree of parallelism (more parallel workers), the time to complete
the execution reduces.
2. Along with the workers, the master backend also participates in the execution, due to which
you can see more time reduction in some cases.
3. After a certain point, increasing the max parallel degree won't help.

The cases we have seen in this blog are mostly the cases where parallel query helps by
using the workers; however, there exist some cases, like when the qualification is very cheap,
where it hurts or won't help even by employing more workers. More investigation is
needed to make sure that the planner won't choose such plans for parallelism.

Wednesday, 13 May 2015

Extend Tar Format in pg_basebackup


As of now, one can't reliably use the tar format to take a backup on Windows
because it can't restore tablespace data, which is stored in the form of symbolic
links in <data_directory>/pg_tblspc/. The reason is that native Windows
utilities are not able to create symbolic links while extracting files from a
tar archive. It might be possible to create symbolic links if cygwin is installed
on your system; however, we need this feature to work for native Windows as
well.

In PostgreSQL 9.5, a new feature (commit id - 72d422a5) to extend the existing
tar format made it possible to reliably take the backup (in tar mode).
From the user's perspective, nothing much has changed for taking the backup,
except that tar format mode (--format=tar) in pg_basebackup (of the PostgreSQL
9.5 version) will only work with server version 9.5 or later. This feature is mainly
required for Windows, but for the sake of consistency it has been changed for all
platforms; it should also enable long (length greater than 99) target symbolic
links for the tar format (I think the changes for that are still not done, but they
can be done now that this feature is committed).
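
A typical invocation for a tar-format base backup looks like the sketch below (host, user, and target directory are placeholders):

pg_basebackup -h localhost -U backup_user -D /path/to/backup -Ft -X fetch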

The basic idea behind the feature is that it forms a tablespace map of
all the tablespace symbolic links that are present inside
<data_directory>/pg_tblspc/ and stores it in the data directory for
exclusive backups (aka backups taken via the pg_start_backup() and
pg_stop_backup() functions) and in the backup archive for non-exclusive
backups (aka backups taken by pg_basebackup).

The format of tablespace_map file is:
16384 E:\WorkSpace\PostgreSQL\master\tbs
16388 E:\WorkSpace\PostgreSQL\master\tbs 2 3

The tablespace symbolic links are restored during archive recovery and the
tablespace_map file will be renamed to tablespace_map.old at the end of
recovery similar to backup_label file.

Friday, 17 April 2015

Write Scalability in PostgreSQL




I have run some benchmark tests to see the write performance/scalability in
PostgreSQL 9.5 and thought it would be good to share them with others,
so I am writing this blog post.


I have run pgbench tests (a TPC-B (sort of) load) to compare the performance
difference between different modes and scale factors in HEAD (e5f455f5) on
IBM POWER-8 having 24 cores, 192 hardware threads, and 492GB RAM,
and here are the performance results

Some of the non-default settings used in all the tests are:
min_wal_size=15GB
max_wal_size=20GB
checkpoint_timeout =35min
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
autovacuum=off

I have kept autovacuum off to reduce the fluctuation due to it, and the database
is dropped and re-created after each run. I have kept high values
of min_wal_size and max_wal_size to reduce the effect of checkpoints; probably
somewhat lower values could have served the purpose of this workload, but I
haven't tried it.

The data is mainly taken for 2 kinds of modes (synchronous_commit = on | off) and
at 2 different scale factors to cover the cases when all the data fits in shared buffers
(scale_factor = 300) and when all the data can't fit in shared buffers, but can fit in
RAM (scale_factor = 3000).
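
For reference, a rough sketch of the pgbench invocations for such a run (the duration and prepared query mode are my choices for illustration; synchronous_commit is changed in the server configuration between runs):

pgbench -i -s 300 postgres
pgbench -M prepared -c 64 -j 64 -T 300 postgres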

First let's talk about the synchronous_commit = off case. Here, when all the data fits in
shared_buffers (scale_factor = 300), we can see scalability up to 64 clients,
with TPS being approximately 75 percent higher at 64 clients as compared to 8
clients, which doesn't look bad. When all the data doesn't fit in shared buffers,
but fits in RAM (scale_factor = 3000), we can see scalability up to 32 clients, with
TPS being 64 percent higher than at 8 clients, and then it falls off from there.

One major difference in the case of writes when the data doesn't fit in shared_buffers is
that backends performing transactions need to write the dirty buffers themselves
when they are not able to find a clean buffer to read the page into, and this can hamper
the TPS.

Now let's talk about the synchronous_commit = on case. Here, when all the data fits in
shared_buffers (scale_factor = 300), we can see scalability up to 64 clients,
with TPS being approximately 189 percent higher at 64 clients as compared to
8 clients, which sounds good. When all the data doesn't fit in shared buffers,
but fits in RAM (scale_factor = 3000), we see a pretty flat graph with some
scalability up to 16 clients, with TPS being approximately 22 percent higher than
at 8 clients, and then it stays as it is.

Here one point to note is that when the data fits in shared_buffers (scale_factor = 300),
TPS at higher client-count (64) in synchronous_commit = on mode becomes equivalent to
TPS in synchronous_commit = off which suggests that there is no major contention
due to WAL writing in such loads.

In the synchronous_commit = on case, when the data doesn't fit in shared_buffers
(scale_factor = 3000), the TPS is quite low, and one reason is that backends might
be performing writes themselves, but I am not sure if the performance is so low just
due to that reason, as I have tried different values of the bgwriter-related parameters
(bgwriter_delay, bgwriter_lru_maxpages, bgwriter_lru_multiplier) and there is not
much difference.

As per my knowledge, the locks that can lead to contention for this workload
are:
a. ProcArrayLock (used for taking snapshot and at transaction commit)
b. WALWriteLock (used for performing WALWrites)
c. CLOGControlLock (used to read and write transaction status)
d. WALInsertLocks (used for writing data to WAL buffer)

I think among these ProcArrayLock and WALWriteLock are the candidates
which can be the reason for contention, but I haven't done any deep analysis
to find out the same.

Now it could be that the bottleneck is due to multiple locks, as was the case
for read operations, which I have explained in my previous Read Scalability
blog, or it could be due to one of these locks. I think all this needs further
analysis and work. That's all I want to say for now.