We are using PostgreSQL 13 as our core server and have hit a performance bottleneck.
The hardware: 2 CPUs (AMD EPYC 9754, 128 cores / 256 threads each), 128 GB of memory, and a hardware RAID 0 of 2 × 4 TB SSDs (Samsung 990 Pro).
I think the PG server fails to use the full potential of the hardware.
Before asking this question here, I did the following:
- Checked my SQL query (it hits the primary key exactly, with no sequential scan);
- Confirmed my program runs correctly (I even wrote the same logic in both Python and C++, and both run slowly);
- Configured my PG server to use huge pages (80 GB of HugePage memory in total, allowing the PG server to use 64 GB as shared_buffers);
- Increased the memory limits of each PG worker (work_mem, temp_buffers, ...);
- Increased the number of parallel workers (max_worker_processes, max_parallel_workers_per_gather, max_parallel_workers, ...);
- Turned on force_parallel_mode;
- Decreased the I/O and CPU cost settings (random_page_cost=1.01, cpu_tuple_cost=0.0025, cpu_index_tuple_cost=0.001, ...);
- Raised default_statistics_target to 1000;
- Disabled replication;
- Started 32 or 64 processes connecting to the server in parallel, querying different rows of the same table with different primary keys.
- All queries are read-only; there are NO insert/update/DDL operations during testing.
What I'm expecting:
- The PG server uses at least 50% of the CPU/memory resources;
- NO single process blocks other queries;
- Total throughput increases roughly linearly with the amount of resources I allocate to the PG server (maybe I'm naive?).
My testing:
The table definition:
CREATE TABLE IF NOT EXISTS ob_lots (
 t_stamp INT8 NOT NULL,
 trd_sec INT4 NOT NULL,
 o_level INT2 NOT NULL,
 i_olots INT4 NOT NULL,
 f_olots FLOAT4 NULL,
 CONSTRAINT prk_ob_lots PRIMARY KEY (t_stamp, o_level)
);
The query:
SELECT f_olots, t_stamp, trd_sec FROM ob_lots WHERE t_stamp BETWEEN 1ドル AND 2ドル ORDER BY t_stamp DESC, o_level DESC LIMIT 4096;
Testing program: start many instances of my program, each randomly querying rows of the table above in parallel with the query above. As the number of clients grew from 2 to 4, 4 to 8, and 8 to 16, we observed the total throughput nearly doubling each time. But from 16 to 32 or higher, the total throughput never changed.
EXPLAIN (ANALYZE, BUFFERS) output:
Gather (cost=1000.28..1002.41 rows=1 width=18) (actual time=6.840..9.232 rows=0 loops=1)
 Workers Planned: 1
 Workers Launched: 1
 Single Copy: true
 Buffers: shared hit=8
 -> Limit (cost=0.28..2.31 rows=1 width=18) (actual time=0.033..0.033 rows=0 loops=1)
 Buffers: shared hit=8
 -> Index Scan Backward using prk_ob_lots_sc5555 on ob_lots_sc5555 (cost=0.28..2.31 rows=1 width=18) (actual time=0.031..0.031 rows=0 loops=1)
 Index Cond: ((t_stamp >= 123) AND (t_stamp <= 456))
 Buffers: shared hit=8
Planning:
 Buffers: shared hit=109
Planning Time: 0.759 ms
Execution Time: 9.274 ms
The following things seem strange to me:
- One PG process uses almost 100% of a single CPU core, while the others use very little;
- There is NO I/O and NO swapping during the test, and plenty of memory/CPU/I/O capacity sits free/idle. It looks as if all the other workers are waiting for the master PG process (I guess it's the one at 100% CPU), which creates a bottleneck;
- The test DB occupies about 4 GB of disk space, small enough to be held entirely in memory by PG; indeed, as we observed, there are no I/O operations at all.
What is it doing? Why are all workers waiting for a single process?
This is the main part of my PG conf (I only list the lines I changed):
max_connections = 2048
shared_buffers = 64GB
huge_pages = on
temp_buffers = 256MB
max_prepared_transactions = 256
work_mem = 256MB
maintenance_work_mem = 16GB
autovacuum_work_mem = -1
dynamic_shared_memory_type = posix
effective_io_concurrency = 1000
maintenance_io_concurrency = 1000
max_worker_processes = 256
max_parallel_maintenance_workers = 256
max_parallel_workers_per_gather = 256
parallel_leader_participation = on
max_parallel_workers = 256
fsync = off
synchronous_commit = off
full_page_writes = off
wal_compression = on
wal_buffers = -1
wal_writer_delay = 10000ms
wal_writer_flush_after = 1GB
commit_delay = 100000
commit_siblings = 128
checkpoint_timeout = 1d
max_wal_size = 128GB
min_wal_size = 32GB
checkpoint_completion_target = 1.0
checkpoint_flush_after = 0
checkpoint_warning = 0
max_wal_senders = 0
seq_page_cost = 1.0
random_page_cost = 1.01
cpu_tuple_cost = 0.0025
cpu_index_tuple_cost = 0.001
cpu_operator_cost = 0.00125
effective_cache_size = 64GB
default_statistics_target = 1000
force_parallel_mode = on
autovacuum = on
And the output of top and iotop:
top - 16:38:16 up 4:09, 2 users, load average: 14.16, 9.14, 3.97
Tasks: 1581 total, 2 running, 1573 sleeping, 0 stopped, 6 zombie
%Cpu(s): 3.5 us, 4.3 sy, 0.0 ni, 92.1 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
GiB Mem : 125.6 total, 34.6 free, 82.9 used, 9.1 buff/cache
GiB Swap: 1.0 total, 1.0 free, 0.0 used. 42.6 avail Mem
 PID USER PR NI VIRT RES SHR SWAP %CPU %MEM TIME+ COMMAND
31159 leon 20 0 4654.2m 105.1m 12.5m 0.0m S 152.7 0.1 7:09.93 load
3186 postgres 0 -20 66.7g 18.6m 16.2m 0.0m R 99.5 0.0 5:03.16 postgres # is this the master?
3192 postgres 0 -20 80.4m 6.8m 3.6m 0.0m S 8.2 0.0 0:24.97 postgres
32218 postgres 0 -20 66.7g 13.5m 9.9m 0.0m S 5.8 0.0 0:12.90 postgres
31217 postgres 0 -20 66.7g 13.4m 9.9m 0.0m S 5.3 0.0 0:12.74 postgres
31234 postgres 0 -20 66.7g 13.5m 9.9m 0.0m S 5.3 0.0 0:12.74 postgres
( many many postgres process... )
Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
Current DISK READ: 0.00 B/s | Current DISK WRITE: 0.00 B/s
 TID PRIO USER DISK READ DISK WRITE> COMMAND
 1 be/4 root 0.00 B/s 0.00 B/s init
 2 be/4 root 0.00 B/s 0.00 B/s [kthreadd]
 3 be/0 root 0.00 B/s 0.00 B/s [rcu_gp]
 4 be/0 root 0.00 B/s 0.00 B/s [rcu_par_gp]
 5 be/0 root 0.00 B/s 0.00 B/s [slub_flushwq]
1 Answer
I think you are suffering from a misconception. A single query like this will never keep a strong machine busy. One CPU core is processing and the others just stand by. Distributing the workload across several cores would actually make processing slower because of the overhead of parallelism.
You can see that from your execution plan:
All the actual work is done in the backward index scan which takes 0.03 milliseconds.
The query only takes 9 milliseconds because you turned on force_parallel_mode (which you should never ever do; this parameter is only interesting for PostgreSQL development and regression tests). Starting up the additional process, creating shared memory segments to exchange data, and other overhead consumed all that time.
If you want parallel processing with short statements in PostgreSQL, you have to open several database sessions and run statements on each of them. Effectively, you have to parallelize in your application.
For bulk loading into PostgreSQL, consider using COPY rather than INSERT. If you need to use INSERT, use prepared statements.
Most of your parameter changes range from pointless to detrimental:
max_connections = 2048
Those are way too many. Don't exceed your number of cores. Don't overload your machine.
shared_buffers = 64GB
That won't help with loading data.
huge_pages = on
That one is OK.
temp_buffers = 256MB
Useless unless you are using temporary tables.
max_prepared_transactions = 256
That is an evil setting and may bring your database down. Don't enable prepared transactions unless you absolutely must.
work_mem = 256MB
Okay, but pointless with small statements.
maintenance_work_mem = 16GB
Okay if you have enough RAM.
effective_io_concurrency = 1000
Probably too high.
maintenance_io_concurrency = 1000
Probably too high.
max_worker_processes = 256
Pointless, if your statements are small.
max_parallel_maintenance_workers = 256
Way too high.
max_parallel_workers_per_gather = 256
Way too high.
max_parallel_workers = 256
Probably too high.
fsync = off
This will cause data corruption. Don't.
synchronous_commit = off
This is okay, but only necessary if you have tiny transactions, which is an anti-pattern for bulk loading.
full_page_writes = off
This will cause data corruption. Don't.
wal_compression = on
This is okay, but only useful if the amount of WAL written is a bigger problem than CPU utilization.
wal_writer_delay = 10000ms
Pointless.
wal_writer_flush_after = 1GB
Won't make a difference.
commit_delay = 100000
This will make your transactions very slow. Don't.
commit_siblings = 128
This is completely ridiculous.
checkpoint_timeout = 1d
Increasing this parameter may actually be a good idea, but one day is excessive.
max_wal_size = 128GB
That one is actually a useful setting.
min_wal_size = 32GB
Pointless.
checkpoint_completion_target = 1.0
Probably pointless, might harm if it delays the next checkpoint.
checkpoint_flush_after = 0
That will harm performance. Don't.
checkpoint_warning = 0
Pointless.
max_wal_senders = 0
Pointless.
random_page_cost = 1.01
Good if you have SSDs or similar.
cpu_tuple_cost = 0.0025
Don't mess with that.
cpu_index_tuple_cost = 0.001
Don't mess with that.
cpu_operator_cost = 0.00125
Don't mess with that.
effective_cache_size = 64GB
Good!
default_statistics_target = 1000
That will harm performance.
force_parallel_mode = on
That will harm performance.
- I need to squeeze every drop of the hardware's potential. In fact, I'm training an AI model, and the PG server is used to store all the input features (it's NOT an OLTP workload). The actual amount of feature data is about tens of TBs. It is impossible to preload all the features into memory and then start training; I can only load a few batches at a time, feed them to the model, and so on. The more throughput, the faster the training. The biggest trouble I'm facing now is that loading the inputs is so slow that the GPU is almost idle. – Leon, Mar 24
- I know that in terms of the latency of a single query it's not slow. But what I need is overall throughput, not a single query. Furthermore, I don't think it's normal that only one worker is exhausted while the others appear to be asleep, given that I started 64 or more clients concurrently to load data. Ideally I'd expect 256 workers serving 256 clients, because I have 512 virtual CPUs. Anyway, the phenomenon of one worker busy and the rest sleeping is abnormal, isn't it? – Leon, Mar 24
- It is normal if you are running your workload on a single database session. – Laurenz Albe, Mar 24
- @Leon The database configuration is pretty bad; you should change that first. Such a small database with so much RAM will live in memory, just like an in-memory database. A Redis database with the wrong configuration will also be slow; changing the technology will not change that. – Frank Heikens, Mar 24
- "I merely used a small part for testing": this could lead to (very!) different results in production when you're working with large amounts of data. Advice: test with real data, not a small subset. Also improve your configuration before any testing. – Frank Heikens, Mar 24
- ... show EXPLAIN (ANALYZE, BUFFERS) output for them. Not all statements can use parallel query, and not all would benefit. Many of the configuration parameters you set are wrong and will result in worse performance.
- With the primary key on (t_stamp, o_level), that query should take a few milliseconds. Are you sure that the other processes are waiting for that query? Perhaps they are just idle and have nothing to do. Can you describe the actual problem? Is it taking too long? Or are you only surprised that the machine is mostly idle? Also, I'll repeat: please show the output of EXPLAIN (ANALYZE, BUFFERS).