We are using PostgreSQL 13 as our core server and have hit a performance bottleneck.
The hardware: 2 CPUs (AMD EPYC 9754, 128 cores / 256 threads each), 128 GB of memory, and a hardware RAID 0 of 2 × 4 TB SSDs (Samsung 990 Pro).
I think the PG server fails to use the full potential of the hardware.
Before asking this question here, I did the following:
- Checked my SQL query (it hits the primary key exactly, with no sequential scan);
- Confirmed my program runs correctly (I even wrote the same logic in both Python and C++, and both run slowly);
- Configured my PG server to use huge pages (80 GB of HugePage memory in total, allowing the PG server to use 64 GB as shared_buffers);
- Increased the memory limits of each PG worker (work_mem, temp_buffers, ...);
- Increased the number of parallel workers (max_worker_processes, max_parallel_workers_per_gather, max_parallel_workers, ...);
- Turned on force_parallel_mode;
- Decreased the I/O and CPU cost settings (random_page_cost=1.01, cpu_tuple_cost=0.0025, cpu_index_tuple_cost=0.001, ...);
- Raised default_statistics_target to 1000;
- Disabled replication;
- Started 32 or 64 processes connecting to the server in parallel, querying different rows of the same table with different primary keys.
- All queries are read-only; there are NO insert/update/DDL operations during testing.
What I'm expecting:
- The PG server uses at least 50% of the CPU/memory resources;
- NO single process blocks other queries;
- Total throughput increases roughly linearly with the amount of resources I allocate to the PG server (maybe I'm naive?).
My testing:
The table definition:
CREATE TABLE IF NOT EXISTS ob_lots (
 t_stamp INT8 NOT NULL,
 trd_sec INT4 NOT NULL,
 o_level INT2 NOT NULL,
 i_olots INT4 NOT NULL,
 f_olots FLOAT4 NULL,
 CONSTRAINT prk_ob_lots PRIMARY KEY (t_stamp, o_level)
);
The query:
SELECT f_olots, t_stamp, trd_sec FROM ob_lots WHERE t_stamp BETWEEN 1ドル AND 2ドル ORDER BY t_stamp DESC, o_level DESC LIMIT 4096;
Testing program: start many instances of my program, each randomly querying rows of the table above in parallel with the query above. As the number of clients grew from 2 to 4, 4 to 8, and 8 to 16, we observed the total throughput nearly doubling each time. But from 16 to 32 or higher, the total throughput never changed.
EXPLAIN (ANALYZE, BUFFERS) output:
Gather (cost=1000.28..1002.41 rows=1 width=18) (actual time=6.840..9.232 rows=0 loops=1)
 Workers Planned: 1
 Workers Launched: 1
 Single Copy: true
 Buffers: shared hit=8
 -> Limit (cost=0.28..2.31 rows=1 width=18) (actual time=0.033..0.033 rows=0 loops=1)
 Buffers: shared hit=8
 -> Index Scan Backward using prk_ob_lots_sc5555 on ob_lots_sc5555 (cost=0.28..2.31 rows=1 width=18) (actual time=0.031..0.031 rows=0 loops=1)
 Index Cond: ((t_stamp >= 123) AND (t_stamp <= 456))
 Buffers: shared hit=8
Planning:
 Buffers: shared hit=109
Planning Time: 0.759 ms
Execution Time: 9.274 ms
The following things seem strange to me:
- One PG process uses almost 100% of a single CPU core, while the others use very little;
- There is NO I/O and NO swapping during the test, and plenty of memory/CPU/I/O capacity sits free/idle. It looks as if all the other workers are waiting for the master PG process (I guess it's the one at 100% CPU), which creates a bottleneck;
- The test DB occupies about 4 GB of disk space, small enough to be held entirely in memory by PG; indeed, as we observed, there are no I/O operations at all.
What is it doing? Why are all workers waiting for a single process?
This is the main part of my PG conf (I only list the lines I changed):
max_connections = 2048
shared_buffers = 64GB
huge_pages = on
temp_buffers = 256MB
max_prepared_transactions = 256
work_mem = 256MB
maintenance_work_mem = 16GB
autovacuum_work_mem = -1
dynamic_shared_memory_type = posix
effective_io_concurrency = 1000
maintenance_io_concurrency = 1000
max_worker_processes = 256
max_parallel_maintenance_workers = 256
max_parallel_workers_per_gather = 256
parallel_leader_participation = on
max_parallel_workers = 256
fsync = off
synchronous_commit = off
full_page_writes = off
wal_compression = on
wal_buffers = -1
wal_writer_delay = 10000ms
wal_writer_flush_after = 1GB
commit_delay = 100000
commit_siblings = 128
checkpoint_timeout = 1d
max_wal_size = 128GB
min_wal_size = 32GB
checkpoint_completion_target = 1.0
checkpoint_flush_after = 0
checkpoint_warning = 0
max_wal_senders = 0
seq_page_cost = 1.0
random_page_cost = 1.01
cpu_tuple_cost = 0.0025
cpu_index_tuple_cost = 0.001
cpu_operator_cost = 0.00125
effective_cache_size = 64GB
default_statistics_target = 1000
force_parallel_mode = on
autovacuum = on
And the output of top and iotop:
top - 16:38:16 up 4:09, 2 users, load average: 14.16, 9.14, 3.97
Tasks: 1581 total, 2 running, 1573 sleeping, 0 stopped, 6 zombie
%Cpu(s): 3.5 us, 4.3 sy, 0.0 ni, 92.1 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st
GiB Mem : 125.6 total, 34.6 free, 82.9 used, 9.1 buff/cache
GiB Swap: 1.0 total, 1.0 free, 0.0 used. 42.6 avail Mem
 PID USER PR NI VIRT RES SHR SWAP %CPU %MEM TIME+ COMMAND
31159 leon 20 0 4654.2m 105.1m 12.5m 0.0m S 152.7 0.1 7:09.93 load
3186 postgres 0 -20 66.7g 18.6m 16.2m 0.0m R 99.5 0.0 5:03.16 postgres # is this the master?
3192 postgres 0 -20 80.4m 6.8m 3.6m 0.0m S 8.2 0.0 0:24.97 postgres
32218 postgres 0 -20 66.7g 13.5m 9.9m 0.0m S 5.8 0.0 0:12.90 postgres
31217 postgres 0 -20 66.7g 13.4m 9.9m 0.0m S 5.3 0.0 0:12.74 postgres
31234 postgres 0 -20 66.7g 13.5m 9.9m 0.0m S 5.3 0.0 0:12.74 postgres
( many many postgres process... )
Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
Current DISK READ: 0.00 B/s | Current DISK WRITE: 0.00 B/s
 TID PRIO USER DISK READ DISK WRITE> COMMAND
 1 be/4 root 0.00 B/s 0.00 B/s init
 2 be/4 root 0.00 B/s 0.00 B/s [kthreadd]
 3 be/0 root 0.00 B/s 0.00 B/s [rcu_gp]
 4 be/0 root 0.00 B/s 0.00 B/s [rcu_par_gp]
 5 be/0 root 0.00 B/s 0.00 B/s [slub_flushwq]
1 Answer
I think you are suffering from a misconception. A single query like this will never keep a strong machine busy. One CPU core is processing and the others just stand by. Distributing the workload across several cores would actually make processing slower because of the overhead of parallelism.
You can see that from your execution plan:
All the actual work is done in the backward index scan which takes 0.03 milliseconds.
The query only takes 9 milliseconds because you turned on force_parallel_mode (which you should never ever do; this parameter is only interesting for PostgreSQL development and regression tests). Starting up the additional process, creating shared memory segments to exchange data, and other overhead consumed all that time.
If you want parallel processing with short statements in PostgreSQL, you have to open several database sessions and run statements on each of them. Effectively, you have to parallelize in your application.
For bulk loading into PostgreSQL, consider using COPY rather than INSERT. If you need to use INSERT, use prepared statements.
Most of your parameter changes range from pointless to detrimental:
max_connections = 2048
Those are way too many. Don't exceed your number of cores. Don't overload your machine.
shared_buffers = 64GB
That won't help with loading data.
huge_pages = on
That one is OK.
temp_buffers = 256MB
Useless unless you are using temporary tables.
max_prepared_transactions = 256
That is an evil setting and may bring your database down. Don't enable prepared transactions unless you absolutely must.
work_mem = 256MB
Okay, but pointless with small statements.
maintenance_work_mem = 16GB
Okay if you have enough RAM.
effective_io_concurrency = 1000
Probably too high.
maintenance_io_concurrency = 1000
Probably too high.
max_worker_processes = 256
Pointless, if your statements are small.
max_parallel_maintenance_workers = 256
Way too high.
max_parallel_workers_per_gather = 256
Way too high.
max_parallel_workers = 256
Probably too high.
fsync = off
This will cause data corruption. Don't.
synchronous_commit = off
This is okay, but only necessary if you have tiny transactions, which is an anti-pattern for bulk loading.
full_page_writes = off
This will cause data corruption. Don't.
wal_compression = on
This is okay, but only useful if the amount of WAL written is a bigger problem than CPU utilization.
wal_writer_delay = 10000ms
Pointless.
wal_writer_flush_after = 1GB
Won't make a difference.
commit_delay = 100000
This will make your transactions very slow. Don't.
commit_siblings = 128
This is completely ridiculous.
checkpoint_timeout = 1d
Increasing this parameter may actually be a good idea, but one day is excessive.
max_wal_size = 128GB
That one is actually a useful setting.
min_wal_size = 32GB
Pointless.
checkpoint_completion_target = 1.0
Probably pointless, might harm if it delays the next checkpoint.
checkpoint_flush_after = 0
That will harm performance. Don't.
checkpoint_warning = 0
Pointless.
max_wal_senders = 0
Pointless.
random_page_cost = 1.01
Good if you have SSDs or similar.
cpu_tuple_cost = 0.0025
Don't mess with that.
cpu_index_tuple_cost = 0.001
Don't mess with that.
cpu_operator_cost = 0.00125
Don't mess with that.
effective_cache_size = 64GB
Good!
default_statistics_target = 1000
That will harm performance.
force_parallel_mode = on
That will harm performance.
- I need to squeeze every drop of the hardware's potential. In fact, I'm training an AI model, and the PG server is used to store all the input features (it's NOT an OLTP workload). The actual amount of feature data is about tens of TBs. It is impossible to preload all the features into memory and then start training; I can only load a few batches at a time, feed them to the model, and so on. The more throughput, the faster the training. The biggest trouble I'm facing now is that loading the inputs is so slow that the GPU is almost idle. – Leon, Mar 24
- I know that in terms of the latency of a single query it's not slow. But what I need is overall throughput, not a single query. Furthermore, I don't think it's normal that only one worker is exhausted while the others appear to be asleep, given that I started 64 or more clients concurrently to load data. Ideally I'd expect 256 workers serving 256 clients, because I have 512 virtual CPUs. Anyway, the phenomenon of one worker busy and the rest sleeping is abnormal, isn't it? – Leon, Mar 24
- It is normal if you are running your workload on a single database session. – Laurenz Albe, Mar 24
- @Leon The database configuration is pretty bad; you should change that first. Such a small database with so much RAM will live in memory, just like an in-memory database. A Redis database with the wrong configuration will also be slow; changing the technology will not change that. – Frank Heikens, Mar 24
- "I merely used a small part for testing": this could lead to (very!) different results in production when you're working with large amounts of data. Advice: test with real data, not a small subset. Also improve your configuration before any testing. – Frank Heikens, Mar 24
- ... show EXPLAIN (ANALYZE, BUFFERS) output for them. Not all statements can use parallel query, and not all would benefit. Many of the configuration parameters you set are wrong and will result in worse performance.
- With the primary key on (t_stamp, o_level), that query should take a few milliseconds. Are you sure that the other processes are waiting for that query? Perhaps they are just idle and have nothing to do. Can you describe the actual problem? Is it taking too long? Or are you only surprised that the machine is mostly idle? Also, I'll repeat: please show the output of EXPLAIN (ANALYZE, BUFFERS).