
We are using PostgreSQL 13 as our core database server and have run into a performance bottleneck.

The hardware consists of 2 CPUs (AMD EPYC 9754, 128 cores / 256 threads each), 128 GB of memory, and a hardware RAID 0 array of 2 × 4 TB SSDs (Samsung 990 Pro).

I think that the PG server fails to use the full potential of the hardware.

Before asking this question here, I did the following:

  1. Checked my SQL query (it uses the primary key exactly, with no sequential scan);
  2. Confirmed that my program runs correctly (I even wrote the same logic in both Python and C++, and both run slowly);
  3. Configured my PG server to use huge pages (80 GB of huge page memory in total, with the PG server allowed to use 64 GB as shared_buffers);
  4. Increased the memory limits of each PG worker (work_mem, temp_buffers, ...);
  5. Increased the number of parallel workers (max_worker_processes, max_parallel_workers_per_gather, max_parallel_workers, ...);
  6. Turned on force_parallel_mode;
  7. Decreased the I/O and CPU cost settings (random_page_cost=1.01, cpu_tuple_cost=0.0025, cpu_index_tuple_cost=0.001, ...);
  8. Raised default_statistics_target to 1000;
  9. Disabled replication;
  10. Started 32 or 64 processes that connect to the server in parallel and query different rows of the same table with different primary keys;
  11. All queries are read-only, and there are NO insert/update/DDL operations during testing.

What I'm expecting:

  1. The PG server uses at least 50% of the CPU/memory resources;
  2. No single process blocks other queries;
  3. Total throughput increases roughly linearly with the amount of resources I allocate to the PG server (maybe I'm naive?).

My testing:

  1. The table definition:

    CREATE TABLE IF NOT EXISTS ob_lots(
     t_stamp INT8 NOT NULL,
     trd_sec INT4 NOT NULL,
     o_level INT2 NOT NULL,
     i_olots INT4 NOT NULL,
     f_olots FLOAT4 NULL,
    CONSTRAINT prk_ob_lots PRIMARY KEY( t_stamp, o_level ) );
    
  2. The query:

    SELECT f_olots, t_stamp, trd_sec FROM ob_lots
    WHERE t_stamp BETWEEN 1ドル AND 2ドル
    ORDER BY t_stamp DESC, o_level DESC
    LIMIT 4096;
    
  3. Testing program: start many instances of my program that query random rows of the above table in parallel, using the query above. When the number of clients increased from 2 to 4, from 4 to 8, or from 8 to 16, we observed that the total throughput nearly doubled each time. But from 16 to 32 or higher, the total throughput never changed. (A simplified sketch of such a test client appears after the EXPLAIN output below.)

  4. The EXPLAIN (ANALYZE, BUFFERS) output:

    Gather (cost=1000.28..1002.41 rows=1 width=18) (actual time=6.840..9.232 rows=0 loops=1)
    Workers Planned: 1
    Workers Launched: 1
    Single Copy: true
    Buffers: shared hit=8
    -> Limit (cost=0.28..2.31 rows=1 width=18) (actual time=0.033..0.033 rows=0 loops=1)
     Buffers: shared hit=8
     -> Index Scan Backward using prk_ob_lots_sc5555 on ob_lots_sc5555 (cost=0.28..2.31 rows=1 width=18) (actual time=0.031..0.031 rows=0 loops=1)
     Index Cond: ((t_stamp >= 123) AND (t_stamp <= 456))
     Buffers: shared hit=8
    Planning:
    Buffers: shared hit=109
    Planning Time: 0.759 ms
    Execution Time: 9.274 ms
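
A simplified Python sketch of the test clients described in item 3 (the real programs are separate C++/Python tools; the DSN, key range, and query count below are placeholders, and psycopg2 is assumed):

    # bench_client.py -- simplified sketch of the parallel test clients
    # (DSN and key range are placeholders, not the real values)
    import random
    import time
    from multiprocessing import Process, Queue

    import psycopg2

    DSN = "dbname=test user=postgres host=localhost"   # placeholder
    MAX_STAMP = 10_000_000                             # placeholder key range
    SQL = ("SELECT f_olots, t_stamp, trd_sec FROM ob_lots"
           " WHERE t_stamp BETWEEN %s AND %s"
           " ORDER BY t_stamp DESC, o_level DESC LIMIT 4096")

    def worker(n_queries, out):
        conn = psycopg2.connect(DSN)      # one dedicated session per client
        cur = conn.cursor()
        for _ in range(n_queries):
            lo = random.randrange(MAX_STAMP)
            cur.execute(SQL, (lo, lo + 4096))
            cur.fetchall()
        conn.close()
        out.put(n_queries)

    def run(n_clients, n_queries=1000):
        out = Queue()
        procs = [Process(target=worker, args=(n_queries, out))
                 for _ in range(n_clients)]
        start = time.time()
        for p in procs:
            p.start()
        for p in procs:
            p.join()
        total = sum(out.get() for _ in procs)
        return total / (time.time() - start)  # queries per second

    if __name__ == "__main__":
        for n in (2, 4, 8, 16, 32, 64):
            print(n, "clients:", round(run(n)), "queries/s")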
    

The following things seem strange to me:

  1. One PG process uses almost 100% of a single CPU core, while the others use very little;
  2. There is NO I/O and NO swapping during the test, and plenty of memory/CPU/I/O resources are free/idle. It looks as if all the other workers are waiting for the master PG process (I guess it's the one using 100% CPU), which creates a bottleneck;
  3. The test DB takes about 4 GB of disk space, small enough to be held entirely in memory by PG. Indeed, as we observed, there is no I/O at all.

What is it doing? Why are all the workers waiting for a single process?

This is the main part of my PG configuration (I only list the lines I changed):

max_connections = 2048

shared_buffers = 64GB

huge_pages = on

temp_buffers = 256MB

max_prepared_transactions = 256

work_mem = 256MB

maintenance_work_mem = 16GB

autovacuum_work_mem = -1

dynamic_shared_memory_type = posix

effective_io_concurrency = 1000

maintenance_io_concurrency = 1000

max_worker_processes = 256

max_parallel_maintenance_workers = 256

max_parallel_workers_per_gather = 256

parallel_leader_participation = on

max_parallel_workers = 256

fsync = off

synchronous_commit = off

full_page_writes = off

wal_compression = on

wal_buffers = -1

wal_writer_delay = 10000ms

wal_writer_flush_after = 1GB

commit_delay = 100000

commit_siblings = 128

checkpoint_timeout = 1d

max_wal_size = 128GB

min_wal_size = 32GB

checkpoint_completion_target = 1.0

checkpoint_flush_after = 0

checkpoint_warning = 0

max_wal_senders = 0

seq_page_cost = 1.0

random_page_cost = 1.01

cpu_tuple_cost = 0.0025

cpu_index_tuple_cost = 0.001

cpu_operator_cost = 0.00125

effective_cache_size = 64GB

default_statistics_target = 1000

force_parallel_mode = on

autovacuum = on

And the outputs of top and iotop:

top - 16:38:16 up 4:09, 2 users, load average: 14.16, 9.14, 3.97

Tasks: 1581 total, 2 running, 1573 sleeping, 0 stopped, 6 zombie

%Cpu(s): 3.5 us, 4.3 sy, 0.0 ni, 92.1 id, 0.0 wa, 0.0 hi, 0.1 si, 0.0 st

GiB Mem : 125.6 total, 34.6 free, 82.9 used, 9.1 buff/cache

GiB Swap: 1.0 total, 1.0 free, 0.0 used. 42.6 avail Mem

PID USER PR NI VIRT RES SHR SWAP %CPU %MEM TIME+ COMMAND

31159 leon 20 0 4654.2m 105.1m 12.5m 0.0m S 152.7 0.1 7:09.93 load

3186 postgres 0 -20 66.7g 18.6m 16.2m 0.0m R 99.5 0.0 5:03.16 postgres #it's the master?

3192 postgres 0 -20 80.4m 6.8m 3.6m 0.0m S 8.2 0.0 0:24.97 postgres

32218 postgres 0 -20 66.7g 13.5m 9.9m 0.0m S 5.8 0.0 0:12.90 postgres

31217 postgres 0 -20 66.7g 13.4m 9.9m 0.0m S 5.3 0.0 0:12.74 postgres

31234 postgres 0 -20 66.7g 13.5m 9.9m 0.0m S 5.3 0.0 0:12.74 postgres

( many many postgres process... )

Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s

Current DISK READ: 0.00 B/s | Current DISK WRITE: 0.00 B/s

TID PRIO USER DISK READ DISK WRITE> COMMAND
 1 be/4 root 0.00 B/s 0.00 B/s init
 2 be/4 root 0.00 B/s 0.00 B/s [kthreadd]
 3 be/0 root 0.00 B/s 0.00 B/s [rcu_gp]
 4 be/0 root 0.00 B/s 0.00 B/s [rcu_par_gp]
 5 be/0 root 0.00 B/s 0.00 B/s [slub_flushwq]
asked Mar 23 at 10:27
  • "all workers are waiting for a single processes" -- what's the evidence of that? Commented Mar 23 at 12:06
  • Your question contains a lot of information, but lacks the crucial one: the problematic statement(s) and the EXPLAIN(ANALYZE, BUFFERS) output for them. Not all statements can use parallel query, and not all would benefit. Many of the configuration parameters you set are wrong and will result in worse performance. Commented Mar 24 at 7:11
  • @LaurenzAlbe Thank you for your help! I have added a little more info. Sorry, I have no experience with tuning databases. Commented Mar 24 at 9:53
  • With an index on (t_stamp, o_level), that query should take a few milliseconds. Are you sure that the other processes are waiting for that query? Perhaps they are just idle and have nothing to do. Can you describe the actual problem? Is it taking too long? Or are you only surprised that the machine is mostly idle? Also, I'll repeat: please show the output of EXPLAIN (ANALYZE, BUFFERS). Commented Mar 24 at 10:30
  • "6. Turn on force_parallel_mode": Never do that. It's deprecated in recent versions and was never meant to improve performance anyway. Commented Mar 24 at 12:15

1 Answer


I think you are suffering from a misconception. A single query like this will never keep a strong machine busy. One CPU core is processing and the others just stand by. Distributing the workload across several cores would actually make processing slower because of the overhead of parallelism.

You can see that from your execution plan:

  • All the actual work is done in the backward index scan which takes 0.03 milliseconds.

  • The query only takes 9 milliseconds because you turned on force_parallel_mode (which you should never ever do — this parameter is only interesting for PostgreSQL development and regression tests). Starting up the additional processes, creating shared memory segments to exchange data and other overhead consumed all that time.

If you want parallel processing with short statements in PostgreSQL, you have to open several database sessions and run statements on each of them. Effectively, you have to parallelize in your application.
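As a rough illustration (a minimal sketch only; the DSN, pool size, and key ranges are made up, and psycopg2 is assumed), parallelizing in the application means giving each concurrent task its own session, for example from a connection pool:

    # Sketch: parallelism in the application, one database session per thread.
    # The DSN, pool size and key ranges are placeholders.
    from concurrent.futures import ThreadPoolExecutor

    from psycopg2.pool import ThreadedConnectionPool

    N_SESSIONS = 32
    pool = ThreadedConnectionPool(1, N_SESSIONS, "dbname=test user=postgres")

    def fetch_range(lo, hi):
        conn = pool.getconn()             # dedicated session for this task
        try:
            with conn.cursor() as cur:
                cur.execute(
                    "SELECT f_olots, t_stamp, trd_sec FROM ob_lots"
                    " WHERE t_stamp BETWEEN %s AND %s"
                    " ORDER BY t_stamp DESC, o_level DESC LIMIT 4096",
                    (lo, hi),
                )
                return cur.fetchall()
        finally:
            conn.rollback()               # end the read-only transaction
            pool.putconn(conn)

    # Many short statements run concurrently, each on its own backend process.
    ranges = [(i * 4096, (i + 1) * 4096) for i in range(1000)]
    with ThreadPoolExecutor(max_workers=N_SESSIONS) as ex:
        results = list(ex.map(lambda r: fetch_range(*r), ranges))

With this pattern, the server runs one backend per session, so the concurrency comes from the number of sessions rather than from parallel query.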

For bulk loading to PostgreSQL, consider using COPY rather than INSERT. If you need to use INSERT, use prepared statements.
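A minimal sketch of such a bulk load (psycopg2 assumed; the DSN, file name, and column list are placeholders):

    # Sketch: bulk load with COPY rather than row-by-row INSERT
    # (psycopg2 assumed; DSN, file name and column list are placeholders).
    import psycopg2

    with psycopg2.connect("dbname=test user=postgres") as conn:
        with conn.cursor() as cur:
            # CSV columns: t_stamp,trd_sec,o_level,i_olots,f_olots
            with open("ob_lots.csv") as f:
                cur.copy_expert(
                    "COPY ob_lots (t_stamp, trd_sec, o_level, i_olots, f_olots)"
                    " FROM STDIN WITH (FORMAT csv)",
                    f,
                )
    # Exiting the connection block commits the transaction.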

Most of your parameter changes range from pointless to detrimental:

  • max_connections = 2048

    Those are way too many. Don't exceed your number of cores. Don't overload your machine.

  • shared_buffers = 64GB

    That won't help with loading data.

  • huge_pages = on

    That one is OK.

  • temp_buffers = 256MB

    Useless unless you are using temporary tables.

  • max_prepared_transactions = 256

    That is an evil setting and may bring your database down. Don't enable prepared transactions unless you absolutely must.

  • work_mem = 256MB

    Okay, but pointless with small statements.

  • maintenance_work_mem = 16GB

    Okay if you have enough RAM.

  • effective_io_concurrency = 1000

    Probably too high.

  • maintenance_io_concurrency = 1000

    Probably too high.

  • max_worker_processes = 256

    Pointless, if your statements are small.

  • max_parallel_maintenance_workers = 256

    Way too high.

  • max_parallel_workers_per_gather = 256

    Way too high.

  • max_parallel_workers = 256

    Probably too high.

  • fsync = off

    This will cause data corruption. Don't.

  • synchronous_commit = off

    This is okay, but only necessary if you have tiny transactions, which is an anti-pattern for bulk loading.

  • full_page_writes = off

    This will cause data corruption. Don't.

  • wal_compression = on

    This is okay, but only useful if the amount of WAL written is a bigger problem than CPU utilization.

  • wal_writer_delay = 10000ms

    Pointless.

  • wal_writer_flush_after = 1GB

    Won't make a difference.

  • commit_delay = 100000

    This will make your transactions very slow. Don't.

  • commit_siblings = 128

    This is completely ridiculous.

  • checkpoint_timeout = 1d

    Increasing this parameter may actually be a good idea, but one day is excessive.

  • max_wal_size = 128GB

    That one is actually a useful setting.

  • min_wal_size = 32GB

    Pointless.

  • checkpoint_completion_target = 1.0

    Probably pointless, might harm if it delays the next checkpoint.

  • checkpoint_flush_after = 0

    That will harm performance. Don't.

  • checkpoint_warning = 0

    Pointless.

  • max_wal_senders = 0

    Pointless.

  • random_page_cost = 1.01

    Good if you have SSDs or similar.

  • cpu_tuple_cost = 0.0025

    Don't mess with that.

  • cpu_index_tuple_cost = 0.001

    Don't mess with that.

  • cpu_operator_cost = 0.00125

    Don't mess with that.

  • effective_cache_size = 64GB

    Good!

  • default_statistics_target = 1000

    That will harm performance.

  • force_parallel_mode = on

    That will harm performance.

answered Mar 24 at 12:27
  • I need to squeeze every drop of potential out of the hardware. In fact I'm training an AI model, and the PG server is being used to store all the input features (it's NOT OLTP). The actual amount of feature data is in the tens of TBs. It is impossible to preload all the features into memory and then start training; I can only load a few batches at a time and feed them to the model, and so on. The more throughput, the faster the training. The biggest trouble I'm facing now is that loading the inputs is too slow, so the GPU is almost idle. Commented Mar 24 at 13:18
  • I know that in terms of the latency of a single query, it's not slow. But what I need is overall throughput, not a single query. Furthermore, I don't think it is normal that only one worker is exhausted while the others appear to be asleep, since I have started 64 or more clients concurrently to load data. What I'm ideally expecting is 256 workers serving 256 clients, because I have 512 virtual CPUs. Anyway, the phenomenon of one worker busy and the rest sleeping is abnormal, isn't it? Commented Mar 24 at 13:18
  • It is normal if you are running your workload on a single database session. Commented Mar 24 at 13:38
  • @Leon The database configuration is pretty bad; you should change that first. Such a small database with so much RAM will live in memory, just like an in-memory database. A Redis database with the wrong configuration will also be slow; changing the technology will not change that. Commented Mar 24 at 21:07
  • "I merely used a small part for testing": Hmm, this could lead to (very!) different results in production when you're working with large amounts of data. Advice: test with real data and not a small subset. Also improve your configuration before any testing. Commented Mar 24 at 23:45
