
Problem Symptom

High CPU usage by the postmaster child processes associated with client connections performing low-volume row INSERTs (rows are INSERTed roughly 25x slower than with COPY ... FROM STDIN for identical rows).

Background

I am trying to identify system/database configuration changes that alleviate the aforementioned poor INSERT performance. I am using a multi-threaded R script to process data and INSERT the results back into a PostgreSQL database. I profiled the R script to isolate the performance bottleneck to the DBI::dbBind() calls, while using top to monitor the postmaster child processes associated with the connections opened by the R child threads (see code below). During the INSERTs, the R child processes run mostly idle (presumably waiting for DBI::dbBind() to return), whereas the postmaster child processes consume 95-100% CPU on the cores on which they run for approximately 2-3 minutes.

System / environment:

  • postgresql version 10.3 (Fedora package 10.3-5.fc27)
  • uname -a : Linux localhost 4.16.6-202.fc27.x86_64 #1 SMP Wed May 2 00:09:32 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
  • /proc/cpuinfo : 16 processors (Intel(R) Xeon(R) CPU D-1541 @ 2.10GHz)
  • ulimit -a :

    core file size (blocks, -c) 0
    data seg size (kbytes, -d) unlimited
    scheduling priority (-e) 0
    file size (blocks, -f) unlimited
    pending signals (-i) 515220
    max locked memory (kbytes, -l) 64
    max memory size (kbytes, -m) unlimited
    open files (-n) 1024
    pipe size (512 bytes, -p) 8
    POSIX message queues (bytes, -q) 819200
    real-time priority (-r) 0
    stack size (kbytes, -s) 8192
    cpu time (seconds, -t) unlimited
    max user processes (-u) 515220
    virtual memory (kbytes, -v) unlimited
    file locks (-x) unlimited
    
  • /proc/meminfo (at arbitrary time when postmaster processing INSERT):

    MemTotal: 131923484 kB
    MemFree: 112894260 kB
    MemAvailable: 123181440 kB
    Buffers: 201220 kB
    Cached: 14932288 kB
    SwapCached: 0 kB
    ...
    Mlocked: 0 kB
    SwapTotal: 201326588 kB
    SwapFree: 201326588 kB
    Dirty: 3260 kB
    ...
    Shmem: 4251184 kB
    Slab: 1024344 kB
    SReclaimable: 658476 kB
    SUnreclaim: 365868 kB
    ...
    PageTables: 38436 kB
    ...
    CommitLimit: 267288328 kB
    Committed_AS: 35678744 kB
    VmallocTotal: 34359738367 kB
    VmallocUsed: 0 kB
    VmallocChunk: 0 kB
    HardwareCorrupted: 0 kB
    ...
    ShmemHugePages: 0 kB
    ShmemPmdMapped: 0 kB
    ...
    HugePages_Total: 0
    HugePages_Free: 0
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 2048 kB
    Hugetlb: 0 kB
    ...
    
  • Database tables used in testing reside on a pair of SSD drives in a RAID 10 configuration, with logging and WAL on separate drives. Since the baseline COPY ... FROM STDIN shows no write-performance issue (see below; the timing also includes the bash sync return time) and the CPUs do not generally appear to wait on I/O, I do not believe disk performance is the bottleneck.

Postgres configuration:

  • The following are the settings that are not commented out (i.e., generally non-default settings).
  • Some of these may appear a bit high; they reflect attempts to identify a helpful configuration change (please feel free to flag any "crazy" settings for my system, assuming the server primarily runs as a database server).

    max_connections = 20 # (change requires restart)
    shared_buffers = 32GB # min 128kB
    temp_buffers = 128MB # min 800kB
    max_prepared_transactions = 20 # Allow 'max_connections'
    work_mem = 16MB # min 64kB
    max_stack_depth = 6MB # min 100kB
    dynamic_shared_memory_type = posix # the default is the first option
    max_wal_size = 5GB
    checkpoint_flush_after = 1MB # measured in pages, 0 disables
    deadlock_timeout = 15s
    max_locks_per_transaction = 2096 # min 10
    
  • Also (temporarily) tried the following in /lib/systemd/system/postgresql.service.d/10-postgresql-unlimited.conf, to no effect (actually a small degradation of performance). [WARNING: for casual readers, the following settings are NOT recommended for use]

    # DO NOT USE THIS IN PRODUCTION (or elsewise)
    [Service]
    LimitDATA=infinity
    LimitFSIZE=infinity
    LimitLOCKS=infinity
    LimitMEMLOCK=infinity
    LimitMSGQUEUE=infinity
    LimitNPROC=infinity
    LimitNOFILE=infinity
    LimitSIGPENDING=infinity
    

Database profile:

  • The database cluster has only one user and is generally accessed by multi-threaded processes (between 1 and 3 threads used for testing, but up to 16 expected for automated processes, plus an additional 3-4 connections for administrators).
  • The testing database is initialized with table definitions but no rows. The definitions include 28 partitioned tables (using declarative partitioning) that lack trigger functions and have only limited foreign keys, constraints, and indices.
  • Each partitioned table has 180 partitions and between 42 and 81 columns (all of type double precision).

    NB: While these tables are wider than ideal and have more partitions than recommended, I have tried reducing the number of columns to 12 and the number of partitions to 30 and still see comparably poor INSERT performance.

Code (R):

  • Currently based on the PostgreSQL ODBC driver (the only one I could get working with multi-line prepared statements)

    library(odbc); 
    cdb <- DBI::dbConnect(drv=odbc::odbc(),driver="PostgreSQL",...);
    on.exit({DBI::dbDisconnect(cdb);}, add=TRUE);
    DBI::dbBegin(cdb);
    sth <- DBI::dbSendStatement(cdb,paste(rep("INSERT INTO <table> (<col_1>,...,<col_n>) VALUES (?,...,?);",100),collapse="")); # Bind up to 8,000 placeholders at a time (ref in source?) -- use 100 multi-line statements in this example
    DBI::dbBind(sth,bvallist); # Here, 'bvallist' is a list of values to bind in the multi-line prepared statement -- have tested and checked values in database are correct after INSERT
    num_recs <- DBI::dbGetRowsAffected(sth);
    DBI::dbClearResult(sth);
    DBI::dbCommit(cdb);
    
    • NB: I have also tested the following with similar performance: (A) prepared statements containing only a single INSERT (in both the odbc and RPostgres drivers), and (B) assembling the multi-line INSERT SQL string without prepared statements.
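    For illustration, the paste(rep(...), collapse="") call above concatenates 100 single-row INSERT statements, each with its own placeholder list, into one prepared string. A minimal Python sketch of the equivalent string assembly (table and column names here are hypothetical, not from my schema):

    ```python
    # Builds the same shape of SQL string the R code above prepares:
    # n_stmts single-row INSERT statements concatenated together.
    def build_multi_statement_sql(table, cols, n_stmts):
        placeholders = ",".join("?" * len(cols))      # "?,?,...,?"
        stmt = "INSERT INTO {} ({}) VALUES ({});".format(
            table, ",".join(cols), placeholders)
        return "".join(stmt for _ in range(n_stmts))

    # Example: 100 statements x 3 columns -> 300 placeholders total
    sql = build_multi_statement_sql("mytable", ["c1", "c2", "c3"], 100)
    ```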

Performance reports:

Baseline

  • Based on COPY ... FROM STDIN: a simple bash script containing the following command (one for each of the 28 tables): cat <tablename>.out | psql -c 'COPY <tablename> FROM STDIN;'
  • INSERTed 11,117 rows into each of the 28 partitioned tables in 14 sec => 22,234 rows/sec (or 4.5e-5 sec/row)
  • vmstat -wt 1 information for COPY ... FROM STDIN (CPU info based on 16 processors):

    procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu-------- -----timestamp-----
    r b swpd free buff cache si so bi bo in cs us sy id wa st EST
    2 0 0 113563384 200600 15965816 0 0 0 696 4178 1518 14 1 85 0 0 2018-11-27 14:29:19
    2 0 0 113562280 200600 15967184 0 0 0 1483 4781 1727 14 1 84 0 0 2018-11-27 14:29:20
    3 0 0 113564520 200604 15962952 0 0 0 1123 4847 1790 15 2 83 0 0 2018-11-27 14:29:21
    3 0 0 113565664 200604 15962020 0 0 0 800 4512 1646 15 2 84 0 0 2018-11-27 14:29:22
    >>>>>>>> START: COPY ... FROM STDIN bash script (Tue Nov 27 14:29:23 EST 2018)
    3 0 0 113566992 200604 15960036 0 0 0 760 4730 1655 14 2 84 0 0 2018-11-27 14:29:23
    3 1 0 113551160 200604 15961608 0 0 0 9704 5271 5025 15 3 82 1 0 2018-11-27 14:29:24
    3 0 0 113548176 200604 15966864 0 0 0 8764 6269 6335 19 3 77 1 0 2018-11-27 14:29:25
    3 0 0 113552104 200604 15969684 0 0 0 7744 6331 5648 19 3 77 1 0 2018-11-27 14:29:26
    1 0 0 113521248 200604 15975508 0 0 0 8624 3717 5478 9 2 89 0 0 2018-11-27 14:29:27
    2 0 0 113535536 200604 15976840 0 0 0 12563 4982 8784 11 3 86 1 0 2018-11-27 14:29:28
    3 0 0 113535640 200604 15978772 0 0 0 11223 5473 6182 12 3 84 1 0 2018-11-27 14:29:29
    2 0 0 113533576 200604 15977312 0 0 0 11180 5032 6443 12 3 85 0 0 2018-11-27 14:29:30
    2 0 0 113534384 200604 15978180 0 0 0 11169 4961 6511 12 3 86 0 0 2018-11-27 14:29:31
    2 0 0 113504656 200604 16004428 0 0 0 32691 4551 13584 10 3 84 2 0 2018-11-27 14:29:32
    2 0 0 113486672 200604 16023572 0 0 0 26133 4387 8803 10 3 86 2 0 2018-11-27 14:29:33
    4 0 0 113459744 200604 16033296 0 0 0 12535 5709 8188 17 3 80 1 0 2018-11-27 14:29:34
    1 0 0 113444128 200604 16057612 0 0 0 78953 8980 10186 12 4 82 2 0 2018-11-27 14:29:35
    1 1 0 113415520 200604 16087380 0 0 0 23640 5576 8781 15 3 80 1 0 2018-11-27 14:29:36
    3 0 0 113400864 200604 16108080 0 0 0 14733 3348 10182 6 2 90 1 0 2018-11-27 14:29:37
    <<<<<<<< END : COPY ... FROM STDIN bash script (Tue Nov 27 14:29:37 EST 2018)
    1 0 0 113393552 200632 16127372 0 0 0 56640 4456 4481 4 3 93 1 0 2018-11-27 14:29:38
    0 1 0 113392368 200632 16127680 0 0 0 1688 2446 1406 2 1 96 0 0 2018-11-27 14:29:39
    1 0 0 113391032 200632 16128472 0 0 0 1616 2372 1396 2 1 97 0 0 2018-11-27 14:29:40
    0 0 0 113389376 200632 16128440 0 0 0 1896 2474 1402 2 1 96 0 0 2018-11-27 14:29:41
    

Current

  • When binding a multi-line prepared statement within a single transaction block (see code), it takes 390 sec to INSERT the same rows into each of the 28 partitioned tables, for a total throughput of roughly 798 rows/sec (compared to the 22,234 rows/sec in the baseline, i.e., about 28x slower).
  • The max/avg/min values from vmstat -wt 1 during the INSERT resulting from DBI::dbBind() appear as follows:

    procs -----------------------memory---------------------- ---swap-- -----io---- --system-- --------cpu--------
    r b swpd free buff cache si so bi bo in cs us sy id wa st 
    6 1 0 112348416 200092 15661572 0 0 0 10090 12259 37796 32 3 81 4 0 # MAX
    3 0 0 111860781 199437 15608955 0 0 0 1708 5658 6371 21 1 78 0 0 # AVG
    3 0 0 111353344 199372 15572860 0 0 0 184 4115 1602 17 0 67 0 0 # MIN
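
    For reference, the quoted timings imply the following totals and rates (a quick sanity check of the roughly 28x slowdown, using the row counts from the baseline above):

    ```python
    # Throughput arithmetic from the figures reported above.
    rows_per_table, n_tables = 11117, 28
    total_rows = rows_per_table * n_tables    # 311,276 rows overall

    copy_rate = total_rows / 14               # COPY ... FROM STDIN: 14 s
    insert_rate = total_rows / 390            # prepared INSERT path: 390 s
    slowdown = copy_rate / insert_rate        # roughly 28x slower
    ```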
    

Other notes:

  • Performance is unaffected when running a single or multiple R child threads (i.e., one or multiple database client connections).
  • I have tried the RPostgres, RPostgreSQL, and odbc R packages as database drivers; while slight performance differences exist among them, no implementation-specific artifacts appear to contribute significantly to the processing time.
  • I welcome thoughts ahead of time on potential kernel settings, database configuration, systemd settings, etc.

Thank you in advance.

Update (2018年11月28日)

  • Created an SQL flat file with the INSERT statements produced by the R script (placed the statements within a BEGIN ... COMMIT transaction block) and loaded using psql -f <file>.sql -- no performance improvement.
  • Ran EXPLAIN ANALYZE VERBOSE on single INSERT, with following output:

    Insert on <tablename> (cost=0.00..0.26 rows=1 width=1952) (actual time=0.074..0.074 rows=0 loops=1)
     -> Result (cost=0.00..0.26 rows=1 width=1952) (actual time=0.006..0.007 rows=1 loops=1)
     Output: <row data>
    Planning time: 0.303 ms
    Trigger RI_ConstraintTrigger_c_274329 for constraint <foreign_key_constraint_name> on <child_tablename>: time=0.071 calls=1
    Execution time: 13.345 ms
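
    As a rough back-of-the-envelope check (assuming, contrary to what the comment discussion below suggests, that the 0.071 ms FK trigger cost were strictly per-row), the trigger alone would account for on the order of 22 seconds across the full load:

    ```python
    # Naive extrapolation of the per-call FK trigger time reported above.
    # NB: per the comment discussion, the cost does NOT actually scale
    # linearly per row, so this is only an order-of-magnitude estimate.
    trigger_ms_per_call = 0.071
    total_rows = 11117 * 28                   # rows across all 28 tables
    total_trigger_s = trigger_ms_per_call * total_rows / 1000.0
    ```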
    
asked Nov 28, 2018 at 2:56
  • If I read this correctly, the time is mostly dominated by the FK "trigger". Did you try how fast it is without this constraint? Commented Nov 28, 2018 at 6:06
  • Thank you for the comment! I noticed that as well, but wasn't certain whether the FK checks represented per-statement or per-row costs. After making the SQL statement change noted in the self-answer and re-running a bulk INSERT of 4,190 rows on a partitioned table, I found that the cost increased per row, but not linearly: I was able to INSERT the 4,190 rows at a total cost of 125 ms (involving the 180 child tables). I have a bit more profiling to do (and have yet to see whether the cost is manageable at scale) before deciding whether FK constraint check optimization is worthwhile. All the best. Commented Nov 28, 2018 at 19:05
  • Are all foreign key columns indexed? The usual reason for slow RI_ConstraintTriggers is that the FK columns aren't indexed, so every check does a Seq Scan on that table rather than an index lookup. Commented Nov 28, 2018 at 20:53
  • I believe so: the two foreign key columns are the primary key of the foreign table, so I believe the primary key automatically indexes the two columns as a group (if not, I may need to add an index). The foreign key constraint references both columns (no partial or per-column indices are defined). Is there a way to determine whether the planner selects a sequential scan vs. an index lookup? Commented Nov 28, 2018 at 23:31

1 Answer 1


A change to the prepared SQL statement resulted in a six-fold improvement.

SQL changed from:

 INSERT INTO <table> (<col_1>,...,<col_n>) VALUES (?,...,?);
 ...
 INSERT INTO <table> (<col_1>,...,<col_n>) VALUES (?,...,?);

to:

 INSERT INTO <table> (<col_1>,...,<col_n>) VALUES
 (?,...,?),
 ...
 (?,...,?);

In other words, I collapsed the multiple INSERT statements into a single multi-row statement, which reduced the time from 390 sec to 66 sec (roughly 4,700 rows/sec). While this is still on the slow side, I will need to re-profile the execution to see whether the remaining latency is application- or database-related.
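A minimal Python sketch of assembling the collapsed form (table and column names hypothetical): one INSERT statement with many parenthesized placeholder groups, instead of many separate statements:

```python
# Builds a single multi-row INSERT: VALUES (?,..),(?,..),...,(?,..);
def build_multi_row_sql(table, cols, n_rows):
    group = "(" + ",".join("?" * len(cols)) + ")"
    return ("INSERT INTO {} ({}) VALUES\n".format(table, ",".join(cols))
            + ",\n".join(group for _ in range(n_rows)) + ";")

# Example: one statement binding 100 rows x 3 columns
sql = build_multi_row_sql("mytable", ["c1", "c2", "c3"], 100)
```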

answered Nov 28, 2018 at 18:55
  • The true cause is probably a missing index on the source column of the foreign key reference. You didn't show the schema for the tables involved (including the foreign-key tables that refer to your table, or the foreign keys from your table to other tables), but I would guess that's the culprit. See dba.stackexchange.com/a/226481/29183 for a related example. Commented Jun 19, 2024 at 13:59
