We are currently hosting a PostgreSQL database on RDS, and our team is noticing a slowdown in our querying service. I'm seeing a spike in the CheckpointLag metric, and I've been tasked with finding where the slowdown occurs, specifically on the AWS side of things.
In monitoring detailed performance, we've seen that our load is well below (20%) the average active sessions (AAS) level we are expected to reach. I also ran the queries individually with EXPLAIN ANALYZE, and the most extreme query takes 0.5s to compute. This leads me to believe something else is taking too long.
The other metrics I checked (CPU, BurstBalance, etc.) all appear normal. The one exception is CheckpointLag, which spikes under use and which I can't find documentation on. I can't find what it means or what value is acceptable for a db.m4.xlarge. With no to low usage, it sits at ~140 seconds. Under normal, expected usage, it jumps to ~400 seconds.
I'm asking what this metric really means, whether these values are expected or normal, and whether there are other ways to check if my RDS instance is the cause of the slowdown.
EDIT:
CheckpointLag is defined as a metric here: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/rds-metrics.html with the description "The amount of time since the most recent checkpoint". I found that fairly vague and hard to decipher. My dashboards appear to pull from this predefined metric, but if there's a way to dive deeper into how it's querying the instance, please let me know.
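For anyone who wants the raw numbers rather than the console graph, here is a minimal sketch of pulling the same metric from CloudWatch with boto3. It assumes boto3 is installed and AWS credentials are configured; the function names and the instance identifier are hypothetical. It only reads the published metric and does not show how RDS computes it.

```python
from datetime import datetime, timedelta, timezone


def checkpoint_lag_request(instance_id, hours=3, period_seconds=60, now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/RDS",
        "MetricName": "CheckpointLag",
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": instance_id}],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": ["Average", "Maximum"],
    }


def fetch_checkpoint_lag(instance_id, **kwargs):
    """Return (timestamp, average, maximum) tuples sorted by time."""
    import boto3  # imported lazily so the request builder works without boto3

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(
        **checkpoint_lag_request(instance_id, **kwargs)
    )
    # CloudWatch does not guarantee datapoint order, so sort by timestamp.
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return [(p["Timestamp"], p["Average"], p["Maximum"]) for p in points]
```

Calling fetch_checkpoint_lag("my-db-instance") (a placeholder identifier) over a window that covers both idle and busy periods makes it easier to line the spikes up with query activity.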
Follow-Up: I ended up editing the queries to group results and reduce the number of rows exported at one time, as our team was querying way too many rows to begin with. With this, CheckpointLag went down, and I associated it with the time taken to either reach or perform queries on RDS (duh!), but I still have not pinpointed its exact meaning. There must have been some bottleneck in outputting all of the rows that caused the "lag" to rise...
1 Answer
According to the docs at https://www.postgresql.org/docs/current/sql-checkpoint.html, a checkpoint is a flush-to-disk operation in Postgres.
From that page:
CHECKPOINT - force a write-ahead log checkpoint
A checkpoint is a point in the write-ahead log sequence at which all data files have been updated to reflect the information in the log. All data files will be flushed to disk. Refer to Section 28.5 for more details about what happens during a checkpoint.
The CHECKPOINT command forces an immediate checkpoint when the command is issued, without waiting for a regular checkpoint scheduled by the system (controlled by the settings in Section 19.5.2). CHECKPOINT is not intended for use during normal operation.
If executed during recovery, the CHECKPOINT command will force a restartpoint (see Section 28.5) rather than writing a new checkpoint.
Only superusers or users with the privileges of the pg_checkpoint role can call CHECKPOINT.
So CheckpointLag is likely the amount of time the engine spends waiting for the disk to complete and acknowledge writes after it has issued an automatic or manual checkpoint.
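One way to test a reading of the metric is to compare it with what the instance itself reports. The sketch below takes the AWS description quoted in the question literally ("time since the most recent checkpoint") and computes that value from pg_control_checkpoint(), which exists in PostgreSQL 9.6 and later. Whether CloudWatch derives CheckpointLag this way is not documented, so treat this as a cross-check only. The psycopg2 driver, the helper names, and the assumption that your role may call the function on RDS are all mine.

```python
from datetime import datetime, timezone

# pg_control_checkpoint() reports details of the latest checkpoint recorded
# in the control file. Assumption: the connected role may call it on RDS.
LAST_CHECKPOINT_SQL = "SELECT checkpoint_time FROM pg_control_checkpoint()"


def seconds_since(checkpoint_time, now=None):
    """Seconds elapsed between a timezone-aware checkpoint time and now."""
    now = now or datetime.now(timezone.utc)
    return (now - checkpoint_time).total_seconds()


def current_checkpoint_age(dsn):
    """Query the instance and return seconds since its last checkpoint."""
    import psycopg2  # assumed driver; imported lazily so the helper above runs without it

    conn = psycopg2.connect(dsn)
    try:
        with conn.cursor() as cur:
            cur.execute(LAST_CHECKPOINT_SQL)
            (checkpoint_time,) = cur.fetchone()
    finally:
        conn.close()
    return seconds_since(checkpoint_time)
```

If the values from current_checkpoint_age track the CloudWatch graph, the metric is the age of the last checkpoint; if they diverge, the disk-wait reading above becomes more plausible.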