
We have an Aurora Serverless v2 PostgreSQL cluster set up with two instances (one reader/writer and one reader). Both instances are in the same AZ for now, although we plan to move one to a different AZ soon. We also use RDS Proxy for our Lambdas to connect to the database (so connections can be pooled and shared). The ACU range is set to 0.5 to 8.
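
For reference, this is roughly what that capacity range looks like when read back with boto3 (a minimal sketch, not our exact tooling; the cluster identifier "my-cluster" is a placeholder):

    import boto3

    rds = boto3.client("rds")

    # Read back the Serverless v2 scaling configuration for the cluster.
    resp = rds.describe_db_clusters(DBClusterIdentifier="my-cluster")
    scaling = resp["DBClusters"][0]["ServerlessV2ScalingConfiguration"]

    # For the setup described above this shows MinCapacity=0.5, MaxCapacity=8.0
    print(scaling["MinCapacity"], scaling["MaxCapacity"])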

We are currently perf-testing this setup for possible use in an application (replacing our current DynamoDB database). During these tests we write (update) about 3 million records while simultaneously measuring the performance of our API (REST GET calls) on a few of our main queries. The writes are very simple: a single small string field in a record is updated. The queries are also relatively simple (a few tables with simple joins, nothing complex) and are run 10 at a time asynchronously using JMeter; a rough sketch of the update side of the workload is below.
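
Sketch of the kind of update being run (table and column names are placeholders, not our real schema; the connection goes through the RDS Proxy endpoint):

    import os
    import psycopg2

    # Connect through the RDS Proxy endpoint rather than the cluster endpoint,
    # so the proxy can pool and reuse connections across Lambda invocations.
    conn = psycopg2.connect(
        host=os.environ["RDS_PROXY_ENDPOINT"],
        port=5432,
        dbname="test",
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
        connect_timeout=5,
    )

    def update_record(record_id: int, new_value: str) -> None:
        # A single small string field is updated per record; this runs
        # roughly 3 million times over the course of the test.
        with conn, conn.cursor() as cur:
            cur.execute(
                "UPDATE items SET status_note = %s WHERE id = %s",
                (new_value, record_id),
            )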

The problem is that we are getting intermittent restarts of one (and sometimes both) of the instances. During a restart the other instance takes over as expected, but this has a noticeable impact on our queries, which start to time out (15-second timeouts on the handling Lambdas) until the failed instance comes back online (usually a few minutes).

Here are some of the log outputs when one of these restarts happens:

postgres@test:[23685]:DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-06-16T13:48:48.000-03:00 2022-06-16 16:48:48 UTC:10.10.253.103(40115):rdsproxyadmin@postgres:[23355]:FATAL: the database system is starting up
2022-06-16T13:48:48.000-03:00 2022-06-16 16:48:48 UTC:10.10.254.218(49269):rdsproxyadmin@postgres:[23348]:FATAL: the database system is starting up
2022-06-16T13:48:48.000-03:00 2022-06-16 16:48:48 UTC:10.10.254.218(38211):rdsproxyadmin@postgres:[23354]:FATAL: the database system is starting up
2022-06-16T13:48:48.000-03:00 2022-06-16 16:48:48 UTC:[local]:rdsadmin@rdsadmin:[23352]:FATAL: the database system is starting up
2022-06-16T13:48:48.000-03:00 2022-06-16 16:48:48 UTC:[local]:rdsadmin@rdsadmin:[23347]:FATAL: the database system is starting up
2022-06-16T13:48:48.000-03:00 2022-06-16 16:48:48 UTC:10.10.253.103(32051):rdsproxyadmin@postgres:[23346]:FATAL: the database system is starting up
2022-06-16T13:48:48.000-03:00 2022-06-16 16:48:48 UTC::@:[23345]:LOG: database system was interrupted; last known up at 2022-06-16 16:48:21 UTC
2022-06-16T13:48:48.000-03:00 2022-06-16 16:48:48 UTC::@:[22011]:LOG: In Aurora for PostgreSQL read only hot standby mode. Database system is ready to accept read only connections.
2022-06-16T13:48:50.000-03:00 2022-06-16 16:48:50 UTC:10.10.253.103(7795):postgres@test:[24095]:LOG: unexpected EOF on client connection with an open transaction

We're very pleased with the performance so far, but this stability issue has us very concerned. Any insights are appreciated.

Thanks

asked Jun 16, 2022 at 17:25
  • Have you tried contacting AWS support? It's unlikely that anyone else can troubleshoot a bug in a proprietary application. Commented Jun 16, 2022 at 18:04
  • Those messages are responses to the restart. The cause should be farther back in the logs. Commented Jun 16, 2022 at 23:25

1 Answer


It turns out that our issue was related to our minimum ACU being set to 0.5. That floor did not provide enough CPU or memory capacity when write and read demand was high.

As per this AWS best-practices guide:

"If there is a sudden spike in requests, you can overwhelm the database. Aurora Serverless might not be able to find a scaling point and scale quickly enough due to a shortage of resources. This is especially true when your cluster is at 1 ACU capacity, which corresponds to approximately 2 GB of memory. Typically 1 ACU is not adequate for production workloads."

We have since rerun our perf test with a minimum of 6 and a maximum of 8 ACUs and had no issues. From here we'll fine-tune the range to minimize cost while keeping the cluster reliable; a sketch of the capacity change is below.
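
For anyone wanting to make the same capacity change programmatically, here's a boto3 sketch (the cluster identifier is a placeholder; the console works just as well):

    import boto3

    rds = boto3.client("rds")

    # Raise the Serverless v2 floor from 0.5 ACUs to 6 ACUs, keep the ceiling at 8.
    rds.modify_db_cluster(
        DBClusterIdentifier="my-cluster",
        ServerlessV2ScalingConfiguration={
            "MinCapacity": 6.0,
            "MaxCapacity": 8.0,
        },
        ApplyImmediately=True,
    )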

answered Jun 17, 2022 at 18:38
