
We have an Aurora Serverless v2 PostgreSQL cluster set up with two instances (one reader/writer and one reader). Both instances are in the same AZ for now, although we plan to move one to a different AZ soon. We also use RDS Proxy for our Lambdas to connect to the database (so connections can be pooled and shared). The ACU range is set to 0.5 to 8.
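
For reference, this is roughly what that capacity range looks like when read back with boto3 (a minimal sketch, not our exact tooling; the cluster identifier "my-cluster" is a placeholder):

    import boto3

    rds = boto3.client("rds")

    # Read back the Serverless v2 scaling configuration for the cluster.
    resp = rds.describe_db_clusters(DBClusterIdentifier="my-cluster")
    scaling = resp["DBClusters"][0]["ServerlessV2ScalingConfiguration"]

    # For the setup described above this shows MinCapacity=0.5, MaxCapacity=8.0
    print(scaling["MinCapacity"], scaling["MaxCapacity"])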

We are currently perf-testing this setup for possible use in an application (replacing our current DynamoDB database). During these tests we write (update) about 3 million records while simultaneously measuring the performance of our API (REST GET calls) on a few of our main queries. The writes are very simple: a single small string field in a record is updated. The queries are also relatively simple (a few tables with simple joins, nothing complex) and are run 10 at a time asynchronously using JMeter; a rough sketch of the update side of the workload is below.
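
Sketch of the kind of update being run (table and column names are placeholders, not our real schema; the connection goes through the RDS Proxy endpoint):

    import os
    import psycopg2

    # Connect through the RDS Proxy endpoint rather than the cluster endpoint,
    # so the proxy can pool and reuse connections across Lambda invocations.
    conn = psycopg2.connect(
        host=os.environ["RDS_PROXY_ENDPOINT"],
        port=5432,
        dbname="test",
        user=os.environ["DB_USER"],
        password=os.environ["DB_PASSWORD"],
        connect_timeout=5,
    )

    def update_record(record_id: int, new_value: str) -> None:
        # A single small string field is updated per record; this runs
        # roughly 3 million times over the course of the test.
        with conn, conn.cursor() as cur:
            cur.execute(
                "UPDATE items SET status_note = %s WHERE id = %s",
                (new_value, record_id),
            )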

The problem is that we are getting intermittent restarts of one (and sometimes both) of the instances. During a restart the other instance takes over as expected, but this has a noticeable impact on our queries, which start to time out (15-second timeouts on the handling Lambdas) until the failed instance comes back online (usually a few minutes).

Here are some of the log outputs when one of these restarts happens:

postgres@test:[23685]:DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-06-16T13:48:48.000-03:00 2022-06-16 16:48:48 UTC:10.10.253.103(40115):rdsproxyadmin@postgres:[23355]:FATAL: the database system is starting up
2022-06-16T13:48:48.000-03:00 2022-06-16 16:48:48 UTC:10.10.254.218(49269):rdsproxyadmin@postgres:[23348]:FATAL: the database system is starting up
2022-06-16T13:48:48.000-03:00 2022-06-16 16:48:48 UTC:10.10.254.218(38211):rdsproxyadmin@postgres:[23354]:FATAL: the database system is starting up
2022-06-16T13:48:48.000-03:00 2022-06-16 16:48:48 UTC:[local]:rdsadmin@rdsadmin:[23352]:FATAL: the database system is starting up
2022-06-16T13:48:48.000-03:00 2022-06-16 16:48:48 UTC:[local]:rdsadmin@rdsadmin:[23347]:FATAL: the database system is starting up
2022-06-16T13:48:48.000-03:00 2022-06-16 16:48:48 UTC:10.10.253.103(32051):rdsproxyadmin@postgres:[23346]:FATAL: the database system is starting up
2022-06-16T13:48:48.000-03:00 2022-06-16 16:48:48 UTC::@:[23345]:LOG: database system was interrupted; last known up at 2022-06-16 16:48:21 UTC
2022-06-16T13:48:48.000-03:00 2022-06-16 16:48:48 UTC::@:[22011]:LOG: In Aurora for PostgreSQL read only hot standby mode. Database system is ready to accept read only connections.
2022-06-16T13:48:50.000-03:00 2022-06-16 16:48:50 UTC:10.10.253.103(7795):postgres@test:[24095]:LOG: unexpected EOF on client connection with an open transaction

We're very pleased with the performance so far, but this stability issue has us very concerned. Any insights are appreciated.

Thanks

asked Jun 16, 2022 at 17:25
  • Have you tried contacting AWS support? It's unlikely that anyone else can troubleshoot a bug in a proprietary application. Commented Jun 16, 2022 at 18:04
  • Those messages are responses to the restart. The cause should be farther back in the logs. Commented Jun 16, 2022 at 23:25

1 Answer


It turns out that our issue was related to our minimum ACU being set to 0.5. That floor did not provide enough CPU or memory capacity when write and read demand was high.

As per this AWS best-practices guide:

"If there is a sudden spike in requests, you can overwhelm the database. Aurora Serverless might not be able to find a scaling point and scale quickly enough due to a shortage of resources. This is especially true when your cluster is at 1 ACU capacity, which corresponds to approximately 2 GB of memory. Typically 1 ACU is not adequate for production workloads."

We have since rerun our perf test with a minimum of 6 and a maximum of 8 ACUs and had no issues. From here we'll fine-tune the range to minimize cost while keeping the cluster reliable; a sketch of the capacity change is below.
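
For anyone wanting to make the same capacity change programmatically, here's a boto3 sketch (the cluster identifier is a placeholder; the console works just as well):

    import boto3

    rds = boto3.client("rds")

    # Raise the Serverless v2 floor from 0.5 ACUs to 6 ACUs, keep the ceiling at 8.
    rds.modify_db_cluster(
        DBClusterIdentifier="my-cluster",
        ServerlessV2ScalingConfiguration={
            "MinCapacity": 6.0,
            "MaxCapacity": 8.0,
        },
        ApplyImmediately=True,
    )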

answered Jun 17, 2022 at 18:38
