3 tips to manage large Postgres databases

Try these handy solutions to common problems when dealing with huge databases.

Green graph of measurements

Image by:

Internet Archive Book Images. Modified by Opensource.com. CC BY-SA 4.0

The relational database PostgreSQL (also known as Postgres) has grown increasingly popular, and enterprises and public sectors use it across the globe. With this widespread adoption, databases have become larger than ever. At Crunchy Data, we regularly work with databases north of 20TB, and our existing databases continue to grow. My colleague David Christensen and I have gathered some tips about managing a database with huge tables.

Big tables

Production databases commonly consist of many tables with varying data, sizes, and schemas. It's common to end up with a single huge and unruly database table, far larger than any other table in your database. This table often stores activity logs or time-stamped events and is necessary for your application or users.

Really large tables can cause challenges for many reasons, but a common one is locks. Regular maintenance on a table often requires locks, but locks on your large table can take down your application or cause a traffic jam and many headaches. I have a few tips for doing basic maintenance, like adding columns or indexes, while avoiding long-running locks.

Adding indexes problem: Index creation locks the table for the duration of the creation process. If you have a massive table, this can take hours.

CREATE INDEX ON customers (last_name)

Solution: Use the CREATE INDEX CONCURRENTLY feature. This approach splits up index creation into two parts, one with a brief lock to create the index that starts tracking changes immediately but minimizes application blockage, followed by a full build-out of the index, after which queries can start using it.

CREATE INDEX CONCURRENTLY ON customers (last_name)

Adding columns

Adding a column is a common request during the life of a database, but with a huge table, it can be tricky, again, due to locking.

Problem: When you add a new column with a default that calls a function, Postgres needs to rewrite the table. For big tables, this can take several hours.

Solution: Split up the operation into multiple steps with the total effect of the basic statement, but retain control of the timing of locks.

Add the column:

ALTER TABLE all_my_exes ADD COLUMN location text

Add the default:

ALTER TABLE all_my_exes ALTER COLUMN location
SET DEFAULT texas()

Use UPDATE to add the default:

UPDATE all_my_exes SET location = DEFAULT

Adding constraints

Problem: You want to add a check constraint for data validation. But if you use the straightforward approach to adding a constraint, it will lock the table while it validates all of the existing data in the table. Also, if there's an error at any point in the validation, it will roll back.

ALTER TABLE favorite_bands 
ADD CONSTRAINT name_check
CHECK (name = 'Led Zeppelin')

Skip to bottom of list

Open source and data science

What is data science?

What is Python?

Data scientist: A day in the life

Try OpenShift Data Science

MariaDB and MySQL cheat sheet

Latest data science articles

Solution: Tell Postgres about the constraint but don't validate it. Validate in a second step. This will take a short lock in the first step, ensuring that all new/modified rows will fit the constraint, then validate in a separate pass to confirm all existing data passes the constraint.

Tell Postgres about the constraint but do not to enforce it:

ALTER TABLE favorite_bands
ADD CONSTRAINT name_check
CHECK (name = 'Led Zeppelin') NOT VALID

Then VALIDATE it after it's created:

ALTER TABLE favorite_bands VALIDATE CONSTRAINT name_check

Hungry for more?

David Christensen and I will be in Pasadena, CA, at SCaLE's Postgres Days, March 9-10. Lots of great folks from the Postgres community will be there too. Join us!

Comments are closed.

These comments are closed.