I have an application (data is stored in PostgreSQL) where the majority of the fields in the tables are always not null, but the schema for these tables does not enforce this. For example, look at this fake table:
CREATE TABLE "tbl" (
"id" serial,
"name" varchar(40),
"num" int,
"time" timestamp
PRIMARY KEY ("id"),
UNIQUE ("id")
);
Also, while name, num, and time are not explicitly declared NOT NULL, in reality they are never null, because the enforcement happens on the application side.
My feeling is that this should be changed, but the counterpoint is that the application layer makes sure that null values can't appear here and no one else manually modifies the table.
My question is: what are the benefits (performance, storage, consistency, something else) and drawbacks of setting an explicit NOT NULL constraint, assuming that I have already verified that there are no nulls present at the moment and that, per the business logic, there should be no nulls?
We have a good code review process and reasonably good documentation, so the possibility that some new person would commit something that breaks this constraint is not really enough to justify the change.
This is not my decision, which is exactly why I am looking for other justifications. In my opinion, if something can't be null and the database allows you to specify that it is not null, then just do it. Especially if the change is super simple.
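For reference, the change itself would be something along these lines (just a sketch against the example table above, not something already applied):

ALTER TABLE "tbl"
    ALTER COLUMN "name" SET NOT NULL,
    ALTER COLUMN "num"  SET NOT NULL,
    ALTER COLUMN "time" SET NOT NULL;

PostgreSQL scans the table to verify there are no nulls, so on a big table expect a lock and a full pass, but it is a one-off statement.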
3 Answers
What happens when a new programmer arrives and has to write an app against that db? They don't know that field x has to be NOT NULL.

Another program might assume that all field x's are NOT NULL for performing counts, say, but some are now NULL because of the new program, leading to inconsistent and difficult-to-trace errors.
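As a hedged sketch of how that plays out (the table name and values below are invented for illustration):

-- the application "guarantees" x is never NULL...
CREATE TABLE readings (x int);
-- ...but a newer program slips one in
INSERT INTO readings VALUES (1), (2), (NULL);

-- the two counts silently disagree, and nothing errors out
SELECT count(*) AS all_rows,
       count(x) AS non_null_x   -- NULLs are not counted: 3 vs 2
FROM readings;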
IMHO it is always best to enforce data integrity rules as near to the data as possible, i.e. in the database. That way, new apps and/or programmers can't mess up your data.
Programmers, applications, languages and frameworks come and go. Data and databases tend to persist. The database is your last line of defence against inconsistent, potentially erroneous data.
Make maximum use of your database's integrity constraint enforcement mechanisms, even at the expense of performance. A slow system that produces correct results is infinitely superior to a fast one that gets things wrong!
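A minimal sketch of what "enforcing it in the database" buys you (illustrative names, not the OP's table): with the constraint declared, even code that bypasses the application layer cannot sneak a NULL in.

CREATE TABLE demo (name varchar(40) NOT NULL);

-- rejected by the database itself, regardless of what the application does
INSERT INTO demo (name) VALUES (NULL);
-- ERROR:  null value in column "name" ... violates not-null constraint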
- "IMHO it is always best to enforce data integrity rules as near to the data as possible" – this is actually the same as the gut feeling I wrote about, and it is exactly why I am looking for real justifications. We have code review in place and good documentation, so concerns about a new developer not knowing something are not enough to justify the change. – Salvador Dali, Jun 4, 2016
- Code reviews and good documentation do not guarantee you against (programming or other) errors. – ypercubeᵀᴹ, Jun 5, 2016
- And how many REAL PROGRAMMERS read all (or even any) of the documentation before getting stuck into a project where they're on a tight deadline? – Vérace, Jun 6, 2016
- I once did a review in a bank that had the same attitude for their data warehouse. In their case: no referential integrity. Well, it turned out 40% of the older data was garbage because someone had not read the documentation and deleted data in lookup tables. You do not trust code reviews and documentation with data integrity; you make it explicit in the database. – TomTom, Jun 6, 2016
As already mentioned by others in comments, adding NOT NULL to your table specification can significantly improve the performance of your queries (in addition to the very good methodological reasons stated in another answer).

The reason is that the query optimizer, knowing that a column cannot have a NULL value, can exclude special tests for such values, as in the NOT IN vs. NOT EXISTS case. You can see for instance this blog, where it is shown that not declaring a field NOT NULL (when the table always contains non-null values) increases the execution time of a certain query by 500%. The result is shown for SQL Server, but similar behaviour could be present in other relational DBMSs, like yours (not to mention that your database could be ported to other systems). A general rule you can assume is that the more information is available to the query optimizer, the more efficient the access plans it can produce.
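To make the NOT IN point concrete, here is a small sketch (the table names are invented): a single NULL in the subquery column changes the result of NOT IN entirely, which is why a nullable column has to be treated differently unless NOT NULL is declared.

CREATE TABLE t_nullable (v int);           -- v may contain NULL
CREATE TABLE t_notnull  (v int NOT NULL);  -- v can never be NULL

INSERT INTO t_nullable VALUES (1), (2), (NULL);
INSERT INTO t_notnull  VALUES (1), (2);

-- returns no row: 3 NOT IN (1, 2, NULL) evaluates to NULL, not true
SELECT 3 WHERE 3 NOT IN (SELECT v FROM t_nullable);

-- returns one row: with v declared NOT NULL that case cannot arise
SELECT 3 WHERE 3 NOT IN (SELECT v FROM t_notnull);

This is a semantic difference mandated by SQL; whether a particular optimizer exploits the constraint is a separate question (as the comments below discuss).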
- Thank you. This is the type of answer I was looking for. – Salvador Dali, Jun 5, 2016
- Columns that never contain NULL should be defined NOT NULL for multiple reasons, no argument about that. But the link to the blog about SQL Server is not applicable to Postgres and does not prove any of the performance implications you mention. Not saying there aren't any, but I would love to see actual evidence. – Erwin Brandstetter, Jun 5, 2016
- @ErwinBrandstetter, I had much higher expectations of the PostgreSQL optimizer :( After several tests I didn't find significant differences in the NOT IN query presented in the blog in PostgreSQL, with and without a NOT NULL constraint. So I changed the answer, and am asking you whether you think I should delete it altogether. – Renzo, Jun 6, 2016
- No, I don't think it should be deleted. It has 5+ votes and no downvote, for one. – ypercubeᵀᴹ, Jun 6, 2016
- The semantics of NOT IN for nullable columns are different though, so there must be some difference in the plan between the two? – Martin Smith, Jun 6, 2016
Space implications
The space implications are discussed in this post by @Erwin Brandstetter.

In short, you will save totalColumns - 8 bits, rounded up to the nearest byte (or MAXALIGN), if:

- your table has more than 8 columns, and
- ALL columns on the table are NOT NULL

(You can check this on your own data with the sketch below.)
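Here is a rough, hedged version of such a check (the table names and row counts are invented, not taken from the cited post): build a table with more than 8 columns where every row actually stores a NULL, copy the same data with the NULL filled in, and compare the relation sizes.

-- ten columns: every row stores a NULL, so each tuple carries a null bitmap
CREATE TABLE wide_nulls (c1 int, c2 int, c3 int, c4 int, c5 int,
                         c6 int, c7 int, c8 int, c9 int, c10 int);

INSERT INTO wide_nulls
SELECT g, g, g, g, g, g, g, g, g, NULL
FROM generate_series(1, 100000) AS g;

-- same data with the last column filled in: no null bitmap per tuple
CREATE TABLE wide_filled AS
SELECT c1, c2, c3, c4, c5, c6, c7, c8, c9, 0 AS c10
FROM wide_nulls;

SELECT pg_size_pretty(pg_relation_size('wide_nulls'))  AS with_null_bitmap,
       pg_size_pretty(pg_relation_size('wide_filled')) AS without_null_bitmap;

With more than 8 columns the null bitmap no longer fits into the otherwise-unused byte of the tuple header, so I would expect wide_nulls to come out somewhat larger; measure on your own schema before drawing conclusions.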
Performance implications
However, in this post on SE by @Erwin Brandstetter, he says
- "Setting NOT NULL has no effect per se on performance. A few cycles for the check - irrelevant."
- "... by actually using NULLs instead of dummy values. Depending on data types, you can save a lot of disk space and RAM, thereby speeding up .. everything."
@Renzo has an answer that talks about the performance implications -- I would assume none of that is applicable to PostgreSQL. I can't find anything that substantiates any of that as being relevant to PostgreSQL. Whatever cycles are saved cannot be quantified in even the most rudimentary query:
CREATE TABLE foo (
    a int,
    b int NOT NULL,
    x float,
    y float NOT NULL
);
INSERT INTO foo ( a, b, x, y )
SELECT x, x, x, x
FROM generate_series(1,1E7) AS X(x);
EXPLAIN ANALYZE SELECT 1/a FROM foo;
EXPLAIN ANALYZE SELECT 1/b FROM foo;
EXPLAIN ANALYZE SELECT 1/x FROM foo;
EXPLAIN ANALYZE SELECT 1/y FROM foo;
In addition I ran some tests to see if NULL indexes were ever faster, and I couldn't substantiate that. You can find this awesomely useful thread by Scott Marlowe on the mailing lists, which talks about the query planner in 9.1 being able to use a partial index on dissimilar WHERE clauses. I tested this by running the following:
DROP TABLE IF EXISTS foo;  -- foo was already created in the test above
CREATE TABLE foo ( a int );
CREATE TABLE bar ( a int NOT NULL );
INSERT INTO foo
SELECT null FROM generate_series(1,1e5) AS x
UNION ALL
SELECT 10
UNION ALL
SELECT null FROM generate_series(1,1e5) AS x
;
INSERT INTO bar
SELECT 0 FROM generate_series(1,1e5) AS x
UNION ALL
SELECT 10
UNION ALL
SELECT 0 FROM generate_series(1,1e5) AS x
;
Now I created the indexes:
CREATE INDEX foobar ON foo(a) WHERE a IS NOT NULL;
CREATE INDEX barbar ON bar(a) WHERE a <> 0;
In both of these cases the planner was able to use the index when selecting for = 10, and used a seq scan when searching for NULL or 0 respectively. Both partial indexes were the same size, and the full indexes (not shown) were the same size. Following the same methodology I loaded up the table with one sequence of 1..1e5, the one null/0 value, and another sequence of 1..1e5. Both methods were able to find the null/0 with an index covering the whole table.
TL;DR Summary
I can't substantiate anything one way or another on most of the performance concerns that I thought were worth testing for, including planner inadequacies. The benefit of using NULL to save RAM is real. The disk space saved by not using NULL is negligible, and even that is an overstatement on tables with only one NULLABLE column or fewer than 8 columns: in those cases there is no disk space saved at all.
NOT NULL constraints do not have any direct effect on storage size. Of course, with all columns being defined NOT NULL, there cannot be a null bitmap to begin with. On the other hand: storage size is typically much smaller if you use NULL instead of "empty" or dummy values for columns without actual value, because the null bitmap is comparatively much smaller (except for rare edge cases).
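To illustrate the quoted point about dummy values, a small hedged sketch (the placeholder values are invented): a NULL stores no bytes in the row's data area, while a dummy value does.

-- NULL stores nothing; dummy placeholders store real bytes
SELECT pg_column_size(ROW(1, NULL::text))              AS with_null,
       pg_column_size(ROW(1, 'n/a'::text))             AS with_dummy_text,
       pg_column_size(ROW(1, '1970-01-01'::timestamp)) AS with_dummy_timestamp;

The exact numbers depend on data types and alignment, but the NULL variant should come out smallest.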