I have an application (data is stored in PostgreSQL) where the majority of the fields in the tables are always not null, but the schema for these tables does not enforce this. For example, look at this fake table:
CREATE TABLE "tbl" (
"id" serial,
"name" varchar(40),
"num" int,
"time" timestamp
PRIMARY KEY ("id"),
UNIQUE ("id")
);
Also, while name, num, and time are not explicitly declared NOT NULL, in reality they are never null, because the enforcement happens on the application side.
My feeling is that this should be changed, but the counterpoint is that the application layer makes sure that null values can't appear here and no one else manually modifies the table.
My question is: what are the benefits (performance, storage, consistency, something else) and drawbacks of setting an explicit NOT NULL constraint, assuming that I have already verified that there are no nulls present at the moment and that, per the business logic, there should be no nulls?
We have a good code review process and reasonably good documentation, so the possibility that some new person would commit something that breaks this constraint is not really enough to justify the change.
This is not my decision, which is exactly why I am looking for other justifications. In my opinion, if something can't be null and the database allows you to specify that it is not null, then just do it. Especially if the change is super simple.
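For reference, the change itself would be something along these lines (just a sketch against the example table above, not something already applied):

ALTER TABLE "tbl"
    ALTER COLUMN "name" SET NOT NULL,
    ALTER COLUMN "num"  SET NOT NULL,
    ALTER COLUMN "time" SET NOT NULL;

PostgreSQL scans the table to verify there are no nulls, so on a big table expect a lock and a full pass, but it is a one-off statement.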
3 Answers
What happens when a new programmer arrives and has to write an app against that db? They don't know that field x has to be NOT NULL.

Another program might assume that all field x's are NOT NULL for performing counts, say, but some are now NULL because of the new program, leading to inconsistent and difficult-to-trace errors.
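As a hedged sketch of how that plays out (the table name and values below are invented for illustration):

-- the application "guarantees" x is never NULL...
CREATE TABLE readings (x int);
-- ...but a newer program slips one in
INSERT INTO readings VALUES (1), (2), (NULL);

-- the two counts silently disagree, and nothing errors out
SELECT count(*) AS all_rows,
       count(x) AS non_null_x   -- NULLs are not counted: 3 vs 2
FROM readings;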
IMHO it is always best to enforce data integrity rules as near to the data as possible, i.e. in the database. That way, new apps and/or programmers can't mess up your data.
Programmers, applications, languages and frameworks come and go. Data and databases tend to persist. The database is your last line of defence against inconsistent, potentially erroneous data.
Make maximum use of your database's integrity constraint enforcement mechanisms, even at the expense of performance. A slow system that produces correct results is infinitely superior to a fast one that gets things wrong!
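A minimal sketch of what "enforcing it in the database" buys you (illustrative names, not the OP's table): with the constraint declared, even code that bypasses the application layer cannot sneak a NULL in.

CREATE TABLE demo (name varchar(40) NOT NULL);

-- rejected by the database itself, regardless of what the application does
INSERT INTO demo (name) VALUES (NULL);
-- ERROR:  null value in column "name" ... violates not-null constraint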
- "IMHO it is always best to enforce data integrity rules as near to the data as possible" – this is actually the same as the gut feeling I wrote about, and it is exactly why I am looking for real justifications. We have code review in place and good documentation, so concerns about a new developer not knowing something are not enough to justify the change. – Salvador Dali, Jun 4, 2016
- Code reviews and good documentation do not guarantee you against (programming or other) errors. – ypercubeᵀᴹ, Jun 5, 2016
- And how many REAL PROGRAMMERS read all (or even any) of the documentation before getting stuck into a project where they're on a tight deadline? – Vérace, Jun 6, 2016
- I once did a review in a bank that had the same attitude for their data warehouse. In their case: no referential integrity. Well, it turned out 40% of the older data was garbage because someone had not read the documentation and deleted data in lookup tables. You do not trust code reviews and documentation with data integrity; you make it explicit in the database. – TomTom, Jun 6, 2016
As already mentioned by others in comments, adding NOT NULL to your table specification can significantly improve the performance of your queries (in addition to the very good methodological reasons stated in another answer).

The reason is that the query optimizer, knowing that a column cannot have a NULL value, can exclude special tests for such values, as in the NOT IN vs. NOT EXISTS case. You can see for instance this blog, where it is shown that not declaring a field NOT NULL (when the table always contains non-null values) increases the execution time of a certain query by 500%. The result is shown for SQL Server, but similar behaviour could be present in other relational DBMSs, like yours (not to mention that your database could be ported to other systems). A general rule you can assume is that the more information is available to the query optimizer, the more efficient the access plans it can produce.
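To make the NOT IN point concrete, here is a small sketch (the table names are invented): a single NULL in the subquery column changes the result of NOT IN entirely, which is why a nullable column has to be treated differently unless NOT NULL is declared.

CREATE TABLE t_nullable (v int);           -- v may contain NULL
CREATE TABLE t_notnull  (v int NOT NULL);  -- v can never be NULL

INSERT INTO t_nullable VALUES (1), (2), (NULL);
INSERT INTO t_notnull  VALUES (1), (2);

-- returns no row: 3 NOT IN (1, 2, NULL) evaluates to NULL, not true
SELECT 3 WHERE 3 NOT IN (SELECT v FROM t_nullable);

-- returns one row: with v declared NOT NULL that case cannot arise
SELECT 3 WHERE 3 NOT IN (SELECT v FROM t_notnull);

This is a semantic difference mandated by SQL; whether a particular optimizer exploits the constraint is a separate question (as the comments below discuss).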
- Thank you. This is the type of answer I was looking for. – Salvador Dali, Jun 5, 2016
- Columns that never contain NULL should be defined NOT NULL for multiple reasons, no argument about that. But the link to the blog about SQL Server is not applicable to Postgres and does not prove any of the performance implications you mention. Not saying there aren't any, but I would love to see actual evidence. – Erwin Brandstetter, Jun 5, 2016
- @ErwinBrandstetter, I had much higher expectations of the PostgreSQL optimizer :( After several tests I didn't find significant differences in the NOT IN query presented in the blog in PostgreSQL, with and without a NOT NULL constraint. So I changed the answer, and am asking you whether you think I should delete it altogether. – Renzo, Jun 6, 2016
- No, I don't think it should be deleted. It has 5+ votes and no downvote, for one. – ypercubeᵀᴹ, Jun 6, 2016
- The semantics of NOT IN for nullable columns are different though, so there must be some difference in the plan between the two? – Martin Smith, Jun 6, 2016
Space implications
The space implications are discussed in this post by @Erwin Brandstetter.

In short, you will save totalColumns - 8 bits, rounded up to the nearest byte (or MAXALIGN), if:

- your table has more than 8 columns, and
- ALL columns on the table are NOT NULL

(You can check this on your own data with the sketch below.)
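Here is a rough, hedged version of such a check (the table names and row counts are invented, not taken from the cited post): build a table with more than 8 columns where every row actually stores a NULL, copy the same data with the NULL filled in, and compare the relation sizes.

-- ten columns: every row stores a NULL, so each tuple carries a null bitmap
CREATE TABLE wide_nulls (c1 int, c2 int, c3 int, c4 int, c5 int,
                         c6 int, c7 int, c8 int, c9 int, c10 int);

INSERT INTO wide_nulls
SELECT g, g, g, g, g, g, g, g, g, NULL
FROM generate_series(1, 100000) AS g;

-- same data with the last column filled in: no null bitmap per tuple
CREATE TABLE wide_filled AS
SELECT c1, c2, c3, c4, c5, c6, c7, c8, c9, 0 AS c10
FROM wide_nulls;

SELECT pg_size_pretty(pg_relation_size('wide_nulls'))  AS with_null_bitmap,
       pg_size_pretty(pg_relation_size('wide_filled')) AS without_null_bitmap;

With more than 8 columns the null bitmap no longer fits into the otherwise-unused byte of the tuple header, so I would expect wide_nulls to come out somewhat larger; measure on your own schema before drawing conclusions.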
Performance implications
However, in this post on SE by @Erwin Brandstetter, he says
- "Setting NOT NULL has no effect per se on performance. A few cycles for the check - irrelevant."
- "... by actually using NULLs instead of dummy values. Depending on data types, you can save a lot of disk space and RAM, thereby speeding up .. everything."
@Renzo has an answer that talks about the performance implications -- I would assume none of that is applicable to PostgreSQL. I can't find anything that substantiates any of that as being relevant to PostgreSQL. Whatever cycles are saved cannot be quantified in even the most rudimentary query:
CREATE TABLE foo (
    a int,
    b int NOT NULL,
    x float,
    y float NOT NULL
);
INSERT INTO foo ( a, b, x, y )
SELECT x, x, x, x
FROM generate_series(1,1E7) AS X(x);
EXPLAIN ANALYZE SELECT 1/a FROM foo;
EXPLAIN ANALYZE SELECT 1/b FROM foo;
EXPLAIN ANALYZE SELECT 1/x FROM foo;
EXPLAIN ANALYZE SELECT 1/y FROM foo;
In addition I ran some tests to see if NULL indexes were ever faster, and I couldn't substantiate that. You can find this awesomely useful thread by Scott Marlowe on the mailing lists, which talks about the query planner in 9.1 being able to use a partial index on dissimilar WHERE clauses. I tested this by running the following:
DROP TABLE IF EXISTS foo;  -- foo was already created in the test above
CREATE TABLE foo ( a int );
CREATE TABLE bar ( a int NOT NULL );
INSERT INTO foo
SELECT null FROM generate_series(1,1e5) AS x
UNION ALL
SELECT 10
UNION ALL
SELECT null FROM generate_series(1,1e5) AS x
;
INSERT INTO bar
SELECT 0 FROM generate_series(1,1e5) AS x
UNION ALL
SELECT 10
UNION ALL
SELECT 0 FROM generate_series(1,1e5) AS x
;
Now I created the indexes:
CREATE INDEX foobar ON foo(a) WHERE a IS NOT NULL;
CREATE INDEX barbar ON bar(a) WHERE a <> 0;
In both of these cases the planner was able to use the index when selecting for = 10, and used a seq scan when searching for NULL or 0 respectively. Both partial indexes were the same size, and the full indexes (not shown) were the same size. Following the same methodology I loaded up the table with one sequence of 1..1e5, the one null/0 value, and another sequence of 1..1e5. Both methods were able to find the null/0 with an index covering the whole table.
TL;DR Summary
I can't substantiate anything one way or another on most of the performance concerns that I thought were worth testing for, including planner inadequacies. The benefit of using NULL to save RAM is real. The disk space saved by not using NULL is negligible, and even that is an overstatement on tables with only one NULLABLE column or fewer than 8 columns: in those cases there is no disk space saved at all.
NOT NULL constraints do not have any direct effect on storage size. Of course, with all columns being defined NOT NULL, there cannot be a null bitmap to begin with. On the other hand: storage size is typically much smaller if you use NULL instead of "empty" or dummy values for columns without actual value, because the null bitmap is comparatively much smaller (except for rare edge cases).
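To illustrate the quoted point about dummy values, a small hedged sketch (the placeholder values are invented): a NULL stores no bytes in the row's data area, while a dummy value does.

-- NULL stores nothing; dummy placeholders store real bytes
SELECT pg_column_size(ROW(1, NULL::text))              AS with_null,
       pg_column_size(ROW(1, 'n/a'::text))             AS with_dummy_text,
       pg_column_size(ROW(1, '1970-01-01'::timestamp)) AS with_dummy_timestamp;

The exact numbers depend on data types and alignment, but the NULL variant should come out smallest.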