I've been browsing SQL dumps of some famous CMSes, including Drupal 7, Wordpress (some quite very old version), and some custom application based on Python.
All of these dumps contained data with string flags instead of integer ones. For example, a post's status was represented as published
, closed
, or inherit
rather than 1
, 2
, or 3
.
I have quite limited experience in design of databases and I have never went past simple SQLs, but I was always taught that I should use numeric/integer flags for data like this. It's obvious that tinyint
consumes much less space in a database than, for example, varchar(9)
.
So what am I missing? Isn't this a waste of data storage and a data redundancy? Wouldn't browsing, searching and indexing be a bit faster if these columns used integers instead of strings?
2 Answers 2
Yes, storing strings instead of numbers can use more space. The reason that high-profile pltforms are doing it anyway is that they think the benefits of that solution are greater than the cost.
What are the benefits? You can easily read a database dump and understand what it's about without memorizing the enum tables, and even semi-official GUIs might simply use the values themeselves rather than transform the record they get. (This is a basic form of disk space/processing time tradeoff.)
What about the cost? Data storage capacity hasn't been the bottleneck in CMS for a long time, since disks have gotten so large and so cheap. Programmer time, on the other hand, usually becomes more expensive - so anything that trades development effort for disk space is also a good thing, from a business perspective.
Yes, storing things such as yes
or true
will take more space than a tinyint. This should not be surprising. It also makes indexing and thus joins less efficient for the database. It also has the penalty of possible confusion for what is the correct value (yes
vs y
).
However, there are many approaches that look similar to storing strings in the database (in particular MySQL) that are efficient.
First, MySQL has an enum
type (docs) which can look very much like a Boolean or restricted set of strings when set up that way. It also enforces only valid values are entered. This is often much more useful than storing 1
, 2
or 3
as a value as the meaning is conveyed with the information. The enum comes with the penalty that a schema change is required to add or remove types.
This brings us to a child table and foreign keys (applicable to all databases). Yes, you are storing some value as a key (back to the 1
, 2
or 3
) and the value of published
, closed
, and inherit
are stored in another table. Using a view (docs) it is then possible to make it look like the table contains the string rather than the key. This has the advantage that no schema change is required to add or remove entries from the child table.
Exactly how things are stored would require one to look at the actual DDL of the schema to determine what method is used and get some hint of which trade offs they have selected.
Explore related questions
See similar questions with these tags.
char
s? How inefficient!" is outdated compared to 2015.