I know this question has probably been asked several times before, but my question is specific to my situation.
I am implementing a notification micro service. It has an email
table and a queue
table. The service is exposed using a REST API. Each email can have multiple to
addresses and cc
addresses. The queue has an email ID. Once the email is sent, the row is removed from the queue.
At this time, there is no requirement to query on who sent what emails or whether a email for a particular user with a given email address has been sent. There are multiple ideas on how to/cc email addresses can be stored.
- Store them in JSON format in a address field of the email table:
{
toList: ['h@h','g@g'],
ccList: ['a@b','c@d']
}
I considered using the JSON data type of MySQL 5.7, but it's not supported in h2db which we use for integration tests.
Store them as comma separated values in
toList
,ccList
columns in theemail
table.Store them in a relational table such as
email
,emailaddress
,emailaddresstype
where theemailaddress
table would contain theemailid
, email address and theemailaddressstypeid
. This results in several records inemailaddress
table and also multiple joins.
What would be the best solution to implement in this case? I am generally against violating FNF and I'm not sure what it means to have so many records in an emailaddress
table such as a record for every email sent to, for example, 2 to 3 people. There is no user
table in this system, if that matters.
-
5What's your argument for not normalizing your data? You seem to imply it may be less efficient, but it might be worth reminding you that it's the job of RDBMS to manage relational data. I don't see any strong argument other than you think it may be more efficient to do it that way. The canonical answer to "is it good to store comma-separated values in a single column" is no, unless you know why you need to be doing so.Vincent Savard– Vincent Savard03/22/2017 19:27:23Commented Mar 22, 2017 at 19:27
-
Honestly, that was argument from a team member and I am against that. I like to follow FNF.TechCrunch– TechCrunch03/22/2017 19:28:09Commented Mar 22, 2017 at 19:28
-
1By the way, "I considered using the JSON data type of MySQL 5.7 but its not supported in h2db which we use for integration tests." is a bit concerning. Why would you not use the same RDBMS in your integration tests?Vincent Savard– Vincent Savard03/22/2017 19:30:48Commented Mar 22, 2017 at 19:30
-
Because the tests can be run faster on any machine with out dependencies. There are e2e tests run on SUT that interacts with real db.TechCrunch– TechCrunch03/22/2017 19:34:53Commented Mar 22, 2017 at 19:34
-
3Your situation is in no way different from those of the many previous questions. Simulating collections by storing serialized lists in single columns is a terrible idea in general, and it's a terrible idea in this example as well.Kilian Foth– Kilian Foth03/22/2017 20:37:59Commented Mar 22, 2017 at 20:37
4 Answers 4
I suppose it really depends on what you plan to do with the data. If email addresses are nothing more than lists of strings the you will retrieve and manipulate somewhere else, then comma-separate strings in separate columns ("to", "cc", "bcc") is probably fine.
If you think you might want to query at some point and ask "who was CC'd on Email #1234 but was NOT CC'd on the reply message (#5678)?" it might be difficult to do that with simple comma-separated lists. In that case, you should go with separate tables to properly store the email addresses.
-
1A frequent problem with persisting data in raw format is the deserialization. If by any reason the format change, the code will have to be backward compatible. So it could end up with more code to maintain.Laiv– Laiv03/22/2017 20:51:41Commented Mar 22, 2017 at 20:51
-
1Personally, I prefer to future proof as much as possible, so having a separate table detailing the lists of recipients is a FAR preferable solution, as it allows far more complex queries to be used in the future when requirements change and the business decides they NEED to be able to query "who was CC'd on Email #1234 but was NOT CC'd on the reply message (#5678)?"Maybe_Factor– Maybe_Factor03/23/2017 02:33:50Commented Mar 23, 2017 at 2:33
Is it good to store comma-separated values in single column
Generally speaking, No.
Accessing delimited data values embedded within a larger field generally results in [very] poor application performance. It almost always results in "Table Scans" because indexes can't be meaningfully used.
At this time, there is no requirement to query on who sent what emails or whether a email for a particular user with given email address is sent.
Lucky you.
Wait until you get into a dispute with a user who insists that they're not getting the emails that you [claim to] have sent them, at which point you'll have to start querying it. OK, if those queries table-scan it's not the end of the world if you only have to run a few of them but as soon a this sort of thing becomes common-place, you'll need to move to a more sensible data structure.
OK, preparing for this sort of usage flies in the face of the Developer's "YAGNI" mantra but, all too often, DBAs find themselves having to take a far longer-term view and making sure the data structures that they implement are as flexible as possible; sensible Normalisation is a big step in that direction.
You should store the data in normalized form (option 3) unless you have a compelling reason not to. Non-normalized data as in option 1 and 2 have a number of drawbacks: risk of inconsistencies, difficulties in parsing and querying and so on.
You are making the system more complex and brittle, so you need a good reason to do so. Some legitimate reasons might be caching complex data structures for performance reasons or storing arbitrary data which is opaque to the application.
But you don't mention any particular reasons justifying non-normalized data, so go with 3.
(You suggest a drawback of option 3 is more tables and more joins. But this is a red herring, since you have the same amount of data and underlying structure with the other options, it is just stored in a less convenient and optimized way.)
The first antipattern that Bill Karwin discusses in his book "SQL Antipatterns" is Jaywalking (because an intersection is avoided). The antipattern specifically lists out the problems associated with storing comma separated values when you need to use them separately (for instance, querying the table becomes complicated). The solution offered in the book chapter is straight forward - create an intersection table.
Explore related questions
See similar questions with these tags.