Duplicating a key in a mapping table instead of joining

Question 1

I am reviewing a schema that includes a mapping table of student enrollments. I am wondering whether adding an extra column (class_category_id) to the mapping table would cause problems or bring a noticeable advantage. That extra key would be used for filtering very often.

Here's a simplified structure of the database:

Class categories

id  name
1   Math
2   Science

Classes

id  name  class_category_id
1   M101  1
2   M102  1
3   B101  2
4   P101  2

Student enrollments

id  student_id  class_id  *class_category_id*
1   1001        1         1
2   1002        1         1
3   1003        3         2
4   1004        4         2

Common queries would include filtering the enrollments by the class category, without the actual need to get the class information itself.
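To make the comparison concrete, here is a minimal sketch of the structure above using Python's built-in sqlite3 module. The table names (class_categories, classes, student_enrollments) and column types are assumptions derived from the headings; the real schema may differ. It runs the category filter both ways: directly on the duplicated column, and through a join to classes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE class_categories (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE classes (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    class_category_id INTEGER NOT NULL REFERENCES class_categories(id)
);
-- Denormalized variant: class_category_id is copied from classes.
CREATE TABLE student_enrollments (
    id INTEGER PRIMARY KEY,
    student_id INTEGER NOT NULL,
    class_id INTEGER NOT NULL REFERENCES classes(id),
    class_category_id INTEGER NOT NULL REFERENCES class_categories(id)
);
INSERT INTO class_categories VALUES (1, 'Math'), (2, 'Science');
INSERT INTO classes VALUES
    (1, 'M101', 1), (2, 'M102', 1), (3, 'B101', 2), (4, 'P101', 2);
INSERT INTO student_enrollments VALUES
    (1, 1001, 1, 1), (2, 1002, 1, 1), (3, 1003, 3, 2), (4, 1004, 4, 2);
""")

# Filter on the duplicated column: no join needed.
denormalized = conn.execute(
    "SELECT id, student_id FROM student_enrollments "
    "WHERE class_category_id = ? ORDER BY id",
    (2,),
).fetchall()

# The same filter going through classes (works without the extra column).
joined = conn.execute(
    "SELECT e.id, e.student_id FROM student_enrollments e "
    "JOIN classes c ON c.id = e.class_id "
    "WHERE c.class_category_id = ? ORDER BY e.id",
    (2,),
).fetchall()

print(denormalized)  # [(3, 1003), (4, 1004)]
print(joined)        # [(3, 1003), (4, 1004)]
```

Both queries return the same rows as long as the copied column agrees with classes; the question is whether skipping the join is worth maintaining that agreement.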

It is not clear to me what advantages or disadvantages the class_category_id column would bring.

One important note: the category of a class will never change, so there would never be a need to update multiple tables to keep them in sync.

EDIT: Small note: the real table structure is equivalent to this one, but with many more columns (in the non-mapping tables), and it has nothing to do with classes or students.

Comment

Things that never change do have a tendency to change surprisingly often. Have you considered select id, student_id from studentenrollment where class_id in (select id from class where class_category_id = 2) as a filtering method?
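As a runnable sketch of that subquery filter, the snippet below uses Python's sqlite3 module with the table names from the query above (studentenrollment, class). The column types and sample rows are assumptions taken from the question's example data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE class (
    id INTEGER PRIMARY KEY,
    name TEXT,
    class_category_id INTEGER
);
-- Normalized mapping table: no class_category_id column here.
CREATE TABLE studentenrollment (
    id INTEGER PRIMARY KEY,
    student_id INTEGER,
    class_id INTEGER REFERENCES class(id)
);
INSERT INTO class VALUES
    (1, 'M101', 1), (2, 'M102', 1), (3, 'B101', 2), (4, 'P101', 2);
INSERT INTO studentenrollment VALUES
    (1, 1001, 1), (2, 1002, 1), (3, 1003, 3), (4, 1004, 4);
""")

# Filter enrollments by category through a subquery on class.
rows = conn.execute(
    "SELECT id, student_id FROM studentenrollment "
    "WHERE class_id IN (SELECT id FROM class WHERE class_category_id = ?) "
    "ORDER BY id",
    (2,),
).fetchall()

print(rows)  # [(3, 1003), (4, 1004)]
```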

Answer

Adding that column would be a violation of second normal form. Under 2NF, every non-key attribute must depend on the whole of a candidate key, not just part of it. Here the natural key of an enrollment is (student_id, class_id), and the category depends only on class_id: it is an attribute of class, not enrollment.

That being said, it is not completely uncommon to denormalize tables for performance reasons. If you think this change will be a huge benefit then it is not necessarily "evil" to do it.

Here are some of the problems that could come up with this sort of change, for you to think about.

The size of the database will be larger, since you are storing the same data in two places.
The amount of I/O required to retrieve data may be higher on average, because your working set will be larger and because fewer rows will fit on a data page. This can affect performance.
If you decide to index the class table by the category ID, any queries that use the enrollment table will not benefit from this index. You would need a separate index on enrollment, which will consume more space and slow down every insert into enrollment.
Someone could make a mistake and put a different category ID in enrollment than in class.
Presumably, class may end up being static and entirely cached, so it may not be very expensive to join to it and retrieve the class category ID. On the other hand, enrollment will be much larger and in constant flux, so it will benefit far less from caching. Again, this could affect performance.
If classes are re-assigned to different categories (e.g. if the Greek Language department closes and all of its classes are moved to the Ancient Languages department) you will have data cleanup to do.
If the category structure changes (e.g. some day they decide a class could belong to two or more categories) then your table structure will not be forward compatible with the new relationship, in which one class maps to many categories.
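If you do denormalize, the mismatch risk in the list above can be reduced by letting the database enforce consistency. The sketch below is one common technique, not something the original schema is known to use: a composite foreign key from enrollment to (id, class_category_id) on class, backed by a UNIQUE constraint. Table, column, and index names are assumptions, and SQLite is used only because it ships with Python. It also creates the separate index on enrollment mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite does not enforce foreign keys unless this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE classes (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    class_category_id INTEGER NOT NULL,
    UNIQUE (id, class_category_id)  -- target for the composite foreign key
);
CREATE TABLE student_enrollments (
    id INTEGER PRIMARY KEY,
    student_id INTEGER NOT NULL,
    class_id INTEGER NOT NULL,
    class_category_id INTEGER NOT NULL,
    FOREIGN KEY (class_id, class_category_id)
        REFERENCES classes (id, class_category_id)
);
-- Separate index needed for filtering enrollments by category
-- (hypothetical name); it costs space and slows inserts.
CREATE INDEX idx_enrollments_category
    ON student_enrollments (class_category_id);
INSERT INTO classes VALUES (1, 'M101', 1), (3, 'B101', 2);
""")

# Consistent row: class 1 really is in category 1.
conn.execute("INSERT INTO student_enrollments VALUES (1, 1001, 1, 1)")

# Inconsistent row: class 3 is in category 2, not 1, so this is rejected.
try:
    conn.execute("INSERT INTO student_enrollments VALUES (2, 1003, 3, 1)")
    mismatch_rejected = False
except sqlite3.IntegrityError:
    mismatch_rejected = True

row_count = conn.execute(
    "SELECT COUNT(*) FROM student_enrollments"
).fetchone()[0]
print(mismatch_rejected, row_count)  # True 1
```

This guards against typos, but it does not remove the other costs: the extra column and index still take space, and re-categorizing a class still means touching every matching enrollment row.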

If you are simply doing all this because you don't want to type JOIN all the time, consider creating a View instead. You can pre-join as many tables as you want in the view, then use the view in your FROM clause instead.
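A minimal sketch of the view approach, again using Python's sqlite3 module. The view name enrollments_with_category and the table names are hypothetical; the sample rows come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE classes (
    id INTEGER PRIMARY KEY,
    name TEXT,
    class_category_id INTEGER
);
CREATE TABLE student_enrollments (
    id INTEGER PRIMARY KEY,
    student_id INTEGER,
    class_id INTEGER REFERENCES classes(id)
);
-- The view pre-joins the tables; the category is stored only in classes.
CREATE VIEW enrollments_with_category AS
    SELECT e.id, e.student_id, e.class_id, c.class_category_id
    FROM student_enrollments e
    JOIN classes c ON c.id = e.class_id;
INSERT INTO classes VALUES
    (1, 'M101', 1), (2, 'M102', 1), (3, 'B101', 2), (4, 'P101', 2);
INSERT INTO student_enrollments VALUES
    (1, 1001, 1), (2, 1002, 1), (3, 1003, 3), (4, 1004, 4);
""")

# Query the view as if the category column lived on the enrollment table.
math_rows = conn.execute(
    "SELECT id, student_id FROM enrollments_with_category "
    "WHERE class_category_id = ? ORDER BY id",
    (1,),
).fetchall()

print(math_rows)  # [(1, 1001), (2, 1002)]
```

Queries read as if the column were on the mapping table, while the data stays normalized. The join still happens at query time, so this saves typing rather than work.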

John Wu · Accepted Answer · 2017-02-21 02:19:14Z

