I want to create a table to store user ratings of items for collaborative filtering algorithms. So far my table looks like this:
IID | UID | Rating
that is, item ID | user ID | rating.
In examples of collaborative filtering algorithms, however, the tables look more like this:
Item ID | Rating U1 | Rating U2 | ... | Rating Un
This means I would have to alter (extend) the table every time a new user signs up. Alternatively, I would have to create a new table for each user. Neither seems to be a very efficient solution. Can anyone point me in the right direction?
Thanks
-
Why do you need so many rating columns? Are you thinking of one rating column per user? – vijayp, Sep 7, 2014 at 12:13
-
To me, a table with Item ID and User 1 ... User n seems the most efficient in terms of space. Otherwise you have lots of duplicates in the IID column, though you would not have to create a new column for every user. – cactus_amigo, Sep 7, 2014 at 12:29
2 Answers
The first design is normalized, the second design is in 0NF (un-normalized). You haven't stated any reason to denormalize and you should never denormalize without a good reason.
Here is why the second design is very bad:
Think about what your queries to calculate an average rating would look like in each case.
Think about what you're going to do when your number of users exceeds your DBMS' limit on the number of columns per table.
You stated in a comment that you are concerned about duplication of data because of foreign key values. But how much space are you wasting by having a sparsely populated table? If you have 1,000 users and, on average, only 10 of them have rated each item, then you're wasting 990 bits of null flags per item.
Constantly changing your table width as users enroll can be a big performance hit because of having to move chunks of data around when the new column doesn't fit in the available free space.
You asked for best practice. Best practice for transactional data is third normal form (or higher, if applicable).
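To make the first point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and sample rows are made up for illustration (the question only specifies IID | UID | Rating). It shows that the average-rating query on the normalized table is a single GROUP BY, and that the item-by-user matrix from the algorithm examples can be built in memory when needed rather than stored as the table layout.

```python
import sqlite3

# Hypothetical names: the question only gives IID | UID | Rating.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ratings (item_id INTEGER, user_id INTEGER, rating REAL)"
)
conn.executemany(
    "INSERT INTO ratings (item_id, user_id, rating) VALUES (?, ?, ?)",
    [(1, 1, 4.0), (1, 2, 2.0), (2, 1, 5.0), (2, 3, 3.0), (2, 4, 4.0)],
)

# Average rating per item: one GROUP BY, however many users exist.
averages = dict(
    conn.execute(
        "SELECT item_id, AVG(rating) FROM ratings GROUP BY item_id"
    ).fetchall()
)

# Pivot the rows into an item-by-user matrix in application code.
# Missing ratings become None instead of NULL columns in the schema.
rows = conn.execute("SELECT item_id, user_id, rating FROM ratings").fetchall()
users = sorted({user for _, user, _ in rows})
items = sorted({item for item, _, _ in rows})
cell = {(item, user): rating for item, user, rating in rows}
matrix = [[cell.get((item, user)) for user in users] for item in items]
```

With the wide design, the same average would need an expression that names every user column and has to be rewritten whenever a user signs up.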
The first design is the better one:
item ID | user ID | rating
Reasons:
You don't need to change the table's schema, or its indexes, for each new user.
Performance-wise, it offers better indexing options.
Tables with fewer columns are easier to manage.
DML operations are easy.
It provides flexibility for future changes and requirements.
Each user rating needs only an insert query, and selecting is easy too.
The combination of UserId and ItemId will be unique and can be used to create a constraint or index.
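As a rough sketch of that last point, again using Python's sqlite3 module with assumed table and column names, a composite primary key on (user_id, item_id) lets the database reject a second rating from the same user for the same item. Changing a rating then becomes an explicit replace. SQLite's INSERT OR REPLACE is used here; other DBMSs have their own upsert syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: the composite primary key enforces one rating
# per (user, item) pair and also serves as an index for lookups.
conn.execute(
    """
    CREATE TABLE ratings (
        item_id INTEGER NOT NULL,
        user_id INTEGER NOT NULL,
        rating  REAL    NOT NULL,
        PRIMARY KEY (user_id, item_id)
    )
    """
)
conn.execute("INSERT INTO ratings (item_id, user_id, rating) VALUES (1, 1, 4.0)")

# A second plain insert for the same user and item violates the key.
try:
    conn.execute(
        "INSERT INTO ratings (item_id, user_id, rating) VALUES (1, 1, 5.0)"
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# To change an existing rating, replace the row explicitly.
conn.execute(
    "INSERT OR REPLACE INTO ratings (item_id, user_id, rating) VALUES (1, 1, 5.0)"
)
stored = conn.execute(
    "SELECT rating FROM ratings WHERE user_id = 1 AND item_id = 1"
).fetchone()[0]
row_count = conn.execute("SELECT COUNT(*) FROM ratings").fetchone()[0]
```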