I'm currently building a quiz app and I need to design a database schema for it.
This is what I've tried so far: ERD Diagram
The biggest problem I have is with how to represent which questions have been already answered correctly by each user and which haven't.
In the current design the answered_questions table would grow very fast (assuming a pool of 2000 questions and 1million users it would be 2 billion entries, which doesn't seem to be a scalable approach). I thought of partitioning the answered_questions table by user_id but I don't if that's the most optimal solution.
Please note that I need to keep the information about which question is already answered because for each request i will have to retrieve the following information:
- 5 questions that have been correctly answered
- 5 questions that have not been correctly answered
filtered by the category, subcategory and user_id.
How could I redesign it to avoid having to query the answered_questions table which can become very large over time?
Any help is very much appreciated.
2 Answers 2
I think that you should modify your design so that answered_questions
only contains rows for questions that were answered. In other words, do away with the boolean answered
column and don't store the rows where it would be FALSE
.
I doubt that all 2000 users will answer one million questions, so that should reduce the size of the table considerably.
Regarding table questions
: Having correct_answers
as TEXT isn't practical.
- If it's a multiple choice test...
Then each question has several pre-defined answers that the user can pick. Are you going to put an array of answers into that TEXT column? It would be more practical to have an answers
table like so:
answer_id PK
question_id FK into questions
answer_is_correct BOOL
answer_score INT if you want to assign scores to answers
answer_order INT
If you want to order answers in the display and especially change the order, then you should never do that by changing the primary key because that will break references, so an extra column with an order number can be useful. Without that they will come out of the select in undefined order, which isn't necessarily a problem in this case.
Table answered_questions
should contain at least:
user_id FK into users
answer_id FK into answers
However that makes it difficult to enforce constraints like "one row per user and per question", so we need to add question_id:
user_id FK into users
question_id INT NOT NULL
answer_id INT NOT NULL
(question_id,answer_id) REFERENCES answers(question_id,answer_id)
PRIMARY KEY (user_id,question_id,answer_id)
The foreign key enforces integrity and the PK ensures the table can have at most one answer per question for each user. It also provides a useful index.
If the user has not answered this question, then the absence of the row with (user_id,question_id) is enough information. There is no need to add a bool to record whether the question was answered or not, and clog the table with lots of rows that serve no purpose. As a general rule, rows with a bool column that really says "this row shouldn't be there"... shouldn't be there.
The biggest problem I have is with how to represent which questions have been already answered correctly by each user and which haven't.
You can replicate the answer_is_correct column from the answers table into the answered_questions table, but that comes with a gotcha: since it's not part of the PK in answers, you can't really put a FK on it, so if you want to update it in table answers, it won't be replicated to table answers_questions. If that's okay with you then the advantage of duplicating the column is you can put an index on it (user_id,question_id,answer_is_correct) to speed up selects.
Personally I'd use another solution, and not duplicate the columns. In order to get the information you want you can use a select like so:
SELECT
q.*,
a.answer_id,
a.is_correct
FROM questions q
LEFT JOIN answered_questions aq ON (aq.question_id=q.question_id AND aq.user_id=...)
LEFT JOIN answers a ON (a.answer_id = aq.answer_id)
You could add other conditions like question category in WHERE clause. For each (user_id,question_id) table answered_questions either contains one row or no rows, so if the question was not answered, due to LEFT JOIN, a.answer_id will be NULL. In fact you could only look at a.is_correct in the result set: NULL if the question was not answered, true/false if it was answered correctly/incorrectly.
This scans all questions, there's no other choice really. Thankfully answered_questions has a multicolumn index on (user_id,question_id,answer_id) so postgres will use a fast index only scan. The index is clustered in the proper way, order of columns is important here, we want rows with the same (user_id,question_id) to be on adjacent pages for speed. Especially putting user_id first is important as this lets pages for users who are not online at the moment fall out of cache without causing trouble.
In any case this is not the type of query you do on each page: if it's a website, you do it once per user at login, get the final results you want, then cache that in the session, and only update when needed or maybe at most every minute.
answered_questions
table byuser_id
" - Size of data at rest (total size of the table) is irrelevant to scalability. A 2 billion rows table is just as queryable as a 200 row table, depending on how much data you're actually asking for out of the table. Partitioning isn't meant for improving DQL query performance, that's a common fallacy. Proper indexing will "partition" your data in a much more efficient manner.