Database design for mapping users to questions

Question 1

I'm currently building a quiz app and I need to design a database schema for it.

This is what I've tried so far: ERD Diagram

The biggest problem I have is with how to represent which questions have been already answered correctly by each user and which haven't.

In the current design the answered_questions table would grow very fast (assuming a pool of 2000 questions and 1million users it would be 2 billion entries, which doesn't seem to be a scalable approach). I thought of partitioning the answered_questions table by user_id but I don't if that's the most optimal solution.

Please note that I need to keep the information about which question is already answered because for each request i will have to retrieve the following information:

5 questions that have been correctly answered
5 questions that have not been correctly answered

filtered by the category, subcategory and user_id.

How could I redesign it to avoid having to query the answered_questions table which can become very large over time?

Any help is very much appreciated.

Question 2

"it would be 2 billion entries, which doesn't seem to be a scalable approach...I thought of partitioning the answered_questions table by user_id" - Size of data at rest (total size of the table) is irrelevant to scalability. A 2 billion rows table is just as queryable as a 200 row table, depending on how much data you're actually asking for out of the table. Partitioning isn't meant for improving DQL query performance, that's a common fallacy. Proper indexing will "partition" your data in a much more efficient manner.

Question 3

@J.D. so in my case I know that I'll be often querying the answered_questions table (2 billion rows) by the user_id field, then always perform the JOIN operation with the questions table on question_id and then filter by category and subcategory to return 10 questions. Given that, would creating an index on question_id and user_id be reasonable?

Question 4

Usually seeing the exact query is important, but I understand you, and yes generally speaking that's probably an index worth having.

Question 5

Correction @J.D. , partitioning DOES improve DQL queries as per PostgreSQL documentaiton postgresql.org/docs/current/ddl-partitioning.html. It's not a common fallacy. There are circumstances where it wouldn't be beneficial of course (there is no silver bullet after all), but it does improve performance (potentially drastically) when fetching data that correspond to one or few partitions. @ Natural Unintelligence, You may read the documentation in detail further.

Question 6

@Chessbrain Disagree. Partitioning isn't meant for improving DQL performance (regardless if it does in some specific cases). Indexing is. The docs overplay it and don't even specifically state it's a replacement or better than indexing. By the same logic, I could say deleting all your data is meant to improve DQL performance. "but it does improve performance (potentially drastically) when fetching data that correspond to one or few partitions" - That's true for indexes too, since that scenario would elicit a seek operation. Partitioning can improve data management performance though.

Question 7

I think that you should modify your design so that answered_questions only contains rows for questions that were answered. In other words, do away with the boolean answered column and don't store the rows where it would be FALSE.

I doubt that all 2000 users will answer one million questions, so that should reduce the size of the table considerably.

Question 8

Regarding table questions: Having correct_answers as TEXT isn't practical.

If it's a multiple choice test...

Then each question has several pre-defined answers that the user can pick. Are you going to put an array of answers into that TEXT column? It would be more practical to have an answers table like so:

answer_id PK
question_id FK into questions
answer_is_correct BOOL
answer_score INT if you want to assign scores to answers
answer_order INT

If you want to order answers in the display and especially change the order, then you should never do that by changing the primary key because that will break references, so an extra column with an order number can be useful. Without that they will come out of the select in undefined order, which isn't necessarily a problem in this case.

Table answered_questions should contain at least:

user_id FK into users
answer_id FK into answers

However that makes it difficult to enforce constraints like "one row per user and per question", so we need to add question_id:

user_id FK into users
question_id INT NOT NULL
answer_id INT NOT NULL
(question_id,answer_id) REFERENCES answers(question_id,answer_id)
PRIMARY KEY (user_id,question_id,answer_id)

The foreign key enforces integrity and the PK ensures the table can have at most one answer per question for each user. It also provides a useful index.

If the user has not answered this question, then the absence of the row with (user_id,question_id) is enough information. There is no need to add a bool to record whether the question was answered or not, and clog the table with lots of rows that serve no purpose. As a general rule, rows with a bool column that really says "this row shouldn't be there"... shouldn't be there.

The biggest problem I have is with how to represent which questions have been already answered correctly by each user and which haven't.

You can replicate the answer_is_correct column from the answers table into the answered_questions table, but that comes with a gotcha: since it's not part of the PK in answers, you can't really put a FK on it, so if you want to update it in table answers, it won't be replicated to table answers_questions. If that's okay with you then the advantage of duplicating the column is you can put an index on it (user_id,question_id,answer_is_correct) to speed up selects.

Personally I'd use another solution, and not duplicate the columns. In order to get the information you want you can use a select like so:

SELECT 
 q.*, 
 a.answer_id,
 a.is_correct
FROM questions q
LEFT JOIN answered_questions aq ON (aq.question_id=q.question_id AND aq.user_id=...)
LEFT JOIN answers a ON (a.answer_id = aq.answer_id)

You could add other conditions like question category in WHERE clause. For each (user_id,question_id) table answered_questions either contains one row or no rows, so if the question was not answered, due to LEFT JOIN, a.answer_id will be NULL. In fact you could only look at a.is_correct in the result set: NULL if the question was not answered, true/false if it was answered correctly/incorrectly.

This scans all questions, there's no other choice really. Thankfully answered_questions has a multicolumn index on (user_id,question_id,answer_id) so postgres will use a fast index only scan. The index is clustered in the proper way, order of columns is important here, we want rows with the same (user_id,question_id) to be on adjacent pages for speed. Especially putting user_id first is important as this lets pages for users who are not online at the moment fall out of cache without causing trouble.

In any case this is not the type of query you do on each page: if it's a website, you do it once per user at login, get the final results you want, then cache that in the session, and only update when needed or maybe at most every minute.

Laurenz Albe Laurenz Albe 62.1k4 gold badges57 silver badges93 bronze badges · Answer 1 · 2023-10-30 04:11:40Z

I think that you should modify your design so that answered_questions only contains rows for questions that were answered. In other words, do away with the boolean answered column and don't store the rows where it would be FALSE.

I doubt that all 2000 users will answer one million questions, so that should reduce the size of the table considerably.

bobflux bobflux 1,7761 gold badge9 silver badges7 bronze badges · Answer 2 · 2023-11-03 15:18:32Z

Regarding table questions: Having correct_answers as TEXT isn't practical.

If it's a multiple choice test...

Then each question has several pre-defined answers that the user can pick. Are you going to put an array of answers into that TEXT column? It would be more practical to have an answers table like so:

answer_id PK
question_id FK into questions
answer_is_correct BOOL
answer_score INT if you want to assign scores to answers
answer_order INT

If you want to order answers in the display and especially change the order, then you should never do that by changing the primary key because that will break references, so an extra column with an order number can be useful. Without that they will come out of the select in undefined order, which isn't necessarily a problem in this case.

Table answered_questions should contain at least:

user_id FK into users
answer_id FK into answers

However that makes it difficult to enforce constraints like "one row per user and per question", so we need to add question_id:

user_id FK into users
question_id INT NOT NULL
answer_id INT NOT NULL
(question_id,answer_id) REFERENCES answers(question_id,answer_id)
PRIMARY KEY (user_id,question_id,answer_id)

The foreign key enforces integrity and the PK ensures the table can have at most one answer per question for each user. It also provides a useful index.

If the user has not answered this question, then the absence of the row with (user_id,question_id) is enough information. There is no need to add a bool to record whether the question was answered or not, and clog the table with lots of rows that serve no purpose. As a general rule, rows with a bool column that really says "this row shouldn't be there"... shouldn't be there.

The biggest problem I have is with how to represent which questions have been already answered correctly by each user and which haven't.

You can replicate the answer_is_correct column from the answers table into the answered_questions table, but that comes with a gotcha: since it's not part of the PK in answers, you can't really put a FK on it, so if you want to update it in table answers, it won't be replicated to table answers_questions. If that's okay with you then the advantage of duplicating the column is you can put an index on it (user_id,question_id,answer_is_correct) to speed up selects.

Personally I'd use another solution, and not duplicate the columns. In order to get the information you want you can use a select like so:

SELECT 
 q.*, 
 a.answer_id,
 a.is_correct
FROM questions q
LEFT JOIN answered_questions aq ON (aq.question_id=q.question_id AND aq.user_id=...)
LEFT JOIN answers a ON (a.answer_id = aq.answer_id)

You could add other conditions like question category in WHERE clause. For each (user_id,question_id) table answered_questions either contains one row or no rows, so if the question was not answered, due to LEFT JOIN, a.answer_id will be NULL. In fact you could only look at a.is_correct in the result set: NULL if the question was not answered, true/false if it was answered correctly/incorrectly.

This scans all questions, there's no other choice really. Thankfully answered_questions has a multicolumn index on (user_id,question_id,answer_id) so postgres will use a fast index only scan. The index is clustered in the proper way, order of columns is important here, we want rows with the same (user_id,question_id) to be on adjacent pages for speed. Especially putting user_id first is important as this lets pages for users who are not online at the moment fall out of cache without causing trouble.

In any case this is not the type of query you do on each page: if it's a website, you do it once per user at login, get the final results you want, then cache that in the session, and only update when needed or maybe at most every minute.

Stack Exchange Network

Database design for mapping users to questions

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Database design for mapping users to questions

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions