Let's pretend I'm making a basic twitter clone. We could imagine our database is set up as the following:
1. Table 1 - contains user info {username, password}
2. Table 2 - contains session info {username, session_id, expire}
3. Table 3 - contains post info {username, post_content, number_of_likes}
These tables are basically grouped by function. However username
is shared between all of them. If a user was to change their username, this would be very hard to maintain since it would have to update across 3 tables. Is there a suggested way to organize user data here? E.g. one giant table? (bleh) or perhaps a way to centrally reference username so that changing the value once would update all dependent cells? I am using postgresql and am new to the language and databases.
-
2Read up on "database normalization". That's basically what you're asking about here.GrandmasterB– GrandmasterB2022年11月18日 04:23:38 +00:00Commented Nov 18, 2022 at 4:23
1 Answer 1
The typical solution would be to introduce a surrogate key, i.e. some ID number that's just used within the database. Then we might have a schema:
CREATE TABLE users ( userid bigserial PRIMARY KEY, -- will autoincrement username text NOT NULL, password blob NOT NULL ); CREATE TABLE sessions ( session_id int NOT NULL, userid int REFERENCES users(userid), expire timestamp NOT NULL ); CREATE TABLE posts ( postid int PRIMARY KEY, userid int REFERENCES users(userid), content text NOT NULL, likes int NOT NULL DEFAULT 0 );
But you don't need it. You can still use the username as a natural key. You are concerned that
If a user was to change their username, this would be very hard to maintain since it would have to update across 3 tables.
But such an update would be fairly easy. You could explicitly create a TRANSACTION
consisting of multiple queries, one updating each table.
BEGIN TRANSACTION;
UPDATE users ...;
UPDATE sessions ...;
UPDATE posts ...;
COMMIT;
Or, more realistically, we could use a REFERENCES
clause in the table definition to track updates. You already came up with this solution yourself:
or perhaps a way to centrally reference username so that changing the value once would update all dependent cells?
In our table definition, we'd can CASCADE
such changes with a column definition like
CREATE TABLE posts ( username text REFERENCES users(username) ON DELETE CASCADE ON UPDATE CASCADE, ... );
Then, you can DELETE
all of one user's data just by deleting their entry in the users
table, or update their username in all tables just by updating it in the users
table.
The point here is that RDBMS like Postgres are really mature systems with lots of features. These features generally help with keeping the data consistent – you can add constraints like the REFERENCES
clause to ensure that no entries in the other tables can exist without a matching row in the users
table, but you can also use such features to automatically update all of the dependent data.