I am working on a personal project where I need to be able to store and retrieve game statistics for a bunch of players and support very fast lookup on each player id. My current design (with unnecessary details omitted) looks something along the lines of
games: player1's score > player2's score > player3's score > player4's score
+--------------------------+--------------------+--------------------+--------------------+--------------------+
| id BIGSERIAL PRIMARY KEY | player1 VARCHAR(8) | player2 VARCHAR(8) | player3 VARCHAR(8) | player4 VARCHAR(8) |
+--------------------------+--------------------+--------------------+--------------------+--------------------+
| 1 | playerA | playerB | playerC | playerD |
+--------------------------+--------------------+--------------------+--------------------+--------------------+
| 2 | playerE | playerF | playerE | playerA |
+--------------------------+--------------------+--------------------+--------------------+--------------------+
| 3 | playerF | playerB | playerC | playerE |
+--------------------------+--------------------+--------------------+--------------------+--------------------+
player_games:
+-----------------------+-------------------+------------------+
| id SERIAL PRIMARY KEY | player VARCHAR(8) | gameid BIGSERIAL |
+-----------------------+-------------------+------------------+
| 1 | asdf | 1 |
+-----------------------+-------------------+------------------+
| 2 | asdf | 2 |
+-----------------------+-------------------+------------------+
| 3 | fdsa | 1 |
+-----------------------+-------------------+------------------+
| ... | ... | ... |
+-----------------------+-------------------+------------------+
and I will do a player lookup along the lines of
SELECT * FROM games WHERE id IN (SELECT gameid FROM player_games WHERE player='<player>')
Since I will be inserting tens of thousands of games per day, I am looking for ways to efficiently store data in player_games. The other alternative I am considering is to use an array, so instead we will have something along the lines of
player_games:
+-------------------------------+---------------------+
| player VARCHAR(8) PRIMARY KEY | gameids BIGSERIAL[] |
+-------------------------------+---------------------+
and I will do a lookup with
SELECT * FROM games WHERE id IN unnest(SELECT gameids FROM player_games WHERE player='<player>')
Which option is the better option here, and in the case of the first, is it beneficial to have an index on the player column? I will be batch inserting roughly 4000 rows per hour (90000 rows per day) into player_games after populating the historical data.
-
I don't understand the design of the games table at the top. Can you give a sample row corresponding to the rows in player_games? How many players participate in a game? – Lennart - Slava Ukraini, Jun 12, 2018 at 20:37
-
3 or 4 players participate per game and have an omitted score column per player. I have also omitted some game metadata. – incertia, Jun 12, 2018 at 20:41
2 Answers
Given the information in the question I would start out with something like:
CREATE TABLE players
( player char(8) not null primary key
, additional attributes );
CREATE TABLE games
( game_id int not null primary key
, additional attributes );
CREATE TABLE player_games -- (match?)
( player char(8) not null
references players (player)
, game_id int not null
references games (game_id)
, participant_no smallint not null
, constraint ... check (participant_no between 1 and 4)
, primary key (player, game_id)
, unique (game_id, participant_no) );
CREATE TABLE results
( game_id int not null
, player char(8) not null
, score ... not null
, foreign key (game_id, player)
references player_games (game_id, player) );
Examples of queries that can easily be answered:
Which games has a player participated in?
SELECT game_id
FROM player_games
WHERE player = ?
JOIN with games if you need more info from each game.
Which players participated in a game?
SELECT player
FROM player_games
WHERE game_id = ?
JOIN with players if you need more info from each player.
Order the players from game X according to their score:
SELECT player, score
FROM results
WHERE game_id = X
ORDER BY score
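The three lookups above can be tried end to end with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, so the column types are simplified. The sample rows and the integer score type are assumptions for illustration only; the answer leaves the score type open.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked
conn.executescript("""
CREATE TABLE players (player TEXT NOT NULL PRIMARY KEY);
CREATE TABLE games (game_id INTEGER NOT NULL PRIMARY KEY);
CREATE TABLE player_games (
    player TEXT NOT NULL REFERENCES players (player),
    game_id INTEGER NOT NULL REFERENCES games (game_id),
    participant_no INTEGER NOT NULL CHECK (participant_no BETWEEN 1 AND 4),
    PRIMARY KEY (player, game_id),
    UNIQUE (game_id, participant_no)
);
CREATE TABLE results (
    game_id INTEGER NOT NULL,
    player TEXT NOT NULL,
    score INTEGER NOT NULL,  -- assumed type, not specified in the answer
    FOREIGN KEY (player, game_id) REFERENCES player_games (player, game_id)
);
""")

# Made-up sample data: two games with three participants each.
conn.executemany("INSERT INTO players VALUES (?)",
                 [("playerA",), ("playerB",), ("playerC",)])
conn.executemany("INSERT INTO games VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO player_games VALUES (?, ?, ?)", [
    ("playerA", 1, 1), ("playerB", 1, 2), ("playerC", 1, 3),
    ("playerA", 2, 1), ("playerC", 2, 2), ("playerB", 2, 3),
])
conn.executemany("INSERT INTO results VALUES (?, ?, ?)", [
    (1, "playerA", 30), (1, "playerB", 10), (1, "playerC", 20),
])

# Which games has a player participated in?
games_of_a = [row[0] for row in conn.execute(
    "SELECT game_id FROM player_games WHERE player = ? ORDER BY game_id",
    ("playerA",))]

# Which players participated in a game?
players_in_1 = [row[0] for row in conn.execute(
    "SELECT player FROM player_games WHERE game_id = ? ORDER BY participant_no",
    (1,))]

# Players of game 1 ordered by score (ascending, as in the query above).
ranking = conn.execute(
    "SELECT player, score FROM results WHERE game_id = ? ORDER BY score",
    (1,)).fetchall()

print(games_of_a, players_in_1, ranking)
```

Because (player, game_id) is the primary key of player_games, the first lookup is served by the primary key index without any extra index on player.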
SELECT *
FROM games
WHERE id IN (SELECT gameid FROM player_games WHERE player='<player>')
Just rewrite that as a JOIN
SELECT g.*
FROM games AS g
JOIN player_games AS pg ON pg.gameid = g.id
WHERE pg.player = $1;
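To check that the JOIN form returns the same rows as the IN subquery, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL. The sample games are made up, and player_games is filled with one row per (player, game); if that table held duplicate pairs, the JOIN would return duplicate game rows while IN would not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE games (
    id INTEGER PRIMARY KEY,
    player1 TEXT, player2 TEXT, player3 TEXT, player4 TEXT
);
CREATE TABLE player_games (
    id INTEGER PRIMARY KEY,
    player TEXT,
    gameid INTEGER
);
""")

# Made-up sample games; game 3 has only three participants.
games = [
    (1, "playerA", "playerB", "playerC", "playerD"),
    (2, "playerF", "playerB", "playerC", "playerE"),
    (3, "playerE", "playerF", "playerA", None),
]
conn.executemany("INSERT INTO games VALUES (?, ?, ?, ?, ?)", games)

# One player_games row per participant of each game.
conn.executemany(
    "INSERT INTO player_games (player, gameid) VALUES (?, ?)",
    [(p, g[0]) for g in games for p in g[1:] if p is not None],
)

in_rows = conn.execute(
    "SELECT * FROM games "
    "WHERE id IN (SELECT gameid FROM player_games WHERE player = ?) "
    "ORDER BY id",
    ("playerE",),
).fetchall()

join_rows = conn.execute(
    "SELECT g.* FROM games AS g "
    "JOIN player_games AS pg ON pg.gameid = g.id "
    "WHERE pg.player = ? "
    "ORDER BY g.id",
    ("playerE",),
).fetchall()

print(in_rows == join_rows)
```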
If you haven't implemented any of this yet, a minor note: using id as a column name is really an anti-pattern and should be avoided as a naming convention. JOINing on an array is a bad idea. I would leave things the way they are. On the high side, how many games are people playing? Aggregating 10,000 games should be very fast.
Which option is the better option here, and in the case of the first, is it beneficial to have an index on the player column?
Yes, you need to index the foreign key and the column you're selecting on.
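One way to see what the index changes is to compare the query plan before and after creating it. The sketch below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, where you would run CREATE INDEX and EXPLAIN directly. The index name player_games_player_idx is made up for illustration, and the exact plan text varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE player_games ("
    "id INTEGER PRIMARY KEY, player TEXT, gameid INTEGER)"
)

LOOKUP = "SELECT gameid FROM player_games WHERE player = ?"


def plan(sql):
    """Return the planner's textual description of each step of the query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, ("playerA",))
    return [row[3] for row in rows]  # the fourth column holds the detail text


# Without an index on player, the lookup has to scan the whole table.
plan_before = plan(LOOKUP)

# Hypothetical index name; the indexed column is the one used in the WHERE clause.
conn.execute("CREATE INDEX player_games_player_idx ON player_games (player)")

# With the index in place, the planner can search it instead of scanning.
plan_after = plan(LOOKUP)

print(plan_before)
print(plan_after)
```

The same comparison in PostgreSQL is more meaningful on a table with realistic row counts, since the planner there may still prefer a sequential scan on a very small table.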
Also, do not use VARCHAR(8) for a playerid. You should be using int for an id, and something named like player_username as a text field. In PostgreSQL we rarely use varchar(8), as it offers nothing but a length check to slow down what should often be an unrestricted column.
Even if you were using arrays, you shouldn't be using them like this,
SELECT *
FROM games
WHERE id IN unnest(
SELECT gameids
FROM player_games
WHERE player='<player>'
);
Instead, write something like this with the containment operator @>
SELECT *
FROM games AS g
JOIN player_games AS pg ON pg.gameids @> ARRAY[g.id]
WHERE pg.player = $1;
Or you can use ANY,
SELECT *
FROM games AS g
JOIN player_games AS pg ON g.id = ANY(pg.gameids)
WHERE pg.player = $1;
You may also want to look into the intarray extension. But, again, I would never do this.
Update
I would fix the schema,
CREATE SCHEMA local;
CREATE TABLE local.players (
playerid int PRIMARY KEY GENERATED BY DEFAULT AS IDENTITY
-- , ... your profile and stuff, perhaps the login to your own system
);
CREATE SCHEMA thirdparty;
CREATE TABLE thirdparty.username_to_players (
playerid int REFERENCES local.players,
thirdparty_username text
);
This would make your query look like,
SELECT lg.*
FROM local.games AS lg
JOIN player_games AS pg ON pg.gameid = lg.id
-- playerid is your internal player id EVERYWHERE
JOIN thirdparty.username_to_players AS tpup USING (playerid)
WHERE tpup.thirdparty_username = $1;
I would still use text there and not varchar(8), because why not? If they send a 9-character player id, are you going to call up the provider on the phone and tell them they're breaking the spec? And if you do, will they likely care? Not worth my time, and I don't particularly care if they lie to me about their schema; I just want fast, working, reliable operations. As a third party provider, it's not your duty to ensure their data meets their claims about it.
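As a rough illustration of looking up games through the username mapping, here is a sketch using Python's built-in sqlite3 module as a stand-in for PostgreSQL. The local and thirdparty schema qualifiers are dropped because an in-memory SQLite database has no equivalent, and the sample rows plus the playerid and gameid columns on player_games are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE players (playerid INTEGER PRIMARY KEY);
CREATE TABLE username_to_players (
    playerid INTEGER REFERENCES players (playerid),
    thirdparty_username TEXT
);
CREATE TABLE games (id INTEGER PRIMARY KEY);
-- Assumed shape: player_games stores the internal playerid, not the username.
CREATE TABLE player_games (
    playerid INTEGER REFERENCES players (playerid),
    gameid INTEGER REFERENCES games (id)
);
""")

# Made-up sample data.
conn.executemany("INSERT INTO players VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO username_to_players VALUES (?, ?)",
                 [(1, "asdf"), (2, "fdsa")])
conn.executemany("INSERT INTO games VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO player_games VALUES (?, ?)",
                 [(1, 1), (1, 2), (2, 1)])


def games_for(username):
    """Return the ids of all games played by the given third-party username."""
    rows = conn.execute(
        "SELECT g.id FROM games AS g "
        "JOIN player_games AS pg ON pg.gameid = g.id "
        "JOIN username_to_players AS tpup ON tpup.playerid = pg.playerid "
        "WHERE tpup.thirdparty_username = ? "
        "ORDER BY g.id",
        (username,),
    )
    return [row[0] for row in rows]


print(games_for("asdf"), games_for("fdsa"))
```

Only the mapping table ever sees the external username, so a change in the provider's naming rules touches one column rather than every table that references a player.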
-
Thanks for the speedy response! Player ids in this case are just player names and the service I am pulling the logs from enforces them to be 8 characters maximum. Should I still be using text in this case? – incertia, Jun 12, 2018 at 20:38
-
@incertia see the update. – Evan Carroll, Jun 12, 2018 at 20:59