JSONB Array of Strings (with GIN index) versus Split Rows (B-Tree Index)

Question 1

I have a database which stores receiver to indicate which account the data relates to. This has led to tons of duplication of data, as one set of data may create 3 separate rows, where all column data is the same with the exception of the receiver column. While redesigning the database, I have considered using an array with a GIN index instead of the current B-Tree index on receiver.

Current table definition:

CREATE TABLE public.actions (
 global_sequence bigint NOT NULL DEFAULT nextval('actions_global_sequence_seq'::regclass),
 time timestamp with time zone NOT NULL DEFAULT CURRENT_TIMESTAMP,
 receiver text NOT NULL,
 tx_id text NOT NULL,
 block_num integer NOT NULL,
 contract text NOT NULL,
 action text NOT NULL,
 data jsonb NOT NULL
);

Indexes:

"actions_pkey" PRIMARY KEY, btree (global_sequence, time)
"actions_time_idx" btree (time DESC)
"receiver_idx" btree (receiver)

Field details:

Global sequence is a serially incrementing ID
Block number and time are not unique, but also incrementing
Global sequence and time are primary key, as the data is internally partitioned by time
- There are some receivers that have over 1 billion associated actions (each with a unique global_sequence).
Average text lengths:
- Receiver: 12
- tx_id: 52
- contract: 12
- action: 6
- data: small-medium sized JSONB with action metadata

Cardinality of 3 schema options:

Current: sitting at 4.2 billion rows in this table
Receiver as array: Would be at approximately 1.8 billion rows
Normalized: There would be 3 tables:
- Actions: 1.8 billion rows
- Actions_Accounts: 4.2 billion rows
- Accounts: 500 000 rows

Common Query:

SELECT * FROM actions WHERE receiver = 'Alpha' ORDER BY time DESC LIMIT 100

All columns are required in the query. NULL values are not seen. I believe joins in the normalized schema would slow down & query speed is #1 priority)

Question 2

You added some good information. Some more is missing. Consider instructions here. NOT NULL or other constraints? (Actual CREATE TABLE scripts are the one true source of information.) Avg. text length, percentage of NULL values.The new information is out of sync with the old. Which is the set of duplicative columns now? Please make it consistent and move essential information from your comments to the question. And do you actually need SELECT * in your common query? Or can you reduce the SELECT list to a few columns?

Question 3

I have added all the requested information @ErwinBrandstetter

Question 4

Why PRIMARY KEY, btree (global_sequence, time)? Is global_sequence not UNIQUE? Seems it should be. And: one set of data may create 3 separate rows - knowing the exact story behind this might allow for more optimization. But that's probably a job for (paid) consultant.

Question 5

Its the limitation of our timeseries extension, which requires the partitioning key (time) to be part of the primary key. One set of data is simply creating 3 seperate rows at the moment to allow receiver column to be covered by a B-tree index, it does not need to.

Question 6

So global_sequence is unique after all?

Question 7

The optimal DB design always depends on the complete picture.

Generally, there is hardly anything faster than a plain btree index for your simple query. Introducing json or jsonb or even a plain array type in combination with a GIN index will most likely make it slower.

With your original table this multicolumn index with the right sort order should be a game changer for your common query:

CREATE INDEX game_changer ON actions (receiver, time DESC);

This way, Postgres can just pick the top 100 rows from the index directly. Super fast.

Optimizing queries on a range of timestamps (two columns)

Your current indexes receiver_idx and actions_time_idx may lose their purpose.

Next to the perfect index, storage size is an important factor for big tables, so avoiding duplication may be the right idea. But that can be achieved in various ways. Have you considered good old normalization, yet?

CREATE TABLE receiver (
 receiver_id serial PRIMARY KEY
 , receiver text NOT NULL -- UNIQUE?
);
CREATE TABLE action ( -- I shortened the name to "action"
 action_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
 -- global_sequence bigint NOT NULL DEFAULT nextval('actions_global_sequence_seq'::regclass), -- ??
 time timestamptz NOT NULL DEFAULT now(),
 block_num int NOT NULL,
 tx_id text NOT NULL,
 contract text NOT NULL,
 action text NOT NULL,
 data jsonb NOT NULL
)
CREATE TABLE receiver_action (
 receiver_id int REFERENCES receiver
 , action_id bigint REFERENCES action
 , PRIMARY KEY (receiver_id, action_id)
);

Also note the changed order of columns in table action, saves a couple of bytes per row, which makes a couple of GB for billions of rows.

See:

Your common query changes slightly to:

SELECT a.*
FROM receiver_action ra
JOIN action a USING (action_id)
WHERE ra. receiver_id = (SELECT receiver_id FROM receiver WHERE receiver = 'Alpha')
ORDER BY a.time DESC
LIMIT 100;

Drawback: it's much harder to make your common query fast now. Related:

Can spatial index help a "range - order by - limit" query

The quick (and slightly dirty) fix: include the time column in table receiver_action redundantly (or move it there).

CREATE TABLE receiver_action (
 receiver_id int REFERENCES receiver
 , action_id bigint REFERENCES action
 , time timestamptz NOT NULL DEFAULT now() -- !
 , PRIMARY KEY (receiver_id, action_id)
);

Create an index:

CREATE INDEX game_changer ON receiver_action (receiver_id, time DESC) INCLUDE (action_id);

INCLUDE requires Postgres 11 or later. See:

Does a query with a primary key and foreign keys run faster than a query with just primary keys?

And use this query:

SELECT a.*
FROM (
 SELECT action_id
 FROM receiver_action
 WHERE receiver_id = (SELECT receiver_id FROM receiver WHERE receiver = 'Alpha')
 ORDER BY time DESC
 LIMIT 100
 )
JOIN action a USING (action_id);

Depending on the exact story behind one set of data may create 3 separate rows more may be possible - even 3 separate columns in table action instead of the n:m implementation and a expression GIN index ... But that's going in too deep. I tap out here.

Question 8

The undisclosed ID is an incremented bigserial as you correctly assumed. Wouldn't joins on this table be slower than a GIN index? There are some receivers that have over 1 billion receiver_connections. Typically in the user dashboard, the common query is SELECT * FROM table WHERE Receiver = Alpha ORDER BY date DESC LIMIT 100

Question 9

I have also updated the question with full details about the schema. If there is anything else, let me know.

Question 10

@SyedJafri why assume a join would be slow? Databases were born to join. Try both ways and see for yourself, but anticipate a lot of difficulties down the line if you start adding arrays or json to a database as a substitute for basic normalisation.

Question 11

Yes, with a table this size, I will definitely be trying different schemas and the normalized table is one of them. The #1 priority at the moment is query speed.

Question 12

With all columns NOT NULL, my assumptions basically hold. I cleaned up my answer.

score 4 · Accepted Answer · 2019-02-04 18:23:20Z

The optimal DB design always depends on the complete picture.

Generally, there is hardly anything faster than a plain btree index for your simple query. Introducing json or jsonb or even a plain array type in combination with a GIN index will most likely make it slower.

With your original table this multicolumn index with the right sort order should be a game changer for your common query:

CREATE INDEX game_changer ON actions (receiver, time DESC);

This way, Postgres can just pick the top 100 rows from the index directly. Super fast.

Optimizing queries on a range of timestamps (two columns)

Your current indexes receiver_idx and actions_time_idx may lose their purpose.

Next to the perfect index, storage size is an important factor for big tables, so avoiding duplication may be the right idea. But that can be achieved in various ways. Have you considered good old normalization, yet?

CREATE TABLE receiver (
 receiver_id serial PRIMARY KEY
 , receiver text NOT NULL -- UNIQUE?
);
CREATE TABLE action ( -- I shortened the name to "action"
 action_id bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
 -- global_sequence bigint NOT NULL DEFAULT nextval('actions_global_sequence_seq'::regclass), -- ??
 time timestamptz NOT NULL DEFAULT now(),
 block_num int NOT NULL,
 tx_id text NOT NULL,
 contract text NOT NULL,
 action text NOT NULL,
 data jsonb NOT NULL
)
CREATE TABLE receiver_action (
 receiver_id int REFERENCES receiver
 , action_id bigint REFERENCES action
 , PRIMARY KEY (receiver_id, action_id)
);

Also note the changed order of columns in table action, saves a couple of bytes per row, which makes a couple of GB for billions of rows.

See:

Your common query changes slightly to:

SELECT a.*
FROM receiver_action ra
JOIN action a USING (action_id)
WHERE ra. receiver_id = (SELECT receiver_id FROM receiver WHERE receiver = 'Alpha')
ORDER BY a.time DESC
LIMIT 100;

Drawback: it's much harder to make your common query fast now. Related:

Can spatial index help a "range - order by - limit" query

The quick (and slightly dirty) fix: include the time column in table receiver_action redundantly (or move it there).

CREATE TABLE receiver_action (
 receiver_id int REFERENCES receiver
 , action_id bigint REFERENCES action
 , time timestamptz NOT NULL DEFAULT now() -- !
 , PRIMARY KEY (receiver_id, action_id)
);

Create an index:

CREATE INDEX game_changer ON receiver_action (receiver_id, time DESC) INCLUDE (action_id);

INCLUDE requires Postgres 11 or later. See:

Does a query with a primary key and foreign keys run faster than a query with just primary keys?

And use this query:

SELECT a.*
FROM (
 SELECT action_id
 FROM receiver_action
 WHERE receiver_id = (SELECT receiver_id FROM receiver WHERE receiver = 'Alpha')
 ORDER BY time DESC
 LIMIT 100
 )
JOIN action a USING (action_id);

Depending on the exact story behind one set of data may create 3 separate rows more may be possible - even 3 separate columns in table action instead of the n:m implementation and a expression GIN index ... But that's going in too deep. I tap out here.

The undisclosed ID is an incremented bigserial as you correctly assumed. Wouldn't joins on this table be slower than a GIN index? There are some receivers that have over 1 billion receiver_connections. Typically in the user dashboard, the common query is SELECT * FROM table WHERE Receiver = Alpha ORDER BY date DESC LIMIT 100
I have also updated the question with full details about the schema. If there is anything else, let me know.
@SyedJafri why assume a join would be slow? Databases were born to join. Try both ways and see for yourself, but anticipate a lot of difficulties down the line if you start adding arrays or json to a database as a substitute for basic normalisation.
Yes, with a table this size, I will definitely be trying different schemas and the normalized table is one of them. The #1 priority at the moment is query speed.
With all columns NOT NULL, my assumptions basically hold. I cleaned up my answer.

Stack Exchange Network

JSONB Array of Strings (with GIN index) versus Split Rows (B-Tree Index)

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions