
We have an attribute-oriented data management system with bitemporality and flexible tenant prioritization logic. Records are deleted by inserting the same row with DELETED status. Tenants may inherit data from other tenants and change or overwrite it in their own tenant, which does not affect the original data. The prioritization of tenants can be changed for each tenant. Because of these two features we need to eliminate some rows based on the tenant priorities, which we solved with a subquery that has a CASE expression in its join condition:

SELECT bd.INSTANCE_ID,
       bd.ATTRIBUTE_ID,
       bd.INSERTED,
       bd.MODIFIED_RECORD,
       bd.EFFECTIVE,
       bd.DISCONTINUE,
       bd.STATUS,
       bd.TENANT,
       bd.VALUE
FROM MY_DATA bd
WHERE 1 = 1
  AND bd.INSERTED <= ?
  AND bd.INSTANCE_ID IN (SELECT INSTANCE_ID
                         FROM MY_DATA
                         WHERE STATUS IN (?)
                           AND INSERTED <= ?
                           AND TENANT IN (?, ?))
  AND (bd.STATUS <> 'DELETED')
  AND bd.ATTRIBUTE_ID IN (?, ?, ?, ?, ?, ?, ?, ?)
  AND (bd.INSTANCE_ID, bd.ATTRIBUTE_ID, bd.INSERTED)
      NOT IN (SELECT dc.INSTANCE_ID, dc.ATTRIBUTE_ID, dc.MODIFIED_RECORD
              FROM MY_DATA dc
              WHERE dc.STATUS = 'DELETED'
                AND dc.MODIFIED_RECORD IS NOT NULL
                AND dc.INSERTED <= ?
                AND dc.TENANT IN (?, ?)
                AND CASE WHEN bd.TENANT = 1 THEN 1
                         WHEN bd.TENANT = 2 THEN 3
                         ELSE -1
                    END <= CASE WHEN dc.TENANT = 1 THEN 1
                                WHEN dc.TENANT = 2 THEN 3
                                ELSE -1
                           END)
  AND bd.TENANT IN (?, ?)
ORDER BY bd.INSTANCE_ID ASC,
         bd.ATTRIBUTE_ID ASC,
         CASE WHEN bd.TENANT = 1 THEN 1 WHEN bd.TENANT = 2 THEN 3 ELSE -1 END ASC,
         bd.EFFECTIVE ASC,
         bd.INSERTED ASC

Now we have a test system with ~120,000 rows, of which ~300 have DELETED status. All rows belong to the same tenant, so the CASE expression is not strictly necessary here. But since we use generic query building, its presence is determined by entries in another table (the user-definable tenant priorities). And since the feature is needed, optimizing the query to omit the clause would only be a workaround, not a solution.
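To illustrate the generic building (only a sketch; the table name tenant_priority and its columns are assumptions, not our actual schema), the CASE expression is generated from rows shaped roughly like this:

CREATE TABLE tenant_priority
(
    tenant   bigint  NOT NULL PRIMARY KEY,  -- hypothetical: tenant whose priority is defined
    priority integer NOT NULL               -- hypothetical: priority value used in the generated CASE
);

-- With rows (1, 1) and (2, 3) the query builder would emit the expression seen above:
--   CASE WHEN tenant = 1 THEN 1 WHEN tenant = 2 THEN 3 ELSE -1 END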

When we execute this query on PostgreSQL it is painfully slow: ~20 minutes execution time. The same query on the same data on Oracle or MariaDB (installed on the same machine) takes only a few seconds.

This is the result from PostgreSQL EXPLAIN (ANALYZE, BUFFERS):

QUERY PLAN
Sort (cost=133536462.09..133536522.26 rows=24068 width=125) (actual time=825446.289..825453.855 rows=47875 loops=1)
 Sort Key: bd.instance_id, bd.attribute_id, (CASE WHEN (bd.tenant = 1) THEN 1 WHEN (bd.tenant = 2) THEN 3 ELSE '-1'::integer END), bd.effective, bd.inserted
 Sort Method: external merge Disk: 4608kB
 Buffers: shared hit=87702007, temp read=576 written=577
 -> Nested Loop (cost=4675.23..133534710.56 rows=24068 width=125) (actual time=125.801..825062.707 rows=47875 loops=1)
 Buffers: shared hit=87702004
 -> HashAggregate (cost=4674.81..4787.10 rows=11229 width=8) (actual time=100.839..147.466 rows=11192 loops=1)
 Group Key: my_data.instance_id
 Batches: 1 Memory Usage: 1169kB
 Buffers: shared hit=1822
 -> Seq Scan on my_data (cost=0.00..4358.54 rows=126506 width=8) (actual time=0.025..69.096 rows=126525 loops=1)
 Filter: ((tenant = ANY ('{1,2}'::bigint[])) AND ((status)::text = 'RELEASED'::text) AND (inserted <= CURRENT_TIMESTAMP))
 Rows Removed by Filter: 302
 Buffers: shared hit=1822
 -> Index Scan using my_data_pkey on my_data bd (cost=0.42..11891.49 rows=2 width=121) (actual time=17.175..73.675 rows=4 loops=11192)
 Index Cond: ((instance_id = my_data.instance_id) AND ((attribute_id)::text = ANY ('{code,title,city,country,att1, att2}'::text[])) AND (inserted <= CURRENT_TIMESTAMP))
 Filter: (((status)::text <> 'DELETED'::text) AND (tenant = ANY ('{1,2}'::bigint[])) AND (NOT (SubPlan 1)))
 Rows Removed by Filter: 0
 Buffers: shared hit=87700182
 SubPlan 1
 -> Seq Scan on my_data dc (cost=0.00..5943.88 rows=1 width=25) (actual time=0.078..17.028 rows=301 loops=48177)
 Filter: ((modified_record IS NOT NULL) AND (tenant = ANY ('{1,2}'::bigint[])) AND ((status)::text = 'DELETED'::text) AND (inserted <= CURRENT_TIMESTAMP) AND (CASE WHEN (bd.tenant = 1) THEN 1 WHEN (bd.tenant = 2) THEN 3 ELSE '-1'::integer END <= CASE WHEN (tenant = 1) THEN 1 WHEN (tenant = 2) THEN 3 ELSE '-1'::integer END))
 Rows Removed by Filter: 126102
 Buffers: shared hit=87484942
Planning:
 Buffers: shared hit=108
Planning Time: 1.098 ms
Execution Time: 825476.536 ms

Table and index definition:

CREATE TABLE "public"."my_data"
(
 instance_id bigint NOT NULL,
 attribute_id varchar(100) NOT NULL,
 inserted timestamp NOT NULL,
 modified_record timestamp,
 effective date NOT NULL,
 discontinue date,
 status varchar(20) NOT NULL,
 tenant bigint NOT NULL,
 value varchar(4000),
 CONSTRAINT my_data_pkey PRIMARY KEY (instance_id,attribute_id,inserted)
)
;
CREATE INDEX my_data_iid ON "public"."my_data"(instance_id)
;
CREATE INDEX my_data_val ON "public"."my_data"
(
 tenant,
 attribute_id,
 value
)
;

My questions are:

  • Why is the query so much slower on PostgreSQL than on Oracle or MariaDB?
  • What can we do to improve PostgreSQL performance? (This is the DBMS we want to focus on.)

2 Answers


To speed up the inner loop as much as possible, you could use this strange and very specialized index. Its partial predicate matches the constant filters of SubPlan 1, and the indexed CASE expression and inserted column match the remaining conditions, so each correlated probe can use an index scan instead of a sequential scan:

CREATE INDEX ON my_data (
 (CASE WHEN (tenant = 1) THEN 1 WHEN (tenant = 2) THEN 3 ELSE -1 END),
 inserted
)
WHERE modified_record IS NOT NULL
 AND tenant = ANY ('{1,2}'::bigint[])
 AND status = 'DELETED';

Changing NOT IN to NOT EXISTS as suggested by a_horse_with_no_name was the solution:

not in () in Postgres is not as efficiently optimized as in Oracle. Try rewriting that as a NOT EXISTS condition.

I replaced the subquery

AND (bd.INSTANCE_ID, bd.ATTRIBUTE_ID, bd.INSERTED) NOT IN (SELECT ... )

with

AND NOT EXISTS
    (SELECT 1
     FROM my_data dc
     WHERE dc.STATUS = 'DELETED'
       AND dc.INSTANCE_ID = bd.INSTANCE_ID
       AND dc.ATTRIBUTE_ID = bd.ATTRIBUTE_ID
       AND dc.MODIFIED_RECORD = bd.INSERTED
       AND dc.INSERTED <= CURRENT_TIMESTAMP
       AND dc.TENANT IN (1, 2)
       AND CASE WHEN bd.TENANT = 1 THEN 1 WHEN bd.TENANT = 2 THEN 3 ELSE -1 END
           <= CASE WHEN dc.TENANT = 1 THEN 1 WHEN dc.TENANT = 2 THEN 3 ELSE -1 END
    )

With this change alone, the query that took ~20 minutes now runs in a fraction of a second. Presumably this is because PostgreSQL does not turn a NOT IN into an anti-join: the plan above re-ran SubPlan 1 as a sequential scan for every outer row, whereas the NOT EXISTS form can be planned as an anti-join.

Before that I tried the filtered index, also suggested by the same user:

You might also want to try a filtered index on my_data(tenant_id, inserted) where status = 'DELETED' and modified_record is not null to make the sub-query for the NOT IN () faster

That (without rephrasing the query) brought the time down to ~15 seconds, so it is definitely something to keep in mind. But I tried the rephrased query on a test stage that did not have the index, and the NOT EXISTS alone was enough for the big speedup.
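For reference, the suggested filtered index could be created roughly like this (the index name is arbitrary; the comment says tenant_id, but the column in our table is tenant):

CREATE INDEX my_data_deleted_idx ON my_data (tenant, inserted)
WHERE status = 'DELETED'
  AND modified_record IS NOT NULL;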

