I have the following table (PostgreSQL 9.6 but could upgrade):
```
# \d annotation_annotation
                               Table "public.annotation_annotation"
      Column       |          Type           |                             Modifiers
-------------------+-------------------------+---------------------------------------------------------------------
 id                | bigint                  | not null default nextval('annotation_annotation_id_seq'::regclass)
 release           | character varying(32)   | not null
 chromosome        | character varying(32)   | not null
 position          | integer                 | not null
 reference         | character varying(512)  | not null
 alternative       | character varying(512)  | not null
 database          | character varying(8)    |
 effect            | character varying(64)[] | not null
 gene_id           | character varying(64)   |
 transcript_id     | character varying(64)   |
 transcript_coding | boolean                 |
 hgvs_c            | character varying(512)  |
 hgvs_p            | character varying(512)  |
Indexes:
    "annotation_annotation_pkey" PRIMARY KEY, btree (id)
    "annotation_annotation_release_chromosome_posit_d102b713_uniq" UNIQUE CONSTRAINT, btree (release, chromosome, "position", reference, alternative, transcript_id)
    "annotation__release_3f254e_idx" btree (release, chromosome, "position", reference, alternative, gene_id)
```
I'm running bulk insertions with ON CONFLICT DO NOTHING of tens of millions of rows from input files into this table. I expect a very high rate of duplicates (>90%), and the table is append-only. Using pg_activity, I can see that PostgreSQL spends more time reading than writing (which is what I would expect if the duplicate rate is high).
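For reference, this is the shape of the statement I'm issuing (the literal values are just placeholders):

```sql
-- Placeholder values; the conflict target matches the unique constraint on
-- (release, chromosome, "position", reference, alternative, transcript_id).
INSERT INTO annotation_annotation
    (release, chromosome, "position", reference, alternative, effect,
     gene_id, transcript_id)
VALUES
    ('GRCh37', '1', 12345, 'A', 'G', '{missense_variant}', 'GENE1', 'TX1')
ON CONFLICT (release, chromosome, "position", reference, alternative,
             transcript_id) DO NOTHING;
```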
What would be a good way of tuning "upsert" performance here?

- Would partitioning with PostgreSQL 11 help? I could add an integer "chromosome_no" column to use as a hash partitioning key (see the sketch after this list).
- Is it possible to use bloom indexes to speed up "upsert" performance with unique constraints? (EDIT: I realize that bloom filters will not help, because they err in the wrong direction.)
1 Answer
Partitioning the table would only help if the data in the files you are trying to insert is partitioned by the same scheme, so that each batch only has to touch a few partitions and their indexes.
If you sort the data to be inserted into the same order as the columns of the unique index, that should help if the bottleneck is reading index pages (see the sketch below). But if the table itself is not clustered in the same order as the index, and the bottleneck is reading from the table, this might not help much.
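A minimal sketch of that idea, assuming the batch has already been loaded into a staging table called new_rows (the name and the shortened column list are assumptions):

```sql
-- Sort the batch by the unique index's column order before inserting, so the
-- index is probed in roughly sequential key order.
INSERT INTO annotation_annotation
    (release, chromosome, "position", reference, alternative, effect,
     gene_id, transcript_id)
SELECT release, chromosome, "position", reference, alternative, effect,
       gene_id, transcript_id
FROM new_rows
ORDER BY release, chromosome, "position", reference, alternative, transcript_id
ON CONFLICT (release, chromosome, "position", reference, alternative,
             transcript_id) DO NOTHING;
```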
For this type of task, I would usually deduplicate outside the database. You might also look into using something like file_fdw to map the new data into the database, and then query it and insert only the rows that are actually new; a rough sketch follows.
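Something along these lines, with the server name, file path, format, and shortened column list all being assumptions:

```sql
CREATE EXTENSION IF NOT EXISTS file_fdw;
CREATE SERVER import_srv FOREIGN DATA WRAPPER file_fdw;

-- Map one batch file (tab-separated here) as a foreign table.
CREATE FOREIGN TABLE annotation_import (
    release       varchar(32),
    chromosome    varchar(32),
    "position"    integer,
    reference     varchar(512),
    alternative   varchar(512),
    effect        varchar(64)[],
    gene_id       varchar(64),
    transcript_id varchar(64)
) SERVER import_srv
  OPTIONS (filename '/path/to/batch.tsv', format 'text');

-- Deduplicate within the file (which duplicate survives is arbitrary here),
-- skip keys that already exist in the target table, and keep ON CONFLICT as
-- a safety net for anything that remains.
INSERT INTO annotation_annotation
    (release, chromosome, "position", reference, alternative, effect,
     gene_id, transcript_id)
SELECT DISTINCT ON (release, chromosome, "position", reference, alternative,
                    transcript_id)
       release, chromosome, "position", reference, alternative, effect,
       gene_id, transcript_id
FROM annotation_import i
WHERE NOT EXISTS (
    SELECT 1
    FROM annotation_annotation a
    WHERE a.release = i.release
      AND a.chromosome = i.chromosome
      AND a."position" = i."position"
      AND a.reference = i.reference
      AND a.alternative = i.alternative
      AND a.transcript_id = i.transcript_id
)
ON CONFLICT (release, chromosome, "position", reference, alternative,
             transcript_id) DO NOTHING;
```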