I have the following table (PostgreSQL 9.6 but could upgrade):
```
# \d annotation_annotation
                               Table "public.annotation_annotation"
      Column       |          Type           |                             Modifiers
-------------------+-------------------------+---------------------------------------------------------------------
 id                | bigint                  | not null default nextval('annotation_annotation_id_seq'::regclass)
 release           | character varying(32)   | not null
 chromosome        | character varying(32)   | not null
 position          | integer                 | not null
 reference         | character varying(512)  | not null
 alternative       | character varying(512)  | not null
 database          | character varying(8)    |
 effect            | character varying(64)[] | not null
 gene_id           | character varying(64)   |
 transcript_id     | character varying(64)   |
 transcript_coding | boolean                 |
 hgvs_c            | character varying(512)  |
 hgvs_p            | character varying(512)  |
Indexes:
    "annotation_annotation_pkey" PRIMARY KEY, btree (id)
    "annotation_annotation_release_chromosome_posit_d102b713_uniq" UNIQUE CONSTRAINT, btree (release, chromosome, "position", reference, alternative, transcript_id)
    "annotation__release_3f254e_idx" btree (release, chromosome, "position", reference, alternative, gene_id)
```
I'm running bulk insertions with ON CONFLICT DO NOTHING of tens of millions of rows from input files into this table. I expect a very high rate of duplicates (>90%), and the table is append-only. Using pg_activity, I can see that PostgreSQL spends more time reading than writing (which is what I would expect if the duplicate rate is high).
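For reference, this is the shape of the statement I'm issuing (the literal values are just placeholders):

```sql
-- Placeholder values; the conflict target matches the unique constraint on
-- (release, chromosome, "position", reference, alternative, transcript_id).
INSERT INTO annotation_annotation
    (release, chromosome, "position", reference, alternative, effect,
     gene_id, transcript_id)
VALUES
    ('GRCh37', '1', 12345, 'A', 'G', '{missense_variant}', 'GENE1', 'TX1')
ON CONFLICT (release, chromosome, "position", reference, alternative,
             transcript_id) DO NOTHING;
```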
What would be a good way of tuning "upsert" performance here?

- Would partitioning with PostgreSQL 11 help? I could add an integer "chromosome_no" column to use as a hash partitioning key (see the sketch after this list).
- Is it possible to use bloom indexes to speed up "upsert" performance with unique constraints? (EDIT: I realize that bloom filters will not help, because they err in the wrong direction.)
1 Answer
Partitioning the table would only help if the data in the files you are trying to insert is partitioned by the same scheme, so that each batch only has to touch a few partitions and their indexes.
If you sort the data to be inserted into the same order as the columns of the unique index, that should help if the bottleneck is reading index pages (see the sketch below). But if the table itself is not clustered in the same order as the index, and the bottleneck is reading from the table, this might not help much.
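A minimal sketch of that idea, assuming the batch has already been loaded into a staging table called new_rows (the name and the shortened column list are assumptions):

```sql
-- Sort the batch by the unique index's column order before inserting, so the
-- index is probed in roughly sequential key order.
INSERT INTO annotation_annotation
    (release, chromosome, "position", reference, alternative, effect,
     gene_id, transcript_id)
SELECT release, chromosome, "position", reference, alternative, effect,
       gene_id, transcript_id
FROM new_rows
ORDER BY release, chromosome, "position", reference, alternative, transcript_id
ON CONFLICT (release, chromosome, "position", reference, alternative,
             transcript_id) DO NOTHING;
```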
For this type of task, I would usually deduplicate outside the database. You might also look into using something like file_fdw to map the new data into the database, and then query it and insert only the rows that are actually new; a rough sketch follows.
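Something along these lines, with the server name, file path, format, and shortened column list all being assumptions:

```sql
CREATE EXTENSION IF NOT EXISTS file_fdw;
CREATE SERVER import_srv FOREIGN DATA WRAPPER file_fdw;

-- Map one batch file (tab-separated here) as a foreign table.
CREATE FOREIGN TABLE annotation_import (
    release       varchar(32),
    chromosome    varchar(32),
    "position"    integer,
    reference     varchar(512),
    alternative   varchar(512),
    effect        varchar(64)[],
    gene_id       varchar(64),
    transcript_id varchar(64)
) SERVER import_srv
  OPTIONS (filename '/path/to/batch.tsv', format 'text');

-- Deduplicate within the file (which duplicate survives is arbitrary here),
-- skip keys that already exist in the target table, and keep ON CONFLICT as
-- a safety net for anything that remains.
INSERT INTO annotation_annotation
    (release, chromosome, "position", reference, alternative, effect,
     gene_id, transcript_id)
SELECT DISTINCT ON (release, chromosome, "position", reference, alternative,
                    transcript_id)
       release, chromosome, "position", reference, alternative, effect,
       gene_id, transcript_id
FROM annotation_import i
WHERE NOT EXISTS (
    SELECT 1
    FROM annotation_annotation a
    WHERE a.release = i.release
      AND a.chromosome = i.chromosome
      AND a."position" = i."position"
      AND a.reference = i.reference
      AND a.alternative = i.alternative
      AND a.transcript_id = i.transcript_id
)
ON CONFLICT (release, chromosome, "position", reference, alternative,
             transcript_id) DO NOTHING;
```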