
I have the following table (PostgreSQL 9.6 but could upgrade):

# \d annotation_annotation
                                     Table "public.annotation_annotation"
      Column       |          Type           |                             Modifiers
-------------------+-------------------------+---------------------------------------------------------------------
 id                | bigint                  | not null default nextval('annotation_annotation_id_seq'::regclass)
 release           | character varying(32)   | not null
 chromosome        | character varying(32)   | not null
 position          | integer                 | not null
 reference         | character varying(512)  | not null
 alternative       | character varying(512)  | not null
 database          | character varying(8)    |
 effect            | character varying(64)[] | not null
 gene_id           | character varying(64)   |
 transcript_id     | character varying(64)   |
 transcript_coding | boolean                 |
 hgvs_c            | character varying(512)  |
 hgvs_p            | character varying(512)  |
Indexes:
    "annotation_annotation_pkey" PRIMARY KEY, btree (id)
    "annotation_annotation_release_chromosome_posit_d102b713_uniq" UNIQUE CONSTRAINT, btree (release, chromosome, "position", reference, alternative, transcript_id)
    "annotation__release_3f254e_idx" btree (release, chromosome, "position", reference, alternative, gene_id)

I'm running bulk insertions with ON CONFLICT DO NOTHING of tens of millions of rows (loaded from files) into the table. I'm expecting a high proportion of duplicates (>90%). The table is append-only. Using pg_activity, I can see that PostgreSQL spends more time reading than writing (as I would expect when the duplicate rate is high).
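
For illustration, the statements are roughly of this shape (the staging source table here is a placeholder, not the real loader):

INSERT INTO annotation_annotation
    (release, chromosome, "position", reference, alternative, transcript_id,
     database, effect, gene_id, transcript_coding, hgvs_c, hgvs_p)
SELECT release, chromosome, "position", reference, alternative, transcript_id,
       database, effect, gene_id, transcript_coding, hgvs_c, hgvs_p
FROM staging
ON CONFLICT (release, chromosome, "position", reference, alternative, transcript_id)
    DO NOTHING;  -- duplicates are silently skipped via the unique constraint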

What would be a good way of tuning "upsert" performance here:

  • Would partitioning with PostgreSQL 11 help? I could provide an additional integer "chromosome_no" column that could be used for hash partitioning.
  • Is it possible to use bloom indexes to speed up "upsert" performance with unique constraints? (EDIT: I realize that Bloom filters will not help here because they err in the wrong direction: they can produce false positives, so candidate duplicates still have to be verified against the real index.)
asked May 7, 2019 at 16:53

1 Answer


Partitioning the table would only help if the data in the files you are trying to insert are also partitioned using the same scheme.
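
For reference, a minimal sketch of what declarative hash partitioning in PostgreSQL 11 could look like here (the chromosome_no column and the partition count are taken as assumptions from the question, not tested advice). Note that in PostgreSQL 11 a unique constraint on a partitioned table must include the partition key columns:

-- Hypothetical sketch: hash-partitioned variant of the table.
-- The unique constraint must include the partition key (chromosome_no).
CREATE TABLE annotation_annotation (
    id            bigserial,
    release       character varying(32)  NOT NULL,
    chromosome    character varying(32)  NOT NULL,
    chromosome_no integer                NOT NULL,
    "position"    integer                NOT NULL,
    reference     character varying(512) NOT NULL,
    alternative   character varying(512) NOT NULL,
    transcript_id character varying(64),
    -- ... remaining columns as in the original table ...
    UNIQUE (release, chromosome, "position", reference, alternative,
            transcript_id, chromosome_no)
) PARTITION BY HASH (chromosome_no);

CREATE TABLE annotation_annotation_p0 PARTITION OF annotation_annotation
    FOR VALUES WITH (MODULUS 8, REMAINDER 0);
-- ... repeat for remainders 1 through 7 ...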

If you sort the data to be inserted into the same order as the unique index, that should help when the bottleneck is reading the index. But if the table is not clustered in the same order as the index and the bottleneck is reading the table itself, it might not help much.
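
As a rough sketch (the staging table is hypothetical, and the column list is trimmed to the unique-key columns plus the NOT NULL effect column), sorting at insert time could look like this; alternatively, sort the files themselves before loading:

-- Feed rows to the target in unique-index order so the btree is walked
-- roughly sequentially rather than at random.
INSERT INTO annotation_annotation
    (release, chromosome, "position", reference, alternative, transcript_id, effect)
SELECT release, chromosome, "position", reference, alternative, transcript_id, effect
FROM staging
ORDER BY release, chromosome, "position", reference, alternative, transcript_id
ON CONFLICT DO NOTHING;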

For this type of task, I would usually deduplicate outside the database. You might also look into using something like file_fdw to map the new data into the database, and then run a query that inserts only the rows that are not already present.
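
A sketch of the file_fdw approach (the server name, file path, format, and trimmed column list are placeholders):

CREATE EXTENSION IF NOT EXISTS file_fdw;
CREATE SERVER import_srv FOREIGN DATA WRAPPER file_fdw;

-- Map one input file; column list trimmed to the unique-key columns
-- plus the NOT NULL effect column.
CREATE FOREIGN TABLE annotation_import (
    release       character varying(32),
    chromosome    character varying(32),
    "position"    integer,
    reference     character varying(512),
    alternative   character varying(512),
    transcript_id character varying(64),
    effect        character varying(64)[]
) SERVER import_srv
  OPTIONS (filename '/path/to/annotations.csv', format 'csv');

-- Collapse duplicates within the file, skip rows already in the table,
-- and let ON CONFLICT catch anything that remains.
INSERT INTO annotation_annotation
    (release, chromosome, "position", reference, alternative, transcript_id, effect)
SELECT DISTINCT ON (release, chromosome, "position", reference, alternative, transcript_id)
       release, chromosome, "position", reference, alternative, transcript_id, effect
FROM annotation_import i
WHERE NOT EXISTS (
    SELECT 1 FROM annotation_annotation a
    WHERE (a.release, a.chromosome, a."position",
           a.reference, a.alternative, a.transcript_id)
        = (i.release, i.chromosome, i."position",
           i.reference, i.alternative, i.transcript_id)
)
ON CONFLICT DO NOTHING;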

answered May 7, 2019 at 18:14
