In Postgres, is it inefficient to perform an "INSERT... ON CONFLICT DO NOTHING" if >99.9% of the time the record exists and nothing is done?
E.g. let’s say I have a Node process which fetches 100k records which map to rows, of which 99,980 already exist. I could:
- Insert them all, but do nothing on uniqueness constraint violation
- Insert where not exists query
- Select all, do de-duplication "client" side, then insert only the 20 new ones
The first approach is easiest, but I wonder if it will be too inefficient for Postgres. I have read that Postgres still consumes serial-sequence values in such a "do nothing" case, but would it be OK if I skipped using a serial and used my unique field as the primary (string) key instead?
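For concreteness, here is a minimal sketch of the first approach. The table and column names come from the comments below; everything else (types, sample values) is assumed:

```sql
-- Hypothetical table: "ecosystem" + "package" form the natural key and
-- serve directly as the primary key, so no serial sequence is consumed
-- on conflicting inserts.
CREATE TABLE packages (
    ecosystem text NOT NULL,
    package   text NOT NULL,
    PRIMARY KEY (ecosystem, package)
);

-- Multi-row insert; rows that violate the primary key are silently skipped.
INSERT INTO packages (ecosystem, package)
VALUES
    ('npm',  'left-pad'),
    ('pypi', 'requests')
ON CONFLICT (ecosystem, package) DO NOTHING;
```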
- Could you perhaps show us your table structure and even some sample data? — Vérace, Sep 17, 2021 at 9:33
- The two key fields are "ecosystem" and "package", which are unique together. Then there would be an "id" column which would either be an auto-incrementing serial, or a concatenation of `${ecosystem}-${package}`. — rgareth, Sep 17, 2021 at 14:08
- Thanks for that - but it's best practice to put any relevant information into the question and then use the comments to inform the person who requested it. That way, the question remains the single source of truth. Also, there's a certain amount of cognitive dissonance in having to scroll down through the comments and then back up to the question itself. The actual table DDL would be great! :-) — Vérace, Sep 17, 2021 at 14:13
- So basically you have to compare 100k rows that are in the database against 100k rows that are client-side? So you have 100k rows that need to go through the network no matter what method you use? — Daniel Vérité, Sep 17, 2021 at 14:28
1 Answer
`INSERT ... ON CONFLICT` is as efficient as it can be, but the question is beside the point. It may well be that an `UPDATE` is faster, but that `UPDATE` wouldn't do the right thing for 0.1% of the rows, so it is not a solution.

Note: it is always possible to be faster if you don't have to be correct.
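In the question's setting, the whole batch can be sent in one statement and Postgres skips the duplicates itself. A sketch, assuming the `packages` table from the comments (the `unnest` arrays stand in for the client-side batch):

```sql
-- Hypothetical batch upsert: ship all rows, let Postgres drop the
-- ~99,980 that already exist.
INSERT INTO packages (ecosystem, package)
SELECT e, p
FROM unnest(ARRAY['npm', 'pypi'], ARRAY['left-pad', 'requests']) AS t(e, p)
ON CONFLICT (ecosystem, package) DO NOTHING
RETURNING ecosystem, package;  -- yields only the rows actually inserted
```

The `RETURNING` clause is a convenient side effect here: conflicting rows are excluded from it, so the client learns which of the ~20 rows were genuinely new without a separate query.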