
I have a products table where I insert around 150,000 records a day. Most of them are redundant, but I need to keep them because of the new expiration date. I get product feeds from about 5 out of 30 vendors a day. Each vendor has around 35,000 unique products. No product can belong to more than one vendor.

CREATE TABLE vendor_prices (
 id serial PRIMARY KEY,
 vendor integer NOT NULL,
 sku character varying(25) NOT NULL,
 category_name character varying(100) NOT NULL,
 price numeric(8,5) NOT NULL,
 effective_date timestamp without time zone,
 expiration_date timestamp without time zone DEFAULT (now() + '1 year'::interval)
);

I'm trying to delete irrelevant records where there was no price change and it is no longer the last update for said product, e.g.:

 effective_date price
 '2015-05-01'   1ドル.99
 '2015-05-02'   1ドル.99  delete
 '2015-05-03'   1ドル.59
 '2015-05-04'   1ドル.99
 '2015-05-05'   1ドル.99  delete
 '2015-05-06'   1ドル.99  keep for new expiration date

So after each load (I figured it'll be easier to do one vendor at a time) I want to do some kind of delete. Here is the long, non-performing solution that I came up with.

CREATE OR REPLACE FUNCTION remove_vendor_price_dupes(_vendor integer)
 RETURNS integer AS
$BODY$
BEGIN
 -- Delete Redundant prices
 delete from vendor_prices
 where id in (
 select id from (
 select vp1.id, vp1.vendor, vp1.sku, vp1.price, vp1.effective_date, vp1.expiration_date
 from vendor_prices vp1 
 inner join (
 select vendor, sku, price from vendor_prices
 where vendor = _vendor
 group by vendor, sku, price 
 ) vp2
 on vp1.vendor = vp2.vendor and vp1.sku = vp2.sku and vp1.price = vp2.price
 where vp1.vendor = _vendor
 ) dupe
 -- fetch the irrelevant record
 WHERE (select a.effective_date from vendor_prices a
 where vendor = _vendor 
 and a.price = dupe.price and a.sku = dupe.sku and dupe.effective_date > a.effective_date
 -- but make sure there's no price change in-between(
 and (select b.effective_date from vendor_prices b 
 where vendor = _vendor 
 and b.sku = dupe.sku and b.effective_date < dupe.effective_date and b.effective_date > a.effective_date limit 1) IS NULL
 limit 1
 ) IS NOT NULL
 -- and that this is not the last update on said product, otherwise we'll keep it for expiration_date
 and ( select c.effective_date from vendor_prices c 
 where vendor = _vendor
 and c.sku = dupe.sku 
 and c.effective_date > dupe.effective_date limit 1
 ) IS NOT NULL
 ); 
 return 0;
END;
$BODY$
LANGUAGE plpgsql;

This function ran for a few hours so I killed it. The table has around 5 million records. I tried all kinds of different indexes and combo indexes but nothing seems to help. There may be other inserts and deletes while I'm running this function.

Running PostgreSQL 9.3.4 on Solaris 11.2.
I have plenty of RAM and disk space.

asked May 18, 2015 at 21:42
  • 5M rows, ok. How many different vendors? How many different products? So how many rows per product (follows logically)? Did I get this right: 1 row per product per day? Also: there are two questions here. Make it one question. You can always ask the second one in a separate question if it's still relevant. Commented May 18, 2015 at 22:08
  • Around 30 vendors and around 30/40,000 products. Whenever I get a feed from a vendor they include all their products Commented May 18, 2015 at 22:22
  • And I'll remove the second query. Commented May 18, 2015 at 22:22
  • 40k products per vendor or total? How often is "Whenever"? So how many rows per product? And can one product have many vendors or does each belong to exactly one vendor? Please be clear and put all the relevant information into the question, not into comments. Commented May 18, 2015 at 22:26
  • Also important: Do you have concurrent access? Concurrent write access? Plenty of RAM and disk space? Can you afford to create a new table? If the percentage of rows to delete is high enough, rewriting the table may be your best option. You need to provide clear information. The description reads: I'm trying to delete irrelevant records where there was no price change and it's no longer the last update for said product. But the function is restricted to a given vendor and an array of dates. Please make the explanation match the code. Commented May 18, 2015 at 22:35

1 Answer

Core feature is the window function lag().
Also pay special attention to avoid deadlocks and race conditions with concurrent deletes and inserts (which can affect which rows to delete!):

CREATE OR REPLACE FUNCTION remove_vendor_price_dupes(_vendor int)
 RETURNS integer AS
$func$
DECLARE
 del_ct int;
BEGIN
 -- this may or may not be necessary:
 -- lock rows to avoid race conditions with concurrent deletes
 PERFORM 1
 FROM vendor_prices
 WHERE vendor = _vendor
 ORDER BY sku, effective_date, id -- guarantee row locks in consistent order
 FOR UPDATE;
 -- delete redundant prices
 DELETE FROM vendor_prices v
 USING (
 SELECT id
 , price = lag(price) OVER w -- same as last row
 AND (lead(id) OVER w) IS NOT NULL AS del -- not last row
 FROM vendor_prices
 WHERE vendor = _vendor
 WINDOW w AS (PARTITION BY sku ORDER BY effective_date, id)
 ) d
 WHERE v.id = d.id
 AND d.del;
 GET DIAGNOSTICS del_ct = ROW_COUNT; -- optional:
 RETURN del_ct; -- return number of deleted rows
END
$func$ LANGUAGE plpgsql;

Call:

SELECT remove_vendor_price_dupes(1);

Notes

  • The current version of the 9.3 major release is 9.3.6. The project recommends that ...

    all users run the latest available minor release for whatever major version is in use.

  • A multicolumn index on (vendor, sku, effective_date, id) would be perfect for this - in this particular order. But Postgres can combine indexes rather efficiently, too.
    It might pay to add the otherwise irrelevant price as the last item of the index to get index-only scans out of this. You'll have to test.
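    For reference, such an index could be created like this (index names are hypothetical; matching the delete query's filter on vendor and the window ordering on sku, effective_date, id):

     CREATE INDEX vendor_prices_vendor_sku_date_idx
     ON vendor_prices (vendor, sku, effective_date, id);
     
     -- variant with price appended, aiming for index-only scans; benchmark both:
     -- CREATE INDEX vendor_prices_covering_idx
     -- ON vendor_prices (vendor, sku, effective_date, id, price);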

  • Since you have concurrent deletes it may be a good idea to run a separate delete per vendor to reduce the potential for race conditions and deadlocks. Since there are only a few vendors, this seems like a reasonable partitioning. (Many tiny calls would be comparatively slow.)

  • I am running a separate SELECT (PERFORM in plpgsql, since we do not use the result) because the row locking clause FOR UPDATE cannot be used together with window functions. Don't let the keyword mislead you, this is not just for updates. I am locking all rows for the given vendor, since the result depends on all rows. Concurrent reads are not impaired, only concurrent writes have to wait until we are done. That's another reason why deleting rows for one vendor at a time in a separate transaction should be best.

  • sku is unique per product, so we can PARTITION BY it.

  • ORDER BY effective_date, id: your first version of the question included code for duplicate rows, so I added id to ORDER BY as additional tie breaker. This way it works for duplicates on (sku, effective_date) as well.

  • To preserve the last row for each set: AND (lead(id) OVER w) IS NOT NULL. Reusing the same window for lead() is cheap - independent of the added explicit WINDOW clause - that's just syntax shorthand for convenience.
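    To see the logic in isolation, here is a standalone sketch using the sample data from the question (note that the first row per window yields NULL for del, since lag() has no previous row; WHERE ... AND d.del in the delete query filters that out just like false):

     SELECT effective_date, price
          , price = lag(price) OVER w              -- same as previous row
            AND (lead(price) OVER w) IS NOT NULL   -- and not the last row
            AS del
     FROM  (VALUES
             ('2015-05-01'::date, 1.99)
           , ('2015-05-02', 1.99)
           , ('2015-05-03', 1.59)
           , ('2015-05-04', 1.99)
           , ('2015-05-05', 1.99)
           , ('2015-05-06', 1.99)
           ) t(effective_date, price)
     WINDOW w AS (ORDER BY effective_date);

    This flags exactly the rows marked delete in the question: '2015-05-02' and '2015-05-05'.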

  • I am locking rows in the same order: ORDER BY sku, effective_date, id. Make sure that concurrent DELETEs operate in the same order to avoid deadlocks. If all other transactions delete no more than a single row within the same transaction, there cannot be deadlocks and you don't need the row locking at all.

  • If concurrent INSERTs could lead to a different result (make different rows obsolete), you have to lock the whole table in EXCLUSIVE mode instead to avoid race conditions:

    LOCK TABLE vendor_prices IN EXCLUSIVE MODE;
    

    Do that only if it's necessary. It blocks all concurrent write access.
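    If you do, take the lock and call the function in the same transaction, since table locks are only released at the end of the transaction:

     BEGIN;
     LOCK TABLE vendor_prices IN EXCLUSIVE MODE;
     SELECT remove_vendor_price_dupes(1);
     COMMIT;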

  • I am returning the number of rows deleted, but that's totally optional. You might as well return nothing and declare the function as RETURNS void.

answered May 18, 2015 at 23:23
  • Thanks @Erwin, nice and short. One thing though: it fails to take care of this part: -- and that this is not the last update on said product, otherwise we'll keep it for expiration_date. So I added a condition AND v.effective_date < (select effective_date from vendor_prices where vendor = _vendor and sku = v.sku order by effective_date desc limit 1) Commented May 19, 2015 at 14:54
  • Also in terms of performance, it took more than 100 seconds to clean the table(for one vendor) the first run and after that around 2.5 seconds for each call. I guess my question is if this is the most efficient it can get or I'm still missing something. Commented May 19, 2015 at 15:04
  • Actually, I take that back, subsequent calls take anywhere between 5 and 12 seconds depending on how many products for given vendor. Also without my added condition execution time is exactly half. Commented May 19, 2015 at 15:41
  • @user1157575: added the missing condition in a cheaper fashion. Since we already operate on a window, using the same for another window function is very cheap: AND (lead(id) OVER w) IS NOT NULL. Consider the updated query. It should be substantially faster than what you had - compare apples to apples, so my function without explicit locking. Commented May 19, 2015 at 16:39
  • Performance increased by 50% on subsequent calls and 30% on initial call. Commented May 19, 2015 at 17:01
