I've got some duplicated rows in a table called ja_jobs:
To find the duplicated rows, I'm running this query:
select * from ja_jobs WHERE clientid = 33731 AND creatortype = 'legacyrec' AND deleted = false AND time_job IS NOT NULL AND (time_job,recurrenceid) IN (
select time_job,recurrenceid FROM ja_jobs WHERE clientid = 33731 GROUP BY time_job,recurrenceid HAVING count(*) > 1
)
The query finds duplicated rows by time_job and recurrenceid.
In the following example:
You can see that the jobs are duplicated, but we have three versions of the same job (just look at the modified_date column).
I need to delete the newer jobs and keep only the OLDEST one.
DELETE from ja_jobs WHERE id IN (14754912,14792799);
How can I do that? How can I select all the newest jobs and just delete them?
Here's what I've got so far:
select min(id) over (partition by time_job,recurrenceid,time_arrival order by created_date) as min_id into junk.test_table FROM ja_jobs
WHERE clientid = 33731 AND creatortype = 'legacyrec' AND deleted = false AND (time_job,recurrenceid) IN (
select time_job,recurrenceid FROM ja_jobs WHERE clientid = 33731 GROUP BY time_job,recurrenceid HAVING count(*) > 1
)
But in junk.test_table I get duplicated "min_id" values.
1 Answer
You're mixing grouping criteria in two places while creating junk.test_table: first, the GROUP BY subselect has fewer conditions in its WHERE clause than the outer query, and second, the PARTITION BY has one extra partitioning field (time_arrival).
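One way to keep the criteria consistent is to number the rows inside each (time_job, recurrenceid) group and treat everything after the first row as a duplicate. This is only a sketch: it assumes that created_date (with id as a tie-breaker) decides which job is the oldest, and the names rn and numbered are purely illustrative.

```sql
-- Sketch: number the rows of each duplicate group, oldest first.
-- Assumes created_date (tie-broken by id) decides which job is the oldest.
SELECT id, time_job, recurrenceid, created_date
FROM (
    SELECT id, time_job, recurrenceid, created_date,
           row_number() OVER (PARTITION BY time_job, recurrenceid
                              ORDER BY created_date, id) AS rn
    FROM ja_jobs
    WHERE clientid = 33731
      AND creatortype = 'legacyrec'
      AND deleted = false
      AND time_job IS NOT NULL
      -- PARTITION BY groups NULLs together, whereas (a, b) IN (...) never
      -- matches them, so exclude NULLs to mirror the original query.
      AND recurrenceid IS NOT NULL
) numbered
WHERE rn > 1;  -- rn = 1 is the row to keep; the rest are the newer copies
```

Because the filter and the partitioning are defined once, in the same subquery, the two cannot drift apart the way they do in the original attempt.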
If you can assume that lower ids belong to older jobs, then you can identify your duplicates by joining the grouped result back to the table itself, like this:
SELECT jd.dup_group_no, j.id = jd.id AS to_keep, j.id INTO junk.test_table
FROM (
-- one row per duplicate group; MIN(id) is the row to keep
SELECT time_job, recurrenceid, clientid, creatortype, deleted, MIN(id) AS id, row_number() over () AS dup_group_no
FROM ja_jobs
WHERE clientid = 33731 AND creatortype = 'legacyrec' AND deleted = false
GROUP BY time_job, recurrenceid, clientid, creatortype, deleted
HAVING count(*) > 1
) jd
JOIN ja_jobs j USING (time_job, recurrenceid, clientid, creatortype, deleted);
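Before deleting anything, it may be worth checking what ended up in junk.test_table. A minimal sketch, assuming the table was created by the query above and therefore has the columns dup_group_no, to_keep and id (the output names kept and to_delete are just illustrative):

```sql
-- Every group should report kept = 1 and to_delete >= 1.
SELECT dup_group_no,
       sum(CASE WHEN to_keep THEN 1 ELSE 0 END) AS kept,
       sum(CASE WHEN to_keep THEN 0 ELSE 1 END) AS to_delete
FROM junk.test_table
GROUP BY dup_group_no
ORDER BY dup_group_no;
```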
If there is no guaranteed correlation between lower ids and earlier creation times, the query is trickier:
SELECT jm.dup_group_no, j.id = jm.id AS to_keep, j.id INTO junk.test_table
FROM (
-- DISTINCT ON keeps the first row of each group in ORDER BY order, i.e. the oldest by created_date.
-- row_number() is evaluated before DISTINCT ON, so dup_group_no is unique per group but not consecutive.
SELECT DISTINCT ON (jd.time_job, jd.recurrenceid, jd.clientid, jd.creatortype, jd.deleted) jd.time_job, jd.recurrenceid, jd.clientid, jd.creatortype, jd.deleted, jm.id, row_number() over () AS dup_group_no
FROM (
SELECT time_job, recurrenceid, clientid, creatortype, deleted
FROM ja_jobs
WHERE clientid = 33731 AND creatortype = 'legacyrec' AND deleted = false
GROUP BY time_job, recurrenceid, clientid, creatortype, deleted
HAVING count(*) > 1
) jd
JOIN ja_jobs jm USING (time_job, recurrenceid, clientid, creatortype, deleted)
ORDER BY jd.time_job, jd.recurrenceid, jd.clientid, jd.creatortype, jd.deleted, jm.created_date, jm.id
) jm
JOIN ja_jobs j USING (time_job, recurrenceid, clientid, creatortype, deleted);
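Once junk.test_table looks right, the rows flagged to_keep = false can be removed. This is a sketch rather than a tested script: it assumes junk.test_table was populated by one of the queries above, and it leaves the final COMMIT to you so the reported row count can be checked first.

```sql
BEGIN;

-- Remove every duplicate except the one flagged to_keep in each group.
DELETE FROM ja_jobs
WHERE id IN (SELECT id FROM junk.test_table WHERE NOT to_keep);

-- Check the row count reported by DELETE, then run exactly one of:
-- COMMIT;
-- ROLLBACK;
```

Note that both queries above write to junk.test_table with SELECT ... INTO, so the table has to be dropped before running the second one.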