I am trying to report on duplicate records in a single table which has a unique key of app_cao_number. The duplicates occur if either: 1. The Passport field is duplicated; 2. The ID field is duplicated, or; 3. The Surname+FirstName are duplicated.
I can do this easily enough with three passes of the table using ORDER BY. But I am hoping to use a single SELECT statement, with subqueries, to do the job.
Starting with just finding duplicate IDs I have the following statement:
SELECT app_cao_number, app_id,
(SELECT app_id FROM people p2
WHERE p2.app_id IS NOT null
AND p2.app_id <> ''
AND p1.app_cao_number <> p2.app_cao_number
AND p1.app_id = p2.app_id
GROUP BY p2.app_id) AS DupId
FROM people p1
WHERE app_id IS NOT null
AND app_id <> ''
This appears to give me the results I want, but it also includes rows that have a null DupId, despite my attempts to exclude blank and null values in the SELECT statement. Once this works, I should be able to expand it to include the passport and name checks.
Please can someone explain why I have the following data output with nulls in the DupId column? Thank you.
Further: I thought it might be the GROUP BY clause, so I replaced it with DISTINCT (below), but this gave the same result.
(SELECT DISTINCT p2.app_id FROM people p2
WHERE p2.app_id IS NOT null
AND p2.app_id <> ''
AND p1.app_cao_number <> p2.app_cao_number
AND p1.app_id = p2.app_id
) AS DupId
2 Answers
Look at this model. Do you need something like this?
create table test (id int, value1 int, value2 int)
insert into test values (1,11,21), (2,12,22), (3,13,23), (4,14,24), (5,12,24), (6,16,26), (7,17,24), (8,18,28)
8 rows affected
select t1.id id,
       t2.id dup_id,
       case when t1.value1 = t2.value1 then 'value 1'
            when t1.value2 = t2.value2 then 'value 2'
            else 'some error'
       end dup_field,
       case when t1.value1 = t2.value1 then t1.value1 :: text
            when t1.value2 = t2.value2 then t1.value2 :: text
            else 'some error'
       end dup_value
from test t1, test t2
where t1.id < t2.id
  and ( t1.value1 = t2.value1
     or t1.value2 = t2.value2 )

id | dup_id | dup_field | dup_value
-: | -----: | :-------- | :--------
2 | 5 | value 1 | 12
4 | 5 | value 2 | 24
4 | 7 | value 2 | 24
5 | 7 | value 2 | 24
-
Thanks Akina. I see what you mean about fiddle now - I've never used/seen that before. I'll try to use it in future. I've tried your solution - but cut it down to simplify it - and it still produces a lot of columns that are empty. Here is my code:
select t1.app_id id,
       t2.app_id dup_id,
       case when t1.app_id = t2.app_id then 'value 1'
            else 'some error'
       end dup_field
from sa_appl_contacts t1, sa_appl_contacts t2
where t1.app_cao_number < t2.app_cao_number
  and ( t1.app_id = t2.app_id )
– Paul Pritchard, Dec 6, 2019 at 8:13
@PaulPritchard Create a fiddle with YOUR sample data and add the link to your question text, accompanied by the desired result for that data. "Here is my code": if you need to search for duplicates in one field only, then you do not need CASE at all. "it still produces a lot of columns that are empty": your code cannot give NULL in any field. When creating the fiddle, try to use example data that reproduces the empty-field output for your query. – Akina, Dec 6, 2019 at 8:43
-
Hi @Akina, here is my fiddle: dbfiddle.uk/… This now produces 'twice' the result I'm looking for. I haven't worried about the nulls here yet; I just want to get the output right. – Paul Pritchard, Dec 6, 2019 at 9:11
-
On looking further at my example, I see that it now adds a row for each duplicate. I just need a single row. – Paul Pritchard, Dec 6, 2019 at 9:18
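One way to get a single row per record is to join in both directions and collapse the matches with an aggregate. This is only a sketch, assuming PostgreSQL and a made-up `people` table with text columns. The sample rows and the `dup_of` alias are hypothetical; only `app_cao_number` and `app_id` come from the question.

```sql
-- Hypothetical sample data; only app_cao_number and app_id come from the question.
CREATE TEMP TABLE people (app_cao_number int PRIMARY KEY, app_id text);
INSERT INTO people VALUES (1, 'A1'), (2, 'A1'), (3, 'A1'), (4, 'B7'), (5, '');

-- One output row per record that has at least one duplicate;
-- dup_of lists the keys of the other records sharing the same app_id.
SELECT t1.app_cao_number,
       t1.app_id,
       array_agg(t2.app_cao_number ORDER BY t2.app_cao_number) AS dup_of
FROM people t1
JOIN people t2
  ON t2.app_id = t1.app_id
 AND t2.app_cao_number <> t1.app_cao_number  -- never match a row with itself
WHERE t1.app_id <> ''                        -- also drops NULL: NULL <> '' is not true
GROUP BY t1.app_cao_number, t1.app_id;
```

With this data, records 1, 2 and 3 each appear once with the other two keys in `dup_of`. Record 4 has no partner and is dropped by the inner join, and record 5 is excluded by the blank test.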
-
@PaulPritchard "here is my fiddle": It seems that you want something strange. Why do you output `t1.passport ppt, t2.passport dup_ppt` when you require `t1.passport = t2.passport` in WHERE? They are always equal... "This now produces 'twice' the result I'm looking for": Be more precise. WHAT value in WHAT field are you talking about? – Akina, Dec 6, 2019 at 9:28
If the subquery in the SELECT list returns no row, its value is NULL. You seem to expect that the outer row would then be excluded from the result, but that is not the case: only the outer WHERE clause decides which rows are returned.
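A small sketch of that behaviour, assuming PostgreSQL and a made-up `people` table whose `app_id` is a text column (the sample rows are hypothetical). The first query keeps the unmatched row with a NULL `dup_id`. The second moves the same test into the outer WHERE clause with EXISTS, so unmatched rows disappear.

```sql
-- Hypothetical sample data; only app_cao_number and app_id come from the question.
CREATE TEMP TABLE people (app_cao_number int PRIMARY KEY, app_id text);
INSERT INTO people VALUES (1, 'A1'), (2, 'A1'), (3, 'B7'), (4, ''), (5, NULL);

-- Scalar subquery in the SELECT list: record 3 is kept with dup_id = NULL,
-- because nothing in the outer WHERE clause refers to the subquery.
SELECT p1.app_cao_number, p1.app_id,
       (SELECT DISTINCT p2.app_id
        FROM people p2
        WHERE p2.app_cao_number <> p1.app_cao_number
          AND p2.app_id = p1.app_id) AS dup_id
FROM people p1
WHERE p1.app_id <> '';   -- NULL <> '' is not true, so NULLs are excluded too

-- The same test as a filter: only records 1 and 2 remain.
SELECT p1.app_cao_number, p1.app_id
FROM people p1
WHERE p1.app_id <> ''
  AND EXISTS (SELECT 1
              FROM people p2
              WHERE p2.app_cao_number <> p1.app_cao_number
                AND p2.app_id = p1.app_id);
```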
What about a simple query like
SELECT app_id, count(app_cao_number)
FROM people
GROUP BY app_id HAVING count(app_cao_number) > 1;
-
Hi again @Laurenz. You helped me recently. This is such a simple, elegant answer and it works perfectly for the duplicate app_id problem. Thank you. However, extending this to find duplicates in the passport column doesn't look very easy. I've searched for multiple GROUP BY clauses and found nothing. But at least doing three passes over the data using this command will be very quick. – Paul Pritchard, Dec 6, 2019 at 8:30
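One possible way to run the three checks as a single statement is to write one GROUP BY ... HAVING query per criterion and glue them together with UNION ALL. This is a sketch only, assuming PostgreSQL. The column names `app_passport`, `app_surname` and `app_firstname`, the sample rows, and the output aliases are hypothetical; the question names only `app_cao_number` and `app_id`.

```sql
-- Hypothetical schema: app_passport, app_surname and app_firstname are
-- assumed column names; only app_cao_number and app_id appear in the question.
CREATE TEMP TABLE people (
    app_cao_number int PRIMARY KEY,
    app_id         text,
    app_passport   text,
    app_surname    text,
    app_firstname  text
);
INSERT INTO people VALUES
    (1, 'A1', 'P100', 'Smith', 'Ann'),
    (2, 'A1', 'P200', 'Jones', 'Bob'),
    (3, '',   'P200', 'Smith', 'Ann'),
    (4, NULL, NULL,   'Brown', 'Cy');

-- One branch per duplicate rule; blank and NULL values are filtered out
-- before grouping so they are never reported as duplicates of each other.
SELECT 'id' AS dup_field, app_id AS dup_value, count(*) AS n
FROM people
WHERE app_id <> ''
GROUP BY app_id
HAVING count(*) > 1
UNION ALL
SELECT 'passport', app_passport, count(*)
FROM people
WHERE app_passport <> ''
GROUP BY app_passport
HAVING count(*) > 1
UNION ALL
SELECT 'name', app_surname || ' ' || app_firstname, count(*)
FROM people
WHERE app_surname <> '' AND app_firstname <> ''
GROUP BY app_surname, app_firstname
HAVING count(*) > 1;
```

Each branch still reads the table once, so this is three passes wrapped in one statement rather than a single scan. The benefit is one result set that says which rule each duplicate value violates.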