How to efficiently eliminate data from select when rows are different by one column value

Question 1

I have such TABLE with data:

A | B | C | D | E 
-----------------------
a1 | b1 | c1 | d1 | PA
a1 | b1 | c1 | d1 | PR
a2 | b2 | c2 | d2 | PA
a3 | b3 | c3 | d3 | PA
a3 | b3 | c3 | d3 | PR

row1 has its counterpart row2 (they are only different by col E)
row3 has no counterpart - unique
row4 has its counterpart row5 ( same as first statement - col E )

the query should gave me a result like this:

A | B | C | D | E 
-----------------------
a2 | b2 | c2 | d2 | PA

only unique row which has no counterpart - no row with same set of data only which different only by value in column E

we can apply here a few approaches which one will be most effective if we have more rows with its counterparts as unique rows (without counterpart)
which join use ?
which one created select statement subtract or add ?

ps. an extended case

A | B | C | D | E 
----------------------------------------
a1 | b1 | c1 | 2017年05月16日 11:46:36 | PA
a1 | b1 | c1 | 2017年05月16日 11:46:37 | PR
a2 | b2 | c2 | 2017年05月17日 01:34:28 | PA
a3 | null | c3 | 2017年05月12日 19:14:15 | PA 
a3 | b3 | c3 | 2017年05月12日 19:14:15 | PR 
a4 | b4 | c4 | 2017年05月12日 19:16:15 | PA 
a4 | b4 | c4 | 2017年05月12日 23:16:15 | PR

column A -D may contain null values (row4)
column D is a TIMESTAMP and it could vary from its counterpart by 1-2minutes (row1, row2)
we add the max difference in timestamp = example 2 minutes (row6 has no counterpart as row7 vs row6 timestamp difference is greeter than 2 minuts (4 hours)
we don't want any row with PR in our result

expected result:

A | B | C | D | E 
----------------------------------------
a2 | b2 | c2 | 2017年05月17日 01:34:28 | PA
a4 | b4 | c4 | 2017年05月12日 19:16:15 | PA

Question 2

In your extended case, which is the intended result? Rows 4 and 5 where considered "equivalent, and non-reported" in your original scenario. What should happen now? Which time-threshold to use to consider that two times are equivalent or not?

Question 3

@joanolo see edit

Question 4

See edit on response.

Question 5

I wouldn't use a JOIN. If you would like to look for rows that have a set of columns in common (in your case, all but one), I would just GROUP the rows by the common columns. Then, restrict which groups you want to return (using a HAVING clause):

SELECT
 A, B, C, D, min(E) AS E
FROM
 t
GROUP BY
 A, B, C, D
HAVING
 count(*) = 1 ;

The min(E) part is just a trick to be "SQL compliant". max(E), or first(E) (for the databases that have it) would do the job as well. As you're choosing just groups with a single row, the min value of a column within the group is the single value there is. In mySQL 5.6 and earlier, leaving E as a column would work as well; although this is not SQL standard and doesn't work in 5.7 (thanks to ypercubeTM for pointing the fact that this doesn't work on 5.7).

Results are :

| A | B | C | D | E |
|----|----|----|----|----|
| a2 | b2 | c2 | d2 | PA |

You can get everything at this SQL Fiddle

You can also take a different approach. Look for rows that literally comply with what you're looking for: that is, it does NOT exist another row with the same A, B, C and D and different E. In this case, use an EXISTS condition in your WHERE clause:

SELECT
 A, B, C, D, E
FROM
 t AS t0
WHERE
 NOT EXISTS 
 (SELECT 
 *
 FROM
 t AS t1
 WHERE
 t1.A = t0.A
 AND t1.B = t0.B
 AND t1.C = t0.C
 AND t1.D = t0.D
 AND t1.E <> t0.E
 ) ;

Older versions of MySQL would probably not handle this query too well, because EXISTS conditions are not well optimized. mySQL 5.7 does a decent job; and so do most other databases (Oracle, MS SQL Server, PostgreSQL, ...) . You do need an index on (A, B, C, D, E) (or, at least, (A, B, C, D) to be really performant.

SQL Fiddle

The two queries will not work the same in the case where you have two exact duplicate rows (which I interpret is a case not allowed by your spec). The first one will return no rows, whereas the second one will return both rows.

MySQL 5.6 Schema Setup:

CREATE TABLE t
(
 `A` varchar(2), 
 `B` varchar(2), 
 `C` varchar(2), 
 `D` varchar(2), 
 `E` varchar(2)
) ;
INSERT INTO t
 (`A`, `B`, `C`, `D`, `E`)
VALUES
 ('a1', 'b1', 'c1', 'd1', 'PA'),
 ('a1', 'b1', 'c1', 'd1', 'PR'),
 ('a2', 'b2', 'c2', 'd2', 'PA'),
 ('a3', 'b3', 'c3', 'd3', 'PA'),
 ('a3', 'b3', 'c3', 'd3', 'PR');

For your extended case, the second query is your starting point.

Change your equalities with whichever conditions meet your needs.

For instance, if your D columns is a timestamp, and you want to consider that "two times are equivalent if the difference is less than 2 minutes"... then change

t1.D = t0.D

by

abs( unix_timestamp(t1.D) - unix_timestamp(t0.D) ) < 120 -- seconds

The possibility of nulls (that work as "jokers") should be handled with:

(t1.A = t0.A or t0.A is null or t1.A is null)

If E is nullable, then the comparison is done with

NOT (t1.E <=> t0.E)

which is mySQL equivalent of (t1.E is distinct from t0.E)

This will give you:

 SELECT
 A, B, C, D, E
 FROM
 t AS t0
 WHERE
 NOT EXISTS 
 (SELECT 
 *
 FROM
 t AS t1
 WHERE
 (t1.A = t0.A or t1.A is NULL or t0.A is NULL)
 AND (t1.B = t0.B or t1.B is NULL or t0.B is NULL)
 AND (t1.C = t0.C or t1.C is NULL or t0.C is NULL)
 AND (abs( unix_timestamp(t1.D) - unix_timestamp(t0.D) ) < 120)
 AND NOT (t1.E <=> t0.E)
 ) 
 AND t0.E <> 'PR' ;

 A | B | C | D | E 
 :- | :- | :- | :------------------ | :-
 a2 | b2 | c2 | 2017年05月17日 01:34:28 | PA
 a4 | b4 | c4 | 2017年05月12日 19:16:15 | PA

dbfiddle here

Don't expect this kind of queries to be efficient in realistic scenarios.

Question 6

"In the current implementations of mySQL, leaving E as a column would work as well" No, it wouldn't in 5.7 (latest stable version)

Question 7

I stand corrected!

Question 8

and what about shivam-kumar he ads count on e column and check on this column ? in your example we are checking on all columns count

Question 9

@ceph3us count(*) and count(e) and count(a) and ... are all equivalent when the columns are not nullable.

Question 10

second question if i have a column with timestamp and its counterpart can vary by 1 minute how can I handle this ?

Question 11

Here is my test table:

enter image description here

I will give this a try:

SELECT A,B,C,D,E,count(E) as dup 
FROM TEST.TESTTABLE 
GROUP BY A,B,C,D
HAVING count(E) = 1;

Here is the result I got:

enter image description here

You can remove the Dup column added for demo purpose and use the below query instead:

SELECT A,B,C,D,E
FROM TEST.TESTTABLE 
GROUP BY A,B,C,D
HAVING count(E) = 1;

joanolo joanolo 13.7k8 gold badges39 silver badges67 bronze badges · Answer 1 · 2017-05-21 16:23:50Z

I wouldn't use a JOIN. If you would like to look for rows that have a set of columns in common (in your case, all but one), I would just GROUP the rows by the common columns. Then, restrict which groups you want to return (using a HAVING clause):

SELECT
 A, B, C, D, min(E) AS E
FROM
 t
GROUP BY
 A, B, C, D
HAVING
 count(*) = 1 ;

The min(E) part is just a trick to be "SQL compliant". max(E), or first(E) (for the databases that have it) would do the job as well. As you're choosing just groups with a single row, the min value of a column within the group is the single value there is. In mySQL 5.6 and earlier, leaving E as a column would work as well; although this is not SQL standard and doesn't work in 5.7 (thanks to ypercubeTM for pointing the fact that this doesn't work on 5.7).

Results are :

| A | B | C | D | E |
|----|----|----|----|----|
| a2 | b2 | c2 | d2 | PA |

You can get everything at this SQL Fiddle

You can also take a different approach. Look for rows that literally comply with what you're looking for: that is, it does NOT exist another row with the same A, B, C and D and different E. In this case, use an EXISTS condition in your WHERE clause:

SELECT
 A, B, C, D, E
FROM
 t AS t0
WHERE
 NOT EXISTS 
 (SELECT 
 *
 FROM
 t AS t1
 WHERE
 t1.A = t0.A
 AND t1.B = t0.B
 AND t1.C = t0.C
 AND t1.D = t0.D
 AND t1.E <> t0.E
 ) ;

Older versions of MySQL would probably not handle this query too well, because EXISTS conditions are not well optimized. mySQL 5.7 does a decent job; and so do most other databases (Oracle, MS SQL Server, PostgreSQL, ...) . You do need an index on (A, B, C, D, E) (or, at least, (A, B, C, D) to be really performant.

SQL Fiddle

The two queries will not work the same in the case where you have two exact duplicate rows (which I interpret is a case not allowed by your spec). The first one will return no rows, whereas the second one will return both rows.

MySQL 5.6 Schema Setup:

CREATE TABLE t
(
 `A` varchar(2), 
 `B` varchar(2), 
 `C` varchar(2), 
 `D` varchar(2), 
 `E` varchar(2)
) ;
INSERT INTO t
 (`A`, `B`, `C`, `D`, `E`)
VALUES
 ('a1', 'b1', 'c1', 'd1', 'PA'),
 ('a1', 'b1', 'c1', 'd1', 'PR'),
 ('a2', 'b2', 'c2', 'd2', 'PA'),
 ('a3', 'b3', 'c3', 'd3', 'PA'),
 ('a3', 'b3', 'c3', 'd3', 'PR');

For your extended case, the second query is your starting point.

Change your equalities with whichever conditions meet your needs.

For instance, if your D columns is a timestamp, and you want to consider that "two times are equivalent if the difference is less than 2 minutes"... then change

t1.D = t0.D

by

abs( unix_timestamp(t1.D) - unix_timestamp(t0.D) ) < 120 -- seconds

The possibility of nulls (that work as "jokers") should be handled with:

(t1.A = t0.A or t0.A is null or t1.A is null)

If E is nullable, then the comparison is done with

NOT (t1.E <=> t0.E)

which is mySQL equivalent of (t1.E is distinct from t0.E)

This will give you:

 SELECT
 A, B, C, D, E
 FROM
 t AS t0
 WHERE
 NOT EXISTS 
 (SELECT 
 *
 FROM
 t AS t1
 WHERE
 (t1.A = t0.A or t1.A is NULL or t0.A is NULL)
 AND (t1.B = t0.B or t1.B is NULL or t0.B is NULL)
 AND (t1.C = t0.C or t1.C is NULL or t0.C is NULL)
 AND (abs( unix_timestamp(t1.D) - unix_timestamp(t0.D) ) < 120)
 AND NOT (t1.E <=> t0.E)
 ) 
 AND t0.E <> 'PR' ;

 A | B | C | D | E 
 :- | :- | :- | :------------------ | :-
 a2 | b2 | c2 | 2017年05月17日 01:34:28 | PA
 a4 | b4 | c4 | 2017年05月12日 19:16:15 | PA

dbfiddle here

Don't expect this kind of queries to be efficient in realistic scenarios.

"In the current implementations of mySQL, leaving E as a column would work as well" No, it wouldn't in 5.7 (latest stable version)
and what about shivam-kumar he ads count on e column and check on this column ? in your example we are checking on all columns count
@ceph3us count(*) and count(e) and count(a) and ... are all equivalent when the columns are not nullable.
second question if i have a column with timestamp and its counterpart can vary by 1 minute how can I handle this ?

Shivam Kumar Shivam Kumar 3641 gold badge2 silver badges8 bronze badges · Answer 2 · 2017-05-21 16:31:04Z

Here is my test table:

enter image description here

I will give this a try:

SELECT A,B,C,D,E,count(E) as dup 
FROM TEST.TESTTABLE 
GROUP BY A,B,C,D
HAVING count(E) = 1;

Here is the result I got:

enter image description here

You can remove the Dup column added for demo purpose and use the below query instead:

SELECT A,B,C,D,E
FROM TEST.TESTTABLE 
GROUP BY A,B,C,D
HAVING count(E) = 1;

Stack Exchange Network

How to efficiently eliminate data from select when rows are different by one column value

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

How to efficiently eliminate data from select when rows are different by one column value

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions