How do I find duplicate values in multiple columns over different rows?

Question 1

I have a database with a table of games (some columns omitted):

gameid
hometeam
awayteam
`date`

Sometimes a game is added with the same home and away teams as an existing game, but on a date that differs by just one day (the game date was moved), and the old game isn't deleted. I'd also like to be able to search over a variable number of days. For football, for example, having the same opponent within a span of +/- 5 days would be wrong, but it is possible in basketball, baseball, and many other sports.

It's easy to search for multiple values of the same home/away/date, but the date difference makes this harder.

Also, the home and away team may be swapped, with the date also being the same or slightly different.

Example data:

gameid  hometeam  awayteam  date
5       777       999       2014-10-23
6       999       777       2014-10-23
7       777       999       2014-10-24
8       777       999       2014-10-25

All of these are duplicates. It doesn't matter which one is treated as the original; the query should just let me know that there are 4 games scheduled for this matchup when there should (probably) be 1.

This is what I use to find duplicated games for the same home/away/date:

SELECT COUNT(*) AS num, hometeam AS teamid, `date` FROM `game`
WHERE sportid = 1 AND `deleted_at` IS NULL AND `date` BETWEEN '2014-07-01' AND '2015-06-30'
GROUP BY `date`, hometeam HAVING `num` > 1
UNION
SELECT COUNT(*) AS num, awayteam AS teamid, `date` FROM `game`
WHERE sportid = 1 AND `deleted_at` IS NULL AND `date` BETWEEN '2014-07-01' AND '2015-06-30'
GROUP BY `date`, awayteam HAVING `num` > 1 ORDER BY `num` DESC;

score 1 · Answer 1 · 2015-01-25 15:36:19Z

Not sure why the question popped up now, but if you are still interested in an answer, something like this:

select g1.* 
from games g1 
where exists ( 
 select 1 
 from games g2 
 where g1.gameid <> g2.gameid 
 and least(g1.hometeam,g1.awayteam) 
 = least(g2.hometeam,g2.awayteam) 
 and greatest(g1.hometeam,g1.awayteam) 
 = greatest(g2.hometeam,g2.awayteam) 
 and abs(datediff(g1.d, g2.d)) < 2
);

should give you what you need.

Since date is a reserved word I used d as the column name.

– Lennart - Slava Ukraini, commented Jan 25, 2015 at 15:38
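For anyone who wants to try the EXISTS approach with a variable day window, here is a minimal sketch against the sample data from the question. It is only an illustration under a few assumptions: SQLite (through Python's sqlite3 module) stands in for MySQL, so min()/max() replace LEAST()/GREATEST() and julianday() replaces DATEDIFF(); the function name find_near_duplicates, the max_days parameter, and the two extra rows (gameid 9 and 10) are made up for the example.

```python
import sqlite3

# Sample rows from the question, plus two hypothetical rows (9 and 10)
# added only to show games that should NOT be flagged with a small window.
SAMPLE_GAMES = [
    (5, 777, 999, "2014-10-23"),
    (6, 999, 777, "2014-10-23"),
    (7, 777, 999, "2014-10-24"),
    (8, 777, 999, "2014-10-25"),
    (9, 777, 999, "2014-11-20"),   # same pairing, 26 days after game 8
    (10, 111, 222, "2014-10-23"),  # unrelated pairing
]


def find_near_duplicates(rows, max_days):
    """Return gameids that have another game with the same unordered
    team pairing within max_days days (inclusive)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE games ("
        "gameid INTEGER PRIMARY KEY, hometeam INTEGER, awayteam INTEGER, d TEXT)"
    )
    conn.executemany("INSERT INTO games VALUES (?, ?, ?, ?)", rows)
    # SQLite dialect: two-argument min()/max() play the role of MySQL's
    # LEAST()/GREATEST(), and the julianday() difference plays DATEDIFF().
    query = """
        SELECT g1.gameid
        FROM games g1
        WHERE EXISTS (
            SELECT 1
            FROM games g2
            WHERE g1.gameid <> g2.gameid
              AND min(g1.hometeam, g1.awayteam) = min(g2.hometeam, g2.awayteam)
              AND max(g1.hometeam, g1.awayteam) = max(g2.hometeam, g2.awayteam)
              AND abs(julianday(g1.d) - julianday(g2.d)) <= ?
        )
        ORDER BY g1.gameid
    """
    ids = [row[0] for row in conn.execute(query, (max_days,))]
    conn.close()
    return ids


if __name__ == "__main__":
    print(find_near_duplicates(SAMPLE_GAMES, 1))
```

With max_days set to 1 this applies the same rule as the `< 2` comparison in the query above. Raising it (say to 5 for football) widens the window without touching the rest of the query. In MySQL itself the window would presumably be passed as a bound parameter in place of the literal 2.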

Jonathan Fite · Answer 2 · 2014-10-28 19:38:11Z

I can see that this question has been here a few days and no one has taken a stab at it yet. I'm not familiar with MySQL, so I can't give you a sample that will work, but here is an idea for you.

Add another column to your table with a hash of the two team IDs. You will need to take care that the teams are always hashed in the same order, say ascending by ID; that lets you uniquely identify a combination of teams regardless of which one is the home team.

Perhaps using md5? I came up with something like the query below which would work for MS-SQL.

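 SELECT gameid
, hometeam
, awayteam
, [date]
-- Assuming integer team IDs: cast to text and join with a separator,
-- so the two IDs are concatenated rather than added together.
, teamhash = HASHBYTES('md5', CAST(CASE WHEN hometeam < awayteam 
 THEN hometeam 
 ELSE awayteam 
 END AS varchar(11))
 + '-' + CAST(CASE WHEN hometeam > awayteam 
 THEN hometeam 
 ELSE awayteam 
 END AS varchar(11)))
FROM gamedata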

Then you can query against that looking for patterns. To make it perform better, you could add a table that contained a list of teams and all possible matches with their hashes.
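A rough sketch of the pair-key idea, written in Python rather than T-SQL so it can be run anywhere. The function names team_pair_key and team_pair_hash are made up for illustration, and the team IDs are assumed to be integers as in the sample data.

```python
import hashlib


def team_pair_key(hometeam: int, awayteam: int) -> str:
    """Build an order-independent key: smaller ID, a separator, larger ID."""
    low, high = sorted((hometeam, awayteam))
    # The separator matters: without it (12, 13) and (1, 213) would both
    # become "1213" and look like the same pairing.
    return f"{low}-{high}"


def team_pair_hash(hometeam: int, awayteam: int) -> str:
    """MD5 of the pair key, used only as a compact fingerprint (not for security)."""
    key = team_pair_key(hometeam, awayteam)
    return hashlib.md5(key.encode("ascii")).hexdigest()


if __name__ == "__main__":
    # Swapping home and away gives the same fingerprint.
    print(team_pair_hash(777, 999) == team_pair_hash(999, 777))
```

The sorted key on its own may already be enough to group on; the hash only gives a fixed-width value to store or index. Either way, the date window still has to be checked separately.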

Hope that helps.
