MSSQL query to look for duplicate record

Question 1

This query takes 6 seconds to complete. How can I optimize it? The table contains 166,803 rows.

SELECT ltrim(rtrim(CAST(cageID as nvarchar(max))))+ltrim(rtrim(CAST(trayNo as nvarchar(max)))) as _unique,* 
from lf_transit_cage
where ltrim(rtrim(CAST(cageID as nvarchar(max))))+ltrim(rtrim(CAST(trayNo as nvarchar(max)))) in
(
 SELECT dt._unique FROM
 (
 SELECT ltrim(rtrim(CAST(cageID as nvarchar(max))))+ltrim(rtrim(CAST(trayNo as nvarchar(max)))) as _unique 
 from lf_transit_cage 
 ) as dt
 group by dt._unique 
 HAVING COUNT(dt._unique)>1
)
order by cageID,trayNo

Question 2

You're filtering on computed values in the WHERE clause, which means MSSQL can't use indexes on the underlying columns. Consider creating temporary tables with indexes, though I can't promise that will be faster.
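A minimal sketch of the temp-table idea, not taken from the thread. It assumes the trimmed values fit in nvarchar(100) (an nvarchar(max) column cannot be an index key column) and that the hypothetical names #cage_keys, cageKey and trayKey do not clash with existing columns:

```sql
-- Materialise the cleaned keys once, next to the original columns.
-- Assumption: values fit in nvarchar(100); longer values would be truncated.
SELECT
    LTRIM(RTRIM(CAST(t.cageID AS nvarchar(100)))) AS cageKey,
    LTRIM(RTRIM(CAST(t.trayNo AS nvarchar(100)))) AS trayKey,
    t.*
INTO #cage_keys
FROM lf_transit_cage AS t;

-- Index the cleaned keys so the duplicate lookup can use the index.
CREATE INDEX ix_cage_keys ON #cage_keys (cageKey, trayKey);

-- Return every row whose (cageKey, trayKey) pair occurs more than once.
SELECT k.*
FROM #cage_keys AS k
JOIN (
    SELECT cageKey, trayKey
    FROM #cage_keys
    GROUP BY cageKey, trayKey
    HAVING COUNT(*) > 1
) AS d
    ON d.cageKey = k.cageKey
   AND d.trayKey = k.trayKey
ORDER BY k.cageID, k.trayNo;

DROP TABLE #cage_keys;
```

Whether this beats the original query depends on how often the lookup runs, since building the temp table still scans the whole source table once.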

Question 3

Do you really need this string concatenation, or are you just trying to find duplicate cageID, trayNo pairs in a very misguided way?
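To illustrate why the concatenation may be misleading, here is a small sketch with made-up sample values (they are not from the real table). Two different pairs collapse to the same concatenated key, while grouping on the two columns directly keeps them apart. The second query assumes cageID and trayNo are of types that can be grouped on:

```sql
-- Illustrative values only: two different (cageID, trayNo) pairs.
SELECT cageID, trayNo, cageID + trayNo AS _unique
FROM (VALUES ('1', '23'), ('12', '3')) AS v(cageID, trayNo);
-- Both rows yield _unique = '123', so the concatenated key
-- would report them as duplicates of each other.

-- Grouping on the two columns directly avoids that collision.
SELECT cageID, trayNo, COUNT(*) AS occurrences
FROM lf_transit_cage
GROUP BY cageID, trayNo
HAVING COUNT(*) > 1;
```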

Question 4

I think the main issue is that the data is stored in the wrong type, requiring a cast, but also isn't cleaned before insertion, requiring trimming.
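One way to check how much of the data actually needs trimming is sketched below. It compares DATALENGTH rather than the strings themselves, because SQL Server ignores trailing spaces when comparing strings with = or <>. The column aliases cageText and trayText are hypothetical:

```sql
-- Count rows whose values change when trimmed.
-- DATALENGTH is used because 'abc' = 'abc ' evaluates to true in SQL Server.
SELECT
    SUM(CASE WHEN DATALENGTH(c.cageText) <> DATALENGTH(LTRIM(RTRIM(c.cageText)))
             THEN 1 ELSE 0 END) AS cage_rows_needing_trim,
    SUM(CASE WHEN DATALENGTH(c.trayText) <> DATALENGTH(LTRIM(RTRIM(c.trayText)))
             THEN 1 ELSE 0 END) AS tray_rows_needing_trim
FROM lf_transit_cage AS t
CROSS APPLY (
    SELECT CAST(t.cageID AS nvarchar(max)) AS cageText,
           CAST(t.trayNo AS nvarchar(max)) AS trayText
) AS c;
```

If both counts come back as zero, the LTRIM/RTRIM calls in the original query are probably unnecessary.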

Question 5

As mentioned in the comments, there are benefits to casting/storing that unique key in the table during the ETL process, especially if it's going to be used in other places than just this query.
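If changing the ETL process is not an option, a persisted computed column is a substitute technique that stores a cleaned key in the table. This is only a sketch: the names cageKey, trayKey and the index name are hypothetical, it assumes the values fit in nvarchar(100), and it assumes the CAST is deterministic for the real column types (for example, string or integer columns):

```sql
-- Store the cleaned keys in the table itself.
-- Assumption: source columns are string or integer types,
-- so the expressions are deterministic and can be PERSISTED.
ALTER TABLE lf_transit_cage ADD
    cageKey AS LTRIM(RTRIM(CAST(cageID AS nvarchar(100)))) PERSISTED,
    trayKey AS LTRIM(RTRIM(CAST(trayNo AS nvarchar(100)))) PERSISTED;
GO  -- batch separator for SSMS/sqlcmd, not a T-SQL statement

-- Index the stored keys so lookups on them no longer need a full scan.
CREATE INDEX ix_lf_transit_cage_keys
    ON lf_transit_cage (cageKey, trayKey);
GO
```

Note that adding columns changes what SELECT * returns, so anything that depends on the current column list would need checking first.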

Most likely, the performance hit comes from using IN (which typically results in a row-by-row lookup) and from de-duplicating on the cast key. You could get a performance gain by JOINing the subquery instead of using IN. You could also use ROW_NUMBER, which, in my experience, typically performs better than GROUP BY with a HAVING clause.

Here's my example using ROW_NUMBER and CTEs for easier reading:
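For comparison, the JOIN variant might look roughly like the sketch below. It keeps the key expression from the question unchanged; the CTE name keyed and the alias d are hypothetical, and whether it is actually faster would need to be measured against the real table:

```sql
-- Compute the concatenated key once in a CTE.
;WITH keyed AS (
    SELECT
        LTRIM(RTRIM(CAST(cageID AS nvarchar(max))))
          + LTRIM(RTRIM(CAST(trayNo AS nvarchar(max)))) AS _unique,
        *
    FROM lf_transit_cage
)
-- Join each row to the set of keys that occur more than once.
SELECT k.*
FROM keyed AS k
JOIN (
    SELECT _unique
    FROM keyed
    GROUP BY _unique
    HAVING COUNT(*) > 1
) AS d
    ON d._unique = k._unique
ORDER BY k.cageID, k.trayNo;
```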

--Calculate Unique NVARCHAR key
;WITH cte_lf_transit_cage AS (
 SELECT
 ltrim(rtrim(CAST(cageID as nvarchar(max))))+ltrim(rtrim(CAST(trayNo as nvarchar(max)))) as _unique,
 *
 FROM 
 lf_transit_cage
)
--Get the Row Count
, cte_rowcount AS (
 SELECT
 _unique,
 ROW_NUMBER() OVER (PARTITION BY _unique ORDER BY cageID, trayNo) AS rowcnt
 FROM
 cte_lf_transit_cage
)
--Grab all instances of duplicate rows
SELECT
 ltc.*
FROM
 cte_lf_transit_cage ltc
WHERE
 EXISTS
 (SELECT 1 FROM cte_rowcount rc WHERE rc._unique = ltc._unique AND rc.rowcnt > 1 )
ORDER BY
 ltc.cageID,
 ltc.trayNo

Also, it was mentioned in the comments that you may not need to generate the _unique key, depending on how the data is stored. You might compare the results of the following query to confirm:

--Get the Row Count
;WITH cte_rowcount AS (
 SELECT
 cageID,
 trayNo,
 ROW_NUMBER() OVER (PARTITION BY cageID, trayNo ORDER BY trayNo) AS rowcnt
 FROM
 lf_transit_cage
)
--Grab all instances of duplicate rows
SELECT
 ltrim(rtrim(CAST(ltc.cageID as nvarchar(max))))+ltrim(rtrim(CAST(ltc.trayNo as nvarchar(max)))) as _unique,
 ltc.*
FROM
 lf_transit_cage ltc
WHERE
 EXISTS
 (SELECT * FROM cte_rowcount rc WHERE rc.cageID = ltc.cageID AND rc.trayNo = ltc.trayNo AND rc.rowcnt > 1 )
ORDER BY
 ltc.cageID,
 ltc.trayNo

Question 6

The code only works if I put cageId and trayNo in the last GROUP BY.

Question 7

The results are also different from the results of my query.

Question 8

Yeah, grouping by cageId and trayNo would result in duplicates, so I updated the query to just order by that unique key. Does that match your results now?

Question 9

My results show all the duplicate rows, but your results show only one row for each duplicate value, even with GROUP BY _unique, cageId, trayNo.

Question 10

Ah, right you are. Based on your example, we need to return all results. I've updated the query to use EXISTS; let me know if that helps.
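The difference discussed here comes from filtering on ROW_NUMBER > 1, which drops the first row of every duplicate group. As an alternative sketch (not the approach used in the answer), COUNT(*) OVER attaches the group size to every row, so all members of a duplicate group are returned in one pass. The table variable @demo and its sample values are made up for illustration:

```sql
-- Hypothetical sample data standing in for lf_transit_cage.
DECLARE @demo TABLE (cageID nvarchar(20), trayNo nvarchar(20));
INSERT INTO @demo (cageID, trayNo) VALUES
    (N'C1', N'T1'), (N'C1', N'T1'),
    (N'C1', N'T2'),
    (N'C2', N'T1'), (N'C2', N'T1'), (N'C2', N'T1');

-- Attach the size of each (cageID, trayNo) group to every row in it.
;WITH counted AS (
    SELECT d.*,
           COUNT(*) OVER (PARTITION BY d.cageID, d.trayNo) AS dup_count
    FROM @demo AS d
)
SELECT cageID, trayNo, dup_count
FROM counted
WHERE dup_count > 1
ORDER BY cageID, trayNo;
-- Returns five rows: both (C1, T1) rows and all three (C2, T1) rows.
-- The single (C1, T2) row is filtered out.
```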

vanlee1987 vanlee1987 1514 bronze badges · Accepted Answer · 2016-02-10 03:21:40Z
