Bit of a complicated one for me.
I have a database full of hundreds of thousands of records, many of which are duplicated.
I need to get all records within the last year but making sure every instance of that record is within the last year, e.g. if a record is duplicated and one is older than a year this shouldn't be included.
So far I have the below...
Step 1 - find out earliest date for each record
SELECT MIN(CreateDate) AS Date, Email FROM Results R
WHERE (R.Email IS NOT NULL AND R.Email <> '')
GROUP BY R.Email
I created this as a view and called it [EarliestInteraction]
Step 2 - grab all within the last year
Note: I need records within the last year but they also need to be in a log table also. So all records within the last year that are also present in some log tables.
So far I have done this...
SELECT * FROM EarliestInteraction ECI
WHERE ( CAST(ECI.Date AS DATE) >= CAST(GETDATE() - 365 AS DATE) )
AND (
EXISTS (
SELECT Id FROM LOG1 R
WHERE Source = 'LOGGED'
AND R.Email = ECI.Email
)
OR
EXISTS (
SELECT Id FROM LOG2 R WHERE (R.Email IS NOT NULL AND R.Email <> '')
AND R.Email = ECI.Email
AND R.EventType IN (
'LOGGED'
))
)
My question is, is this a good way of doing this and accurate?
Or am I missing something that would bring back earlier duplicates...
Any thoughts on if this is accurate or achieves the brief would be great.
2 Answers 2
Your view does not filter on the most recent year.
SELECT MIN(CreateDate) AS Date, Email FROM Results R WHERE (R.Email IS NOT NULL AND R.Email <> '') GROUP BY R.Email
This filter does not change that because your view has already grouped results taking into account data older than 1 year.
SELECT * FROM EarliestInteraction ECI WHERE ( CAST(ECI.Date AS DATE) >= CAST(GETDATE() - 365 AS DATE) ) ..
You don't need a view, just a CTE.
with (SELECT MIN(CreateDate) AS Date, Email
FROM Results R
WHERE (R.Email IS NOT NULL AND R.Email <> '')
-- add the filter before the GroupBy !
AND ( CAST(CreateDate AS DATE) >= Cast(DATEADD(year,-1,GETDATE()) as Date) )
GROUP BY R.Email) as ECI
select *
from ECI
where (
EXISTS (
SELECT Id FROM LOG1 R
WHERE Source = 'LOGGED'
AND R.Email = ECI.Email
)
OR
EXISTS (
SELECT Id FROM LOG2 R WHERE (R.Email IS NOT NULL AND R.Email <> '')
AND R.Email = ECI.Email
AND R.EventType IN (
'LOGGED'
))
);
Or in short
with (SELECT MIN(CreateDate) AS Date, Email
FROM Results R
WHERE (R.Email IS NOT NULL AND R.Email <> '')
-- add the filter before the GroupBy !
AND ( CAST(CreateDate AS DATE) >= Cast(DATEADD(year,-1,GETDATE()) as Date) )
GROUP BY R.Email) as ECI
select *
from ECI
inner join LOG1 r1 on r1.Email = ECI.Email and r1.Source = 'LOGGED'
inner join LOG2 r2 on r2.Email = ECI.Email and r2.Source = 'LOGGED'
;
The following are a few suggestions on how I'd write your view.
Source Control
If you don't already have a database project, create one in Visual Studio. Then check it in to source control. Microsoft Azure DevOps Services is free & private for teams of 5 or less (this is per project, so 5 developers per project). Then you'll be able to track changes you make to your stored procedures, views, tables, etc.
Formatting
I would download the following tool for SSMS and Visual Studio, Poor Man's T-Sql Formatter and on GitHub. I use it when I have to edit other developer's code. It's a great way to standardize your SQL. I find it does most of the formatting for me, but I'll still make a few changes after.
Here are the settings I used:
Commas
I would put the commas in front to clearly define new columns. Versus code wrapped in multiple lines. It also makes trouble-shooting code easier.
Where Clause
If you put 1=1
at the top of a WHERE
condition, it enables you to freely change the rest of the conditions when debugging a query. The SQL query engine will end up ignoring the 1=1
so it should have no performance impact. Reference
Common Table Expressions (CTE)
CTE's in your SQL help with documentation. The expression name can then let other developers know why you used that expression e.g. current_suppliers
or active_projects
.
Schema Names
Always reference the schema when selecting an object e.g. [dbo].[SalesTable]
.
Estimated Execution Plan
It's a good idea to check the Estimated Execution Plan. The shortcut in Microsoft SQL Server Management Studio (SSMS) is Ctrl + L. You can even run 2 queries in the same tab to compare the plans.
Keywords
Avoid using keywords as object names. Microsoft Reference
- Also check out the book Clean Code. It will change the way you think about naming conventions.
Revised SQL
Without table definitions and sample records I was unable to test this, but it should give you a good start.
WITH
earliest_interaction
AS
(
SELECT
[min_createdate] = CAST(MIN(r.[CreateDate]) AS DATE)
, r.[Email]
FROM
[dbo].[Results] AS r
WHERE
1=1
AND (r.[Email] IS NOT NULL AND r.[Email] != '')
GROUP BY
r.[Email]
)
,
log_files
AS
(
SELECT [Email] FROM [dbo].[LOG1] WHERE [Source] = 'LOGGED'
UNION
SELECT [Email] FROM [dbo].[LOG2] WHERE [Source] = 'LOGGED'
)
SELECT
ei.[min_createdate]
, ei.[Email]
FROM
earliest_interaction AS ei
INNER JOIN log_files AS lf ON lf.[Email] = ei.[Email]
WHERE
1=1
AND (ei.[min_createdate] >= DATEADD(YEAR, -1, GETDATE()))