WHERE NOT IN SQL Performance

Question 1

SELECT TOP 1 l.ID 
FROM Leads l
WHERE 
 l.ID NOT IN (SELECT LeadID FROM LeadFollowups) 
 AND l.ID NOT IN (SELECT LeadID FROM LeadsWorking)
 AND l.ID NOT IN (SELECT LeadID FROM LeadsDead)
 AND l.ID NOT IN (SELECT LeadID FROM LeadHolds)
 AND l.ID >= (RAND() * (SELECT MAX(ID) FROM Leads))
ORDER BY l.QualityScore DESC

I have a table growing by a 10 to 100 thousand rows a day. I have several tables to track actions of the current leads that shouldn't be included when grabbing the new lead. My query is pretty quick now, but as this grows is NOT IN the most efficient solution for this? Especially if I get in to the range of ~10 million rows in the Leads table?

Question 2

what does l.ID >= (RAND() * (SELECT MAX(ID) FROM Leads)) and why do you need this?

Question 3

Before adding the QualityScore calculation, I was just returning a random row with that line.

Question 4

so you weren't even checking the rows less than some random ID?

Question 5

Is LeadID nullable in any of the columns in the various LeadX tables? If so you should consider NOT EXISTS or explicitly adding WHERE LeadID IS NOT NULL

Question 6

If your data is correctly indexed with the ID column from each table as the primary key, then I don't see this query slowing down much as the data grows.

You are worrying prematurely, I think.... but, the purpose of the query is complicated... it is almost as if you are tying to find a random lead to prompt people to look in to 'next'.... where that lead has not been handled in some way yet.

I think you may have better performance with a cursor.... (in a stored procedure?).

Still, even if you have to do the large checks of all the 'handled' leads, you may find it faster to do just the one 'big' subselect instead of multiple smaller ones. you will need to test this on your system to get an idea of the performance.

with (
 SELECT LeadID FROM LeadFollowups
 UNION
 SELECT LeadID FROM LeadsWorking
 UNION
 SELECT LeadID FROM LeadsDead
 UNION
 SELECT LeadID FROM LeadHolds
) as handled
select TOP 1 leads.ID
from Leads leads
where leads.ID >= (RAND() * (SELECT MAX(ID) FROM Leads))
 and leads.ID not in (select LeadID from handled)
ORDER BY leads.QualityScore DESC

I don't like that your query can return zero results when the RAND() value is large... and all higher ID's are handled.

As a cursor / procedure, it will process much less data, and, as a consequence, it will likely be faster. The big difference is that it will terminate early (when it has found a valid answer), rather than calculating all the valid answers, and selecting one of them.

It could be something like:

declare @leadid as int;
declare @count as int = 0;
declare LEADCURSOR cursor for
select leads.ID
from Leads leads
where leads.ID >= (RAND() * (SELECT MAX(ID) FROM Leads))
ORDER BY leads.QualityScore DESC
open LEADIDS
fetch next from LEADIDSinto @leadid
while @@FETCH_STATUS = 0
begin
 select @count = count(*)
 from LeadFollowups, LeadsWorking, LeadsDead, LeadHolds
 where LeadFollowups.leadID = @leadid
 and LeadsWorking.leadID = @leadid
 and LeadsDead.leadID = @leadid
 and LeadHolds.leadID = @leadid
 if (@count = 0)
 BREAK
 fetch next from LEADIDSinto @leadid
end
close LEADIDS
deallocate LEADIDS
if (@count <> 0)
 set @leadid = null
select @leadid

Question 7

I am skeptical that the cursor processes much less data. For one thing, it's querying the various anti-join tables zero or more times, possible very many times.

Question 8

Perhaps that UNION could be a materialized view.

Question 9

@JonofAllTrades - With the >= condition on the RAND() function, I don't believe that SQLServer can use that as a deterministic predicate as it compiles the SQL Plan. Thus, the query, at best, because of the NOT IN, becomes a index-scan of Leads.ID, with nested-loop index-probe for each sub-table... Thus, every LeadID is scanned before it can satisfy the TOP 1... With the cursor approach we do the same thing, but start from the 'right place', and terminate early.

Question 10

My suspicion/hope (and why I recommend trying it ) is that the there will be much fewer cursor loops than nested-loops. This will depend on data distributions, usage patterns, etc.

Question 11

Thanks for the comments. You are exactly correct in the purpose of the function, to pull a random lead to work that is free.

Question 12

My query is pretty quick now, but as this grows is NOT IN the most efficient solution for this?

I expect it would be quicker if you had an indexed Status field in the Leads table (so that records in the LeadsDead table had a corresponding Status=Dead in the Leads table); and quicker still if Status were a clustered index, so that all the leads with a given Status were contiguous in the Leads table.

Question 13

if only we could go back in time and slap the person creating the database structure (in some cases) and tell them the right way to do it.

Question 14

EDIT:

If you can, you should add a column to your leads table, something like beingHandled (boolean/bit) then you could write a query that would set it once for all the records which might take a little bit of time but it only needs to be run once.

After you have done this you can query this table super simple

SELECT leads.ID
FROM Leads AS leads
WHERE beingHandled = true

Then you just need to keep this table up to date with your {application?}.

Let the application handle picking a random lead from the data set of leads that haven't been handled. In my opinion, asking a RDBMS to randomly pick a record seems completely wrong, I see that a database was created to organize data and not scramble it.

You also never mention if the record that was returned is going to be added to one of the tables that are mentioned in the where clause, so you could return the same result twice.

I assume that your application is handling the part of creating the record in the other tables, so why not let the application also update the Leads table beingHandled value to true as well.

Can you pick an ID and say anything less than that ID doesn't need to be checked?

I think that you will lose a little bit of performance by running the random function, probably not much, but if you can eliminate all the records older than some date, that would speed up the query by not having to run through so many records in the first place.

so this line of SQL,

AND l.ID >= (RAND() * (SELECT MAX(ID) FROM Leads))

Smells Badly to me

you are saying that you only want to check more recent leads, what if this line of code says

l.ID >= ({123455}) -- Translated after code does it's thing

and there are only 123456 Records?

let's also think about this for a minute,

same line of code in the where statement,

the query comes to a row and checks all the where statements, which means that it is picking a different random ID for every record that it checks, this can't be accurate, and probably isn't what you intend this line of code to do.

you should look into changing this line of code

AND l.ID >= (RAND() * (SELECT MAX(ID) FROM Leads))

rolfl rolflrolfl 98.1k17 gold badges219 silver badges419 bronze badges · Accepted Answer · 2014-02-25 18:00:48Z

If your data is correctly indexed with the ID column from each table as the primary key, then I don't see this query slowing down much as the data grows.

You are worrying prematurely, I think.... but, the purpose of the query is complicated... it is almost as if you are tying to find a random lead to prompt people to look in to 'next'.... where that lead has not been handled in some way yet.

I think you may have better performance with a cursor.... (in a stored procedure?).

Still, even if you have to do the large checks of all the 'handled' leads, you may find it faster to do just the one 'big' subselect instead of multiple smaller ones. you will need to test this on your system to get an idea of the performance.

with (
 SELECT LeadID FROM LeadFollowups
 UNION
 SELECT LeadID FROM LeadsWorking
 UNION
 SELECT LeadID FROM LeadsDead
 UNION
 SELECT LeadID FROM LeadHolds
) as handled
select TOP 1 leads.ID
from Leads leads
where leads.ID >= (RAND() * (SELECT MAX(ID) FROM Leads))
 and leads.ID not in (select LeadID from handled)
ORDER BY leads.QualityScore DESC

I don't like that your query can return zero results when the RAND() value is large... and all higher ID's are handled.

As a cursor / procedure, it will process much less data, and, as a consequence, it will likely be faster. The big difference is that it will terminate early (when it has found a valid answer), rather than calculating all the valid answers, and selecting one of them.

It could be something like:

declare @leadid as int;
declare @count as int = 0;
declare LEADCURSOR cursor for
select leads.ID
from Leads leads
where leads.ID >= (RAND() * (SELECT MAX(ID) FROM Leads))
ORDER BY leads.QualityScore DESC
open LEADIDS
fetch next from LEADIDSinto @leadid
while @@FETCH_STATUS = 0
begin
 select @count = count(*)
 from LeadFollowups, LeadsWorking, LeadsDead, LeadHolds
 where LeadFollowups.leadID = @leadid
 and LeadsWorking.leadID = @leadid
 and LeadsDead.leadID = @leadid
 and LeadHolds.leadID = @leadid
 if (@count = 0)
 BREAK
 fetch next from LEADIDSinto @leadid
end
close LEADIDS
deallocate LEADIDS
if (@count <> 0)
 set @leadid = null
select @leadid

I am skeptical that the cursor processes much less data. For one thing, it's querying the various anti-join tables zero or more times, possible very many times.
@JonofAllTrades - With the >= condition on the RAND() function, I don't believe that SQLServer can use that as a deterministic predicate as it compiles the SQL Plan. Thus, the query, at best, because of the NOT IN, becomes a index-scan of Leads.ID, with nested-loop index-probe for each sub-table... Thus, every LeadID is scanned before it can satisfy the TOP 1... With the cursor approach we do the same thing, but start from the 'right place', and terminate early.
My suspicion/hope (and why I recommend trying it ) is that the there will be much fewer cursor loops than nested-loops. This will depend on data distributions, usage patterns, etc.
Thanks for the comments. You are exactly correct in the purpose of the function, to pull a random lead to work that is free.

Stack Exchange Network

WHERE NOT IN SQL Performance

3 Answers 3

EDIT:

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

WHERE NOT IN SQL Performance

3 Answers 3

EDIT:

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions