I have a query that I execute hundreds of times in a loop as part of my process.
Initially, table A contains all records (20 million) and table B contains 0 records. The primary key in both tables is ID. Then I execute:
SELECT * FROM tableA AS a WHERE a.ID NOT IN (SELECT ID FROM tableB) LIMIT 10000;
##magic stuff in python
insert everything into table B.
Initially, the query runs super fast, but after the Nth loop (100th+), table B has grown to the point where the NOT IN operation takes noticeably longer.
Does anyone have recommendations on how I can speed up the query? So far, I've tweaked the default MySQL buffer up to 1.5 GB (the IDs are pretty small INTs, so that should be enough).
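For reference, the buffer I tweaked is (I believe) innodb_buffer_pool_size; checking and resizing it looks roughly like this (the runtime resize needs MySQL 5.7.5+, and 1.5 GB is expressed in bytes):

-- Assuming the buffer in question is the InnoDB buffer pool:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SET GLOBAL innodb_buffer_pool_size = 1610612736;  -- 1.5 GB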
Caveats:
1) One way to do this would be to delete rows from table A after I've processed them. However, I want to keep table A intact.
... the only other method I could think of is adding another column to table A (which I'd index) called 'PROCESSED', then updating that column with a second query once the records have been processed/posted, but I was hoping there was an easier solution.
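A rough sketch of that PROCESSED-column idea (the column, index, and ID values are just illustrative):

-- One-time schema change:
ALTER TABLE tableA
    ADD COLUMN processed TINYINT NOT NULL DEFAULT 0,
    ADD INDEX idx_processed (processed);

-- Each loop: pull the next unprocessed batch...
SELECT * FROM tableA WHERE processed = 0 LIMIT 10000;

-- ...do the Python work, then mark that batch done:
UPDATE tableA SET processed = 1 WHERE ID IN (101, 102, 103);  -- IDs from the batch just processed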
Thank you all in advance.
1 Answer
If your goal is to look at every row in A and do something with it, there is a much more efficient way. (It seems that B is merely there to see what you have already processed.)
The reason for it getting slower is that it has to do more work as it gets farther into A -- namely, skipping over the rows it has already processed. A processed flag might suffer the same malady.
So...
Walk through A, processing chunks as you go. Then remember where you left off so that the next 10000 will be right there, no searching. I discuss that in more detail with an eye to DELETEing, but it can be adapted for other purposes: http://mysql.rjweb.org/doc.php/deletebig#deleting_in_chunks With that, B is unnecessary.
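A minimal sketch of that walk, assuming ID is an integer primary key and that your application remembers the last ID it handled between iterations:

SET @last_id = 0;                      -- start of the walk

SELECT *
FROM   tableA
WHERE  ID > @last_id                   -- resume right after the previous chunk
ORDER  BY ID
LIMIT  10000;

-- After processing the chunk, set @last_id to the largest ID returned
-- and repeat until the SELECT comes back empty.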
Partitioning
If you are thinking about partitioning the data by months, I have to ask you "Why?". Here are some answers:
- For performance? You won't get any.
- For rapid deletion of "old" data? This is a good use case, but be sure to use PARTITION BY RANGE(...) and include the year, too. More: http://mysql.rjweb.org/doc.php/partitionmaint A rough sketch is below.
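A hypothetical example of such monthly RANGE partitioning (table and column names are invented for illustration):

CREATE TABLE events (
    id      INT NOT NULL,
    created DATE NOT NULL,
    PRIMARY KEY (id, created)          -- the partition column must be in every unique key
)
PARTITION BY RANGE (TO_DAYS(created)) (
    PARTITION p2019_08 VALUES LESS THAN (TO_DAYS('2019-09-01')),
    PARTITION p2019_09 VALUES LESS THAN (TO_DAYS('2019-10-01')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);

-- Dropping an old month is then a fast metadata operation:
ALTER TABLE events DROP PARTITION p2019_08;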
- This could work as well. Not could, but would. What I did end up doing is increasing the chunk size by 100x, and the query seems to work just as fast. In fact, I'm doing a test run again, just to see how fast all records get processed - looks like it's going to take 15 mins vs the original 3 hours. So I am happy with that. – FlyingZebra1, Aug 31, 2019 at 1:27
- @FlyingZebra1 - I prefer chunks of 100 to 1000 for a variety of reasons. And, as you say, the chunk size does not matter to performance -- it is "linear" or "O(N)". Yours was "quadratic" or "O(N^2)". – Rick James, Aug 31, 2019 at 1:30
- To summarize: what I noticed is that pulling 10k records using the original query takes roughly the same amount of time as pulling 1M records... which is something I can afford to do in this case, since I'm not pulling in much data. One thing about this process is that the IDs aren't going to come in sequentially/be processed sequentially, since I will need to run this query every few days. So it may be worthwhile to stick to the process I have + maybe foreign keys? – FlyingZebra1, Aug 31, 2019 at 1:32
- @FlyingZebra1 -- if the purpose is to reprocess all the data every few days, then this chunking should work. If you need to avoid reprocessing, then consider the processed flag, but still walk through the table in chunks -- 10K chunks may find only a few hundred "new" (processed = 0) items next week, but it still works reasonably efficiently. – Rick James, Aug 31, 2019 at 1:46
- @FlyingZebra1 Any chance you will post your CURRENT solution that runs faster than the original, with SHOW CREATE TABLE for the table(s) involved? – Wilson Hauck, Oct 15, 2019 at 15:31
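Sketching what the Rick James comment above suggests - walking the table in fixed ID windows and filtering on the flag - under the assumptions that IDs are roughly consecutive and that the flag is the 'processed' column from the question:

SELECT *
FROM   tableA
WHERE  ID >  @last_id
  AND  ID <= @last_id + 10000          -- one 10K-wide ID window
  AND  processed = 0;                  -- only the not-yet-processed rows in it

-- Advance @last_id by 10000 each iteration; most windows will return only
-- a few hundred "new" rows, but the scan per window stays cheap.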