I have a query that I execute hundreds of times in a loop as part of my process.
Initially, table A contains all records (20 million) and table B contains 0 records. The primary key in both tables is ID. Then I execute:
SELECT * FROM tableA AS a WHERE a.ID NOT IN (SELECT ID FROM tableB) LIMIT 10000;
##magic stuff in python
insert everything into table B.
Initially, the query runs super fast, but after the Nth loop (100th+), table B has grown to the point where the NOT IN operation takes noticeably longer.
Does anyone have recommendations on how I can speed up the query? So far, I've tweaked the default MySQL buffer up to 1.5 GB (the IDs are pretty small INTs, so that should be enough).
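For reference, the buffer I tweaked is (I believe) innodb_buffer_pool_size; checking and resizing it looks roughly like this (the runtime resize needs MySQL 5.7.5+, and 1.5 GB is expressed in bytes):

-- Assuming the buffer in question is the InnoDB buffer pool:
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';
SET GLOBAL innodb_buffer_pool_size = 1610612736;  -- 1.5 GB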
Caveats:
1) One way to do this would be to delete rows from table A after I've processed them. However, I want to keep table A intact.
... the only other method I could think of is adding another column to table A (which I'd index) called 'PROCESSED', then updating that column with a second query once the records have been processed/posted, but I was hoping there was an easier solution.
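A rough sketch of that PROCESSED-column idea (the column, index, and ID values are just illustrative):

-- One-time schema change:
ALTER TABLE tableA
    ADD COLUMN processed TINYINT NOT NULL DEFAULT 0,
    ADD INDEX idx_processed (processed);

-- Each loop: pull the next unprocessed batch...
SELECT * FROM tableA WHERE processed = 0 LIMIT 10000;

-- ...do the Python work, then mark that batch done:
UPDATE tableA SET processed = 1 WHERE ID IN (101, 102, 103);  -- IDs from the batch just processed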
Thank you all in advance.
1 Answer
If your goal is to look at every row in A and do something with it, there is a much more efficient way. (It seems that B is merely there to see what you have already processed.)
The reason for it getting slower is that it has to do more work as it gets farther into A -- namely, skipping over the rows it has already processed. A processed flag might suffer the same malady.
So...
Walk through A, processing chunks as you go. Then remember where you left off so that the next 10000 will be right there, no searching. I discuss that in more detail with an eye to DELETEing, but it can be adapted for other purposes: http://mysql.rjweb.org/doc.php/deletebig#deleting_in_chunks With that, B is unnecessary.
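A minimal sketch of that walk, assuming ID is an integer primary key and that your application remembers the last ID it handled between iterations:

SET @last_id = 0;                      -- start of the walk

SELECT *
FROM   tableA
WHERE  ID > @last_id                   -- resume right after the previous chunk
ORDER  BY ID
LIMIT  10000;

-- After processing the chunk, set @last_id to the largest ID returned
-- and repeat until the SELECT comes back empty.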
Partitioning
If you are thinking about partitioning the data by months, I have to ask you "Why?". Here are some answers:
- For performance? You won't get any.
- For rapid deletion of "old" data? This is a good use case, but be sure to use PARTITION BY RANGE(...) and include the year, too. More: http://mysql.rjweb.org/doc.php/partitionmaint A rough sketch is below.
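A hypothetical example of such monthly RANGE partitioning (table and column names are invented for illustration):

CREATE TABLE events (
    id      INT NOT NULL,
    created DATE NOT NULL,
    PRIMARY KEY (id, created)          -- the partition column must be in every unique key
)
PARTITION BY RANGE (TO_DAYS(created)) (
    PARTITION p2019_08 VALUES LESS THAN (TO_DAYS('2019-09-01')),
    PARTITION p2019_09 VALUES LESS THAN (TO_DAYS('2019-10-01')),
    PARTITION p_future VALUES LESS THAN MAXVALUE
);

-- Dropping an old month is then a fast metadata operation:
ALTER TABLE events DROP PARTITION p2019_08;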
- This could work as well. Not could, but would. What I did end up doing is increasing the chunk size by 100x, and the query seems to work just as fast. In fact, I'm doing a test run again, just to see how fast all records get processed - looks like it's going to take 15 mins vs the original 3 hours. So I am happy with that. – FlyingZebra1, Aug 31, 2019 at 1:27
- @FlyingZebra1 - I prefer chunks of 100 to 1000 for a variety of reasons. And, as you say, the chunk size does not matter to performance -- it is "linear" or "O(N)". Yours was "quadratic" or "O(N^2)". – Rick James, Aug 31, 2019 at 1:30
- To summarize: what I noticed is that pulling 10k records using the original query takes roughly the same amount of time as pulling 1M records... which is something I can afford to do in this case, since I'm not pulling in much data. One thing about this process is that the IDs aren't going to come in sequentially/be processed sequentially, since I will need to run this query every few days. So it may be worthwhile to stick to the process I have + maybe foreign keys? – FlyingZebra1, Aug 31, 2019 at 1:32
- @FlyingZebra1 -- if the purpose is to reprocess all the data every few days, then this chunking should work. If you need to avoid reprocessing, then consider the processed flag, but still walk through the table in chunks -- 10K chunks may find only a few hundred "new" (processed = 0) items next week, but it still works reasonably efficiently. – Rick James, Aug 31, 2019 at 1:46
- @FlyingZebra1 Any chance you will post your CURRENT solution that runs faster than the original, with SHOW CREATE TABLE for the table(s) involved? – Wilson Hauck, Oct 15, 2019 at 15:31
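Sketching what the Rick James comment above suggests - walking the table in fixed ID windows and filtering on the flag - under the assumptions that IDs are roughly consecutive and that the flag is the 'processed' column from the question:

SELECT *
FROM   tableA
WHERE  ID >  @last_id
  AND  ID <= @last_id + 10000          -- one 10K-wide ID window
  AND  processed = 0;                  -- only the not-yet-processed rows in it

-- Advance @last_id by 10000 each iteration; most windows will return only
-- a few hundred "new" rows, but the scan per window stays cheap.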