I am new to coding and need help with code that won't complete. I suspect it is due to the size of the data set: I tested the code on a reduced data set and it processes fine, but my actual data set is over 210,000 rows and is expected to grow. Is there a way to speed this up? Thank you for your assistance.
Sub DupValidation()
Dim wb As Workbook
Dim ws1 As Worksheet
Dim i As Long
Dim lastrow As Long
Dim lastrow2 As Long
Set wb = ActiveWorkbook
Set ws1 = wb.Worksheets("Tickets")
lastrow = ws1.Cells(Rows.Count, 1).End(xlUp).Row
ws1.Range("g2:g" & lastrow).ClearContents
i = 2
Do While i <= lastrow
If Application.CountIf(ws1.Range(ws1.Cells(2, 2), ws1.Cells(lastrow, 2)), ws1.Cells(i, 2)) > 1 Then
ws1.Cells(i, 7).Value = True
End If
i = i + 1
Loop
End Sub
Finding (and removing) duplicates is already a built-in functionality; this code is kind of reinventing the wheel. Making VBA code faster than native functionality, from what I've seen, requires thinking outside the box and manipulating memory locations.
But let's review the code we're looking at.
Sub DupValidation()
Kudos for using a meaningful name! This one reads like a noun though; since Sub procedures typically do something, it's common practice to make their names start with a verb, so ValidateDuplicates would be a better name. The procedure is also implicitly Public; consider always using explicit access modifiers.
Next we have a handful of declarations, one for each of the variables used in the procedure. Again, kudos for explicitly declaring all variables (is Option Explicit specified?) - but consider declaring them when they're needed, as they're needed. That way it's much harder to accidentally leave unused variables behind, like what happened to lastrow2.
Rule of thumb: if there's a digit at the end of a name, it's usually a bad name. ws1 could be sheet, or ticketsSheet.
But you're not here to hear about variable naming are you.
I'm curious why Range("g2:g" & lastrow) is good enough to get a range encompassing all cells in column G, but once inside the loop we switch to a very long-winded Range(Cells, Cells) call instead of just doing Range("B2:B" & lastrow) like we just did.
Do we really need a COUNTIF to identify a 2nd or 3rd (or 250th) duplicate of one value?
If we had a data structure we could quickly look a value up from, we could put the known-dupe value in it, and then only perform the expensive COUNTIF when we're not already looking at a known duplicate value. A Dictionary and its O(1) keyed retrieval sounds like a good tool for this.
So instead of just writing True into column G, we can store the value of column B in our dictionary, and the loop can then conditionally evaluate the COUNTIF only when the dictionary doesn't already contain the value in column B.
Actually, with a dictionary you could make the whole loop much more efficient than that.
Start at row 2, end at lastrow: the number of iterations is known before we even start looping - the loop should be a For loop, not a Do While. So we loop i from 2 to N, and at each row we try to add the value of column B to the dictionary. If the value already exists in there (that's an O(1) lookup; CountIf is O(n)), then we know we have a dupe at that row, so we write True to column G. After the loop, column G identifies all dupes, and the dictionary contains all the unique values. Iterating a variant array of values in-memory instead of worksheet cells would be even faster. You'll find the Dictionary class in the Scripting library if you want it early-bound.
But then again, I doubt it would be faster than the native highlight duplicates functionality.
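A minimal sketch of that array-based approach might look like this (a late-bound Dictionary is used here for brevity; note that, like the loop described above, it flags the 2nd occurrence onward rather than every occurrence):

```vba
Sub ValidateDuplicates()
    Dim sheet As Worksheet
    Set sheet = ActiveWorkbook.Worksheets("Tickets")

    Dim lastRow As Long
    lastRow = sheet.Cells(sheet.Rows.Count, 1).End(xlUp).Row

    ' Read column B into a Variant array: one read instead of N cell reads.
    Dim values As Variant
    values = sheet.Range("B2:B" & lastRow).Value2

    Dim results() As Variant
    ReDim results(1 To UBound(values, 1), 1 To 1)

    Dim seen As Object
    Set seen = CreateObject("Scripting.Dictionary")

    Dim i As Long
    For i = 1 To UBound(values, 1)
        If seen.Exists(values(i, 1)) Then
            results(i, 1) = True   ' 2nd, 3rd, ... occurrence: flag it
        Else
            seen.Add values(i, 1), True
        End If
    Next

    ' Write column G in one operation instead of one write per row.
    sheet.Range("G2:G" & lastRow).Value2 = results
End Sub
```

The key point is that both the reads and the writes happen in bulk, so the only per-row work is the O(1) dictionary lookup.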
- \$\begingroup\$ The second algorithm with the dictionary does not do the same as the original algorithm; it does not mark the first occurrence of a value occurring multiple times. To do that, the value saved in the dictionary should be whether the value is known to be a dupe. Then this can be used to fill column G in a second pass over column B. \$\endgroup\$ – M.Doerner, Jun 6, 2020 at 16:49
- \$\begingroup\$ @M.Doerner ha, good point! O(2n) is still O(n) though, so that 2-pass sweep should still perform quite decently (obviously the greater n, the more time it takes, but now it's linear, not quadratic). \$\endgroup\$ – Mathieu Guindon, Jun 6, 2020 at 16:53
CountIf iterates through 210,000 rows every time, and you call it 210,000 times. That's 44,100,000,000 iterations.
You'll need to find a better algorithm to compute what you want, ideally one that iterates only a constant number of times over each row.
Indeed, iterating through all the rows is slowing down your code. But what about letting Excel do the work for you by using a formula?
This is what I suggest:
Sub DupValidation()
Dim wb As Workbook
Dim ws1 As Worksheet
Dim lastrow As Long
Set wb = ActiveWorkbook
Set ws1 = wb.Worksheets("Tickets")
lastrow = ws1.Cells(ws1.Rows.Count, 1).End(xlUp).Row
With ws1.Range("g2:g" & lastrow)
.ClearContents
' The range R2C2:RC[-5] ($B$2 down to column B of the current row) expands
' as the formula is filled down, so every occurrence after the first is flagged.
.FormulaR1C1 = "=IF(COUNTIF(R2C2:RC[-5],RC[-5])>1,TRUE,"""")"
.Value2 = .Value2 ' To get rid of the formulas
End With
End Sub
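For readers less familiar with R1C1 notation, an equivalent fragment in A1 style (a sketch, reusing the same ws1 and lastrow as above) would be:

```vba
With ws1.Range("G2:G" & lastrow)
    .ClearContents
    ' $B$2:B2 expands to $B$2:B3, $B$2:B4, ... as the formula fills down,
    ' so a value is flagged from its 2nd occurrence onward.
    .Formula = "=IF(COUNTIF($B$2:B2,B2)>1,TRUE,"""")"
    .Value2 = .Value2 ' replace the formulas with their results
End With
```

Either way, Excel's calculation engine evaluates all the COUNTIFs in one recalculation instead of one COM call per row, which is where the speedup comes from.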
An implementation of Mathieu Guindon's response using a dictionary would look something like the below. Note that for it to work, you need to add a reference to the Microsoft Scripting Runtime.
Sub DupValidation()
Dim wb As Workbook
Dim ws1 As Worksheet
Dim i As Long
Dim lastrow As Long
Set wb = ActiveWorkbook
Set ws1 = wb.Worksheets("Tickets")
lastrow = ws1.Cells(ws1.Rows.Count, 1).End(xlUp).Row
ws1.Range("g2:g" & lastrow).ClearContents
Dim dict As New Scripting.Dictionary
For i = 2 To lastrow
If dict.Exists(ws1.Cells(i, 2).Value2) Then
ws1.Cells(i, 7).Value = True
Else
dict.Add ws1.Cells(i, 2).Value2, True ' Add takes a key and an item
End If
Next i
End Sub
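As M.Doerner points out in the comments, this flags the 2nd occurrence onward but not the first. A two-pass variant that matches the original code's behavior (every occurrence of a duplicated value is marked) could look like this sketch, assuming the same "Tickets" layout and a reference to Microsoft Scripting Runtime:

```vba
Sub ValidateDuplicatesTwoPass()
    Dim sheet As Worksheet
    Set sheet = ActiveWorkbook.Worksheets("Tickets")

    Dim lastRow As Long
    lastRow = sheet.Cells(sheet.Rows.Count, 1).End(xlUp).Row

    Dim values As Variant
    values = sheet.Range("B2:B" & lastRow).Value2

    ' First pass: count occurrences of each value.
    ' Reading a missing key yields Empty, and Empty + 1 = 1, so no Exists check is needed.
    Dim counts As New Scripting.Dictionary
    Dim i As Long
    For i = 1 To UBound(values, 1)
        counts(values(i, 1)) = counts(values(i, 1)) + 1
    Next

    ' Second pass: flag every row whose value occurs more than once.
    Dim results() As Variant
    ReDim results(1 To UBound(values, 1), 1 To 1)
    For i = 1 To UBound(values, 1)
        If counts(values(i, 1)) > 1 Then results(i, 1) = True
    Next

    sheet.Range("G2:G" & lastRow).Value2 = results
End Sub
```

Two passes over an in-memory array is still O(n), so this stays fast even as the data set grows.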
- \$\begingroup\$ Is the True value used for something afterwards? \$\endgroup\$