I am new to coding and need help with code that won't complete. I suspect it is due to the size of the data set: I tested the code on a reduced data set and it processes fine, but my actual data set is over 210,000 rows and is expected to grow. Is there a way to speed this up? Thank you for your assistance.
Sub DupValidation()
Dim wb As Workbook
Dim ws1 As Worksheet
Dim i As Long
Dim lastrow As Long
Dim lastrow2 As Long
Set wb = ActiveWorkbook
Set ws1 = wb.Worksheets("Tickets")
lastrow = ws1.Cells(Rows.Count, 1).End(xlUp).Row
ws1.Range("g2:g" & lastrow).ClearContents
i = 2
Do While i <= lastrow
If Application.CountIf(ws1.Range(ws1.Cells(2, 2), ws1.Cells(lastrow, 2)), ws1.Cells(i, 2)) > 1 Then
ws1.Cells(i, 7).Value = True
End If
i = i + 1
Loop
End Sub
Finding (and removing) duplicates is already a built-in functionality; this code is kind of reinventing the wheel. Making VBA code faster than native functionality, from what I've seen, requires thinking outside the box and manipulating memory locations.
But let's review the code we're looking at.
Sub DupValidation()
Kudos for using a meaningful name! This one reads like a noun though; since Sub procedures typically do something, it's common practice to make their names start with a verb, so ValidateDuplicates would be a better name. The procedure is also implicitly Public; consider always using explicit access modifiers.
Next we have a handful of declarations, one for each of the variables used in the procedure. Again, kudos for explicitly declaring all variables (is Option Explicit specified?) - but consider declaring them when they're needed, as they're needed. That way it's much harder to accidentally leave unused variables behind, like what happened to lastrow2.
Rule of thumb: if there's a digit at the end of a name, it's usually a bad name. ws1 could be sheet, or ticketsSheet.
But you're not here to hear about variable naming are you.
I'm curious why Range("g2:g" & lastrow) is good enough to get a range encompassing all cells in column G, but once inside the loop we switch to a very long-winded Range(Cells, Cells) call instead of just doing Range("B2:B" & lastrow) like we just did.
Do we really need a COUNTIF to identify a 2nd or 3rd (or 250th) duplicate of one value?
If we had a data structure we could quickly look a value up from, we could put the known-dupe value in it, and then only perform the expensive COUNTIF when we're not already looking at a known duplicate value. A Dictionary and its O(1) keyed retrieval sounds like a good tool for this.
So instead of just writing True into column G, we can store the value of column B in our dictionary, and the loop can then conditionally evaluate the COUNTIF only when the dictionary doesn't already contain the value in column B.
Actually, with a dictionary you could make the whole loop much more efficient than that.
Start at row 2, end at lastrow: the number of iterations is known before we even start looping - the loop should be a For loop, not a Do While. So we loop i from 2 to N, and at each row we try to add the value of column B to the dictionary. If the value already exists in there (that's an O(1) lookup; CountIf is O(n)), then we know we have a dupe at that row, so we write True to column G. After the loop, column G identifies all dupes, and the dictionary contains all the unique values. Iterating a variant array of values in-memory instead of worksheet cells would be even faster. You'll find the Dictionary class in the Scripting library if you want it early-bound.
But then again, I doubt it would be faster than the native highlight duplicates functionality.
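A minimal sketch of that array-based approach might look like this (a late-bound Dictionary is used here for brevity; note that, like the loop described above, it flags the 2nd occurrence onward rather than every occurrence):

```vba
Sub ValidateDuplicates()
    Dim sheet As Worksheet
    Set sheet = ActiveWorkbook.Worksheets("Tickets")

    Dim lastRow As Long
    lastRow = sheet.Cells(sheet.Rows.Count, 1).End(xlUp).Row

    ' Read column B into a Variant array: one read instead of N cell reads.
    Dim values As Variant
    values = sheet.Range("B2:B" & lastRow).Value2

    Dim results() As Variant
    ReDim results(1 To UBound(values, 1), 1 To 1)

    Dim seen As Object
    Set seen = CreateObject("Scripting.Dictionary")

    Dim i As Long
    For i = 1 To UBound(values, 1)
        If seen.Exists(values(i, 1)) Then
            results(i, 1) = True   ' 2nd, 3rd, ... occurrence: flag it
        Else
            seen.Add values(i, 1), True
        End If
    Next

    ' Write column G in one operation instead of one write per row.
    sheet.Range("G2:G" & lastRow).Value2 = results
End Sub
```

The key point is that both the reads and the writes happen in bulk, so the only per-row work is the O(1) dictionary lookup.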
- \$\begingroup\$ The second algorithm with the dictionary does not do the same as the original algorithm; it does not mark the first occurrence of a value occurring multiple times. To do that, the value saved in the dictionary should be whether the value is known to be a dupe. Then this can be used to fill column G in a second pass over column B. \$\endgroup\$ – M.Doerner, Jun 6, 2020 at 16:49
- \$\begingroup\$ @M.Doerner ha, good point! O(2n) is still O(n) though, so that 2-pass sweep should still perform quite decently (obviously the greater n, the more time it takes, but now it's linear, not quadratic). \$\endgroup\$ – Mathieu Guindon, Jun 6, 2020 at 16:53
CountIf iterates through 210,000 rows every time, and you call it 210,000 times. That's 44,100,000,000 iterations.
You'll need to find a better algorithm to compute what you want, ideally one that iterates only a constant number of times over each row.
Indeed, iterating through all the rows is slowing down your code. But what about letting Excel do the work for you by using a formula?
This is what I suggest:
Sub DupValidation()
Dim wb As Workbook
Dim ws1 As Worksheet
Dim lastrow As Long
Set wb = ActiveWorkbook
Set ws1 = wb.Worksheets("Tickets")
lastrow = ws1.Cells(ws1.Rows.Count, 1).End(xlUp).Row
With ws1.Range("g2:g" & lastrow)
.ClearContents
' The range R2C2:RC[-5] ($B$2 down to column B of the current row) expands
' as the formula is filled down, so every occurrence after the first is flagged.
.FormulaR1C1 = "=IF(COUNTIF(R2C2:RC[-5],RC[-5])>1,TRUE,"""")"
.Value2 = .Value2 ' To get rid of the formulas
End With
End Sub
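For readers less familiar with R1C1 notation, an equivalent fragment in A1 style (a sketch, reusing the same ws1 and lastrow as above) would be:

```vba
With ws1.Range("G2:G" & lastrow)
    .ClearContents
    ' $B$2:B2 expands to $B$2:B3, $B$2:B4, ... as the formula fills down,
    ' so a value is flagged from its 2nd occurrence onward.
    .Formula = "=IF(COUNTIF($B$2:B2,B2)>1,TRUE,"""")"
    .Value2 = .Value2 ' replace the formulas with their results
End With
```

Either way, Excel's calculation engine evaluates all the COUNTIFs in one recalculation instead of one COM call per row, which is where the speedup comes from.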
An implementation of Mathieu Guindon's response using a dictionary would look something like the below. Note that for it to work, you need to add a reference to the Microsoft Scripting Runtime.
Sub DupValidation()
Dim wb As Workbook
Dim ws1 As Worksheet
Dim i As Long
Dim lastrow As Long
Set wb = ActiveWorkbook
Set ws1 = wb.Worksheets("Tickets")
lastrow = ws1.Cells(ws1.Rows.Count, 1).End(xlUp).Row
ws1.Range("g2:g" & lastrow).ClearContents
Dim dict As New Scripting.Dictionary
For i = 2 To lastrow
If dict.Exists(ws1.Cells(i, 2).Value2) Then
ws1.Cells(i, 7).Value = True
Else
dict.Add ws1.Cells(i, 2).Value2, True ' Add takes a key and an item
End If
Next i
End Sub
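As M.Doerner points out in the comments, this flags the 2nd occurrence onward but not the first. A two-pass variant that matches the original code's behavior (every occurrence of a duplicated value is marked) could look like this sketch, assuming the same "Tickets" layout and a reference to Microsoft Scripting Runtime:

```vba
Sub ValidateDuplicatesTwoPass()
    Dim sheet As Worksheet
    Set sheet = ActiveWorkbook.Worksheets("Tickets")

    Dim lastRow As Long
    lastRow = sheet.Cells(sheet.Rows.Count, 1).End(xlUp).Row

    Dim values As Variant
    values = sheet.Range("B2:B" & lastRow).Value2

    ' First pass: count occurrences of each value.
    ' Reading a missing key yields Empty, and Empty + 1 = 1, so no Exists check is needed.
    Dim counts As New Scripting.Dictionary
    Dim i As Long
    For i = 1 To UBound(values, 1)
        counts(values(i, 1)) = counts(values(i, 1)) + 1
    Next

    ' Second pass: flag every row whose value occurs more than once.
    Dim results() As Variant
    ReDim results(1 To UBound(values, 1), 1 To 1)
    For i = 1 To UBound(values, 1)
        If counts(values(i, 1)) > 1 Then results(i, 1) = True
    Next

    sheet.Range("G2:G" & lastRow).Value2 = results
End Sub
```

Two passes over an in-memory array is still O(n), so this stays fast even as the data set grows.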
- \$\begingroup\$ Is the True value used for something afterwards? \$\endgroup\$