This code compares data from two spreadsheets generated from our clinic's software. The goal is to eliminate all the rows from the PaymentsSheet
(our most recent transactions) that are duplicated on the InvoicesSheet
(already uploaded transactions) to avoid duplicating info when we upload it to our account software, matching multiple columns to confirm that it is an actual duplicate.
The sheets have a few thousand rows each and will only be getting bigger. Right now it's moving very, very slowly so I'm trying to make my code more efficient with very little success. Any suggestions you have to speed it up or other areas I can improve on would be greatly appreciated. I've included the whole piece just in case but the bottom section with the loops seems to be the big problem area.
Sub cleanInvoices()
Application.EnableEvents = False
ActiveSheet.DisplayPageBreaks = False
Application.ScreenUpdating = False
Dim PaymentsWorkbook As Workbook
Set PaymentsWorkbook = ThisWorkbook
Dim PaymentsSheet As Worksheet
Set PaymentsSheet = PaymentsWorkbook.Sheets(4)
Dim wb As Workbook
For Each wb In Application.Workbooks
wb.Save
Next wb
Dim fileLocation As Variant
fileLocation = Application.GetOpenFilename(Title:="Please choose a Excel File to Open", MultiSelect:=False)
If VarType(fileLocation) = vbBoolean Then
MsgBox "No file selected, Please rerun macro", vbExclamation, "No File Selected!"
Exit Sub
End If
Dim InvoicesWorkbook As Workbook
Set InvoicesWorkbook = Workbooks.Open(fileLocation)
Dim InvoicesSheet As Worksheet
Set InvoicesSheet = InvoicesWorkbook.Sheets(1)
Dim InvoiceRow As Integer
Dim PaymentsRow As Integer
Dim lastCellinPaymentsRow As Integer
Dim lastCellinInvoicesSheet As Integer
Dim lastCellinPaymentsSheet As Integer
Dim lastCellinInvoicesRow As Integer
Dim numMatches As Integer
'clears blank rows on sheets (This part seems to move fairly quickly)
PaymentsSheet.Activate
lastCellinPaymentsSheet = Cells(Rows.Count, 1).End(xlUp).Row
PaymentsSheet.Range("A2").Select
PaymentsSheet.Range(Selection, Selection.End(xlDown)).Select
lastCellinPaymentsRow = Selection.Rows.Count + 1
While (lastCellinPaymentsRow <> lastCellinPaymentsSheet)
PaymentsSheet.Range("A" & lastCellinPaymentsRow + 1, "G" & lastCellinPaymentsRow + 1) = PaymentsSheet.Range("A" & lastCellinPaymentsRow, "G" & lastCellinPaymentsRow)
PaymentsSheet.Range("A2").Select
PaymentsSheet.Range(Selection, Selection.End(xlDown)).Select
lastCellinPaymentsRow = Selection.Rows.Count + 1
Wend
PaymentsSheet.Range("A2").Select
PaymentsSheet.Range(Selection, Selection.End(xlDown)).Select
lastCellinPaymentsRow = Selection.Rows.Count
InvoicesSheet.Activate
lastCellinInvoicesSheet = Cells(Rows.Count, 4).End(xlUp).Row
InvoicesSheet.Range("D2").Select
InvoicesSheet.Range(Selection, Selection.End(xlDown)).Select
lastCellinInvoicesRow = Selection.Rows.Count + 1
While (lastCellinInvoicesRow < lastCellinInvoicesSheet)
InvoicesSheet.Range("B" & lastCellinInvoicesRow + 1, "F" & lastCellinInvoicesRow + 1) = InvoicesSheet.Range("B" & lastCellinInvoicesRow, "F" & lastCellinInvoicesRow)
InvoicesSheet.Range("D2").Select
InvoicesSheet.Range(Selection, Selection.End(xlDown)).Select
lastCellinInvoicesRow = Selection.Rows.Count
Wend
'compares sheets and deletes appropriate rows (This part is where it starts to run very slow and lock up)
PaymentsSheet.Activate
For p = 2 To lastCellinPaymentsRow
For i = 2 To lastCellinInvoicesRow
PaymentsSheet.Application.StatusBar = "Completed " & p & " of " & lastCellinPaymentsRow
If StrComp(PaymentsSheet.Cells(p, 1).Value, InvoicesSheet.Cells(i, 4).Value) = 0 Then
If StrComp(PaymentsSheet.Cells(p, 2).Value, InvoicesSheet.Cells(i, 25).Value) = 0 Then
If StrComp(PaymentsSheet.Cells(p, 4).Value, InvoicesSheet.Cells(i, 5).Value) = 0 Then
If StrComp(PaymentsSheet.Cells(p, 7).Value, InvoicesSheet.Cells(i, 42).Value) = 0 Then
numMatches = numMatches + 1
PaymentsSheet.Cells(p, 7).EntireRow.Delete
End If
End If
End If
End If
Next i
Next p
PaymentsSheet.Application.ScreenUpdating = True
End Sub
2 Answers 2
You asked primarily for some help improving the performance of your code. So I will start with that. Then, I will go on with a few things that can help to generally improve the code.
Performance
Context switching
One if the primary things slowing down code looping over cell ranges is the constant context switching between the VBA and Excel. It is usually preferable to read the entire relevant data into memory, then iterate over that and finally write data in one operation back to the file. In your case, the last part would not really consist of writing data, but deleting rows.
You can read the contents of a Range into a 2d Variant array using the Value member. The indexing of the array works the same as with Cells
(or actually the implicit call to Item
).
So, to get all values from a sheet wks
into a variable data
declared as Variant
, you can simply use data = wks.UsedRange.Value
.
Reading in both the payment sheet and invoice sheet this way, allows you to iterate over both without switching back to Excel for every value.
Reference key lookups
To determine whether a corresponding record exists in the invoice sheet, your code iterates over the entire invoice sheet each time. If there are only a few hundred items in there, that can really be ok. However, should there be more, you might want to generate a joined key by concatenation the strings with a distinct separator and store them in a dictionary for fast existence checks.
Correctness
Deleting in one go
If I see this correctly, your code actually has a typical bug you might not have realized it has. When you delete a row, you skip the next one.
Let me explain the issue. When you delete a row, all row indices of the rows below are adjusted by subtracting one. This means that the row index of the one after the just deleted row is then the same as the one of the deleted row. Your code then checks the row with the next higher row index, which is the one two after the deleted one.
To avoid this, you can in principle iterate in descending order, or just delete in descending order, if you follow the advice to read out all data and determine the rows to delete from there. However, I heard somewhere that this approach sometimes has issues, too, I think with filters.
One way to delete all the rows at once is to use the Application.Union
method to construct a range containing all rows to delete and then to call Delete
on it.
Type for row numbers
In your code, the variables holding row numbers are of type Integer
. That is a bad idea. In VBA the largest Integer
is 32767, but a sheet in modern Excel can have millions of rows. You should really use Long
instead.
General
Single responsibility principle
This principle of good software design says that a unit of work, e.g. a procedure, should be responsible for just one thing at once. More practically, it should only have to change for a single reason.
Your procedure has a lot of responsibilities: it saves all workbooks; it asks the user for the invoice filename; it gets the sheets; it removes blank rows ; it removes payment rows found in the invoice sheet.
All of these responsibilities could be their own procedure or function taking appropriate parameters as needed. Then, these could be called from a coordinating procedure.
The result of such a separation is that the code is much simpler to read since the coordinating procedure reads like a table of contents and individual procedures are more focused. Moreover, with the separation, the code is easier to debug, because it is clear which part dies what.
Selecting and activating
It is next to never necessary to activate a sheet or select a range in VBA code. Instead, the references should simply be saved in correspondingly typed variables. This has the neet side-effect that one does not mess with the users selection.
Advanced
As an advanced suggestion, I would like to present a way to guarantee that application settings like ScreenUpdating
are reset at the end of a procedure, no matter what.
Since VBA has deterministic object lifetimes you can follow the RAII (resource allocation is instantiation) principle known from C++. This encapsulates resource allocation or, in extension, global settings changes in objects that get destroyed when they fall out of scope, e.g. at the end of a procedure.
Let my explain how that works for the ScreenUpdating
settings. You define a new class module, say NoScreenUpdatingContainer
and insert the following code.
Option Explicit
Private Sub initialState as Boolean
Private Sub Class_Initiate()
initialState = Application.ScreenUpdating
Application.ScreenUpdating = False
End Sub
Private Sub Class_Terminate()
Application.ScreenUpdating = initialState
End Sub
When you bind an instance of this class to a variable, it will deactivate screen updating. Once the variable goes out of scope, the initial state will be restored.
-
\$\begingroup\$ Do you have any idea why context switching is slow? Why it's not just a matter of reading an array of values from Excel into VBA? \$\endgroup\$Greedo– Greedo2021年11月02日 11:48:09 +00:00Commented Nov 2, 2021 at 11:48
One area of 'opportunity' for optimization is that the nested loops are not 'short-circuited' once a match is found. So, even though a matching payment/invoice pair have been found, the code continues to search through the remainder of the invoices every time. Also, the current code deletes a payment row as soon as it finds a match. Which means that the loop is going to operate past the rows of interest if any payment rows are deleted.
The code modifications below focus on improving the last portion with the nested loops within a dedicated method. It makes use of Dictionary
classes to help organize the comparisons and 'short circuiting' the interior loop. Also, rather than working through nested If
statements with StrComp
evaluations, the code sets up a single string comparison for each row.
The nested loops span too many lines of code in both the original form and in this answer. The modified code will definitely not win any 'style' points. The nested loop code should be refactored to call helper functions and reduce the nesting in order to make it easier to read. That said, it was left in this form thinking that, at least initially, it may be easier to compare with the original code.
'Collate/organize the set of strings to compare
Private Type TComparisons
First As String
Second As String
Third As String
Fourth As String
End Type
Sub cleanInvoices()
Dim originalScreenUpdateSetting As Boolean
originalScreenUpdateSetting = Application.ScreenUpdating
Application.EnableEvents = False
ActiveSheet.DisplayPageBreaks = False
Application.ScreenUpdating = False
'Ensure that Application.EnableEvents is reset in case of an error
On Error GoTo ErrorExit
<original code - unchanged>
'compares sheets and deletes appropriate rows (This part is where it starts to run very slow and lock up)
PaymentsSheet.Activate
EvaluateMatches PaymentsSheet, InvoicesSheet
ErrorExit:
'Note: the other flags modified at the to of the Subroutine are left as-is..intentional?
PaymentsSheet.Application.ScreenUpdating = originalScreenUpdateSetting
End Sub
Private Sub EvaluateMatches(ByVal PaymentsSheet As Worksheet, ByVal InvoicesSheet As Worksheet)
Dim toCompare As TComparisons
'use a Dictionary to cache the Invoice compare strings
Dim invoices As Dictionary
Set invoices = New Dictionary
'Store the rows to delete rather than actually deleting the rows within the nested loops
Dim rowsToDelete As Collection
Set rowsToDelete = New Collection
'Declaring a Dictionary to leverage the "Exists" member
Dim invoiceRowsMatched As Dictionary
Set invoiceRowsMatched = New Dictionary
Dim pmtCompareString As String
For p = 2 To lastCellinPaymentsRow
With PaymentsSheet
toCompare.First = .Cells(p, 1).Value
toCompare.Second = .Cells(p, 2).Value
toCompare.Third = .Cells(p, 4).Value
toCompare.Fourth = .Cells(p, 7).Value
End With
pmtCompareString = MakeSingleCompareString(toCompare)
For i = 2 To lastCellinInvoicesRow
'Load up the invoices Dictionary during the first iteration
If p = 2 Then
With InvoicesSheet
toCompare.First = .Cells(i, 4).Value
toCompare.Second = .Cells(i, 25).Value
toCompare.Third = .Cells(i, 5).Value
toCompare.Fourth = .Cells(i, 42).Value
End With
invoices.Add i, MakeSingleCompareString(toCompare)
End If
'Only look at invoices that have not been matched
If Not invoiceRowsMatched.Exists(i) Then
If StrComp(pmtCompareString , invoices(i)) = 0 Then
rowsToDelete.Add p
invoiceRowsMatched.Add i, i
'Short ciruit the inner loop since a match has been found
i = lastCellinInvoicesRow
End If
End If
Next i
Next p
Dim rowToDelete As Variant
Dim rowToDeleteOffset As Long
rowToDeleteOffset = 0
For Each rowToDelete In rowsToDelete
PaymentsSheet.Cells(rowToDelete - rowToDeleteOffset , 1).EntireRow.Delete
rowToDeleteOffset = rowToDeleteOffset + 1
Next
End Sub
Private Function MakeSingleCompareString(toCompare As TComparisons) As String
MakeSingleCompareString = Trim$(toCompare.First) & "|" & Trim$(toCompare.Second) & "|" & Trim$(toCompare.Third) & "|" & Trim$(toCompare.Fourth)
End Function
```
-
\$\begingroup\$ I think the way the rows are deleted here is not correct. The problem is that deleting a row will change the row indices. You either have to delete them in descending order, which leaves the row induces of the still to be deleted rows intact, or delete them all at once. \$\endgroup\$M.Doerner– M.Doerner2021年11月01日 20:51:29 +00:00Commented Nov 1, 2021 at 20:51
-
\$\begingroup\$ @M.Doerner is correct that deleting the rows as originally posted is a bug. The intent was to make finding the rows to delete 'one responsibility' and the actual deletion of the rows a
different responsibility
. So, the deletion loop has been modified to ensure rows are deleted in descending order. The original code contained a bug that would skip a matching payment row if it directly followed a deleted payment row. \$\endgroup\$BZngr– BZngr2021年11月01日 21:51:51 +00:00Commented Nov 1, 2021 at 21:51