VBA: Efficiently Organise Data with Missing Values to Achieve Minimum Number of Tables

Question 1

The task is to create sub-tables from an original table (like below) that contain the missing data columns (missing data is highlighted in red). The important part of this task is to make the smallest number of output tables based on the combinations of missing data points across different columns for different rows.

Input Table

In the above example the optimal output will create 4 tables such as the ones that are shown below. These tables include common columns that are missing across different examples.

Output Tables

This is the code that works to output these tables. An interesting element which this code does not cover and why it can be improved is that the Example20 row is correctly split across Table2 and Table4 as they had been created previously (and IsSubset returned True). However, had Example20 appeared earlier in the data set, a table would have been created for it with the two columns that it has missing ("G" and "I") and then we would have ended up with 5 output tables as we would also have had tables that covered I and G individually which is not the optimal solution. We are looking for an optimised code that finds optimal solution consistently regardless of the order of the rows.

Sub SeparateGroupedMissingDataToNewWorkbookOptimised()
 Dim ws As Worksheet
 Dim wbNew As Workbook
 Dim wsOutput As Worksheet
 Dim lastRow As Long, lastCol As Long
 Dim currentRow As Long
 Dim col As Integer
 Dim outputRow As Long
 Dim missingKey As String
 Dim dict As Object
 Dim missingDict As Object
 Dim existingKeys As Variant
 Dim headers() As String
 Dim foundSubset As Boolean
 Dim i As Integer, j As Integer
 
 ' Define input sheet
 Set ws = ThisWorkbook.Sheets("Missing Data")
 
 ' Create a new workbook for output
 Set wbNew = Workbooks.Add
 
 ' Initialise dictionaries to keep track of missing data
 Set dict = CreateObject("Scripting.Dictionary")
 Set missingDict = CreateObject("Scripting.Dictionary")
 
 ' Find the last row and column in the input data
 lastRow = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row
 lastCol = ws.Cells(1, ws.Columns.Count).End(xlToLeft).Column
 
 ' Loop through each row to identify missing data patterns
 For currentRow = 2 To lastRow
 missingKey = ""
 
 ' Construct a key that represents missing columns
 For col = 3 To lastCol ' Start from 3 to skip "Name" and "Code"
 If ws.Cells(currentRow, col).Value = "" Then
 missingKey = missingKey & ws.Cells(1, col).Value & "|"
 End If
 Next col
 
 ' If there is missing data
 If missingKey <> "" Then
 foundSubset = False
 existingKeys = dict.Keys
 
 ' Check all existing patterns for subset matches
 For i = 0 To dict.Count - 1
 Dim existingKey As String
 existingKey = existingKeys(i)
 
 ' Check if current missingKey is a subset of existingKey or vice versa
 If IsSubset(existingKey, missingKey) Then
 foundSubset = True
 ' Output the row to the existing subset sheet
 Set wsOutput = missingDict(existingKey)
 outputRow = wsOutput.Cells(wsOutput.Rows.Count, 1).End(xlUp).Row + 1
 
 wsOutput.Cells(outputRow, 1).Value = ws.Cells(currentRow, 1).Value ' Name
 wsOutput.Cells(outputRow, 2).Value = ws.Cells(currentRow, 2).Value ' Code
 
 ' Fill in the missing columns as blanks
 headers = Split(existingKey, "|")
 For j = 0 To UBound(headers) - 1
 If headers(j) <> "" Then
 For col = 3 To lastCol
 If wsOutput.Cells(1, col).Value = headers(j) Then
 wsOutput.Cells(outputRow, col).Value = "" ' Missing column data
 End If
 Next col
 End If
 Next j
 End If
 Next i
 
 ' If no matching subset found, create a new table for this missing data pattern
 If Not foundSubset Then
 dict.Add missingKey, dict.Count + 1
 
 ' Create a new worksheet for each unique pattern of missing data
 Set wsOutput = wbNew.Sheets.Add(After:=wbNew.Sheets(wbNew.Sheets.Count))
 wsOutput.Name = "Missing_" & dict.Count
 
 ' Add headers to the new sheet
 wsOutput.Cells(1, 1).Value = "Name"
 wsOutput.Cells(1, 2).Value = "Code"
 
 ' Add the column names that are missing
 headers = Split(missingKey, "|")
 For col = 0 To UBound(headers) - 1
 If headers(col) <> "" Then
 wsOutput.Cells(1, col + 3).Value = headers(col)
 End If
 Next col
 
 ' Initialize missing data dictionary
 missingDict.Add missingKey, wsOutput
 
 ' Output the current row to the new sheet
 outputRow = 2
 wsOutput.Cells(outputRow, 1).Value = ws.Cells(currentRow, 1).Value ' Name
 wsOutput.Cells(outputRow, 2).Value = ws.Cells(currentRow, 2).Value ' Code
 
 ' Fill in the missing columns as blanks
 For j = 0 To UBound(headers) - 1
 If headers(j) <> "" Then
 wsOutput.Cells(outputRow, j + 3).Value = "" ' Missing column data
 End If
 Next j
 End If
 End If
 Next currentRow
 
 ' Adjust column widths for better readability in each output sheet
 For Each wsOutput In wbNew.Sheets
 wsOutput.Cells.EntireColumn.AutoFit
 Next wsOutput
 
 MsgBox "Missing data grouped by missing columns in a new workbook!", vbInformation
End Sub
' Function to check if one key is a subset of another
' key1: A string representing a set of missing column names separated by "|"
' key2: Another string representing a different set of missing column names separated by "|"
' Returns: True if key1 is a subset of key2, False otherwise
Function IsSubset(key1 As String, key2 As String) As Boolean
 ' Declare arrays to hold the column names extracted from key1 and key2
 Dim arr1() As String, arr2() As String
 ' Loop variables
 Dim i As Integer, j As Integer
 ' Flag to indicate if a match is found
 Dim found As Boolean
 
 ' Split the keys into arrays of column names using the "|" delimiter
 arr1 = Split(key1, "|") ' Array of missing column names from key1
 arr2 = Split(key2, "|") ' Array of missing column names from key2
 
 ' If key1 has more elements than key2, it cannot be a subset
 If UBound(arr1) > UBound(arr2) Then
 ' key1 should be smaller or equal in length to key2 to be a subset
 IsSubset = False
 Exit Function
 End If
 
 ' Loop through each element in arr1 (representing key1)
 For i = 0 To UBound(arr1) - 1
 ' Skip empty elements caused by trailing or leading delimiters
 If arr1(i) <> "" Then
 found = False ' Reset the found flag for each element in arr1
 ' Loop through each element in arr2 (representing key2)
 For j = 0 To UBound(arr2) - 1
 ' If the current element in arr1 matches an element in arr2
 If arr1(i) = arr2(j) Then
 found = True ' Set the found flag to True
 Exit For ' No need to continue searching in arr2 for this element
 End If
 Next j
 ' If no match is found in arr2 for the current element in arr1
 If Not found Then
 IsSubset = False ' key1 is not a subset of key2
 Exit Function ' Exit the function early
 End If
 End If
 Next i
 
 ' If all elements in arr1 are found in arr2, key1 is a subset of key2
 IsSubset = True
End Function

Question 2

Thanks you for your post. Good job on a fairly complex problem. I noticed that you are also going to post it on StackOverflow. My solution should be done by tomorrow.

Question 3

Thanks @TinMan, appreciate the advice in your answer below and look forward to reading your solution (did I see you post a solution to this on StackOverflow? I just got round to reading it and can't see it anymore!).

Question 4

Not really an answer but too long for a comment

On the algorithm, I'll ignore the point that this isn't doing what you want (isn't guaranteed to give best answer) since that kind of thing is off topic for this site. It would be easy to wrap your code in a loop to check every permutation of columns and rows to see which produces the fewest tables. But that would not be very efficient.

I would recommend thinking about this as a binary matrix where a 1 is a gap, a 0 is data. Your problem is then to find which ordering of rows and columns in the matrix minimises the number of rectangular areas containing only 1s. I imagine by using a binary matrix you'll find some analogy in graph theory that allows you to use a graph search algorithm to find the optimal shape.

https://stackoverflow.com/q/66021340/6609896 https://stackoverflow.com/q/11481868/6609896

Question 5

I misread the post. I'm going to have to think about this a bit.

Question 6

@TinMan Yeah "It should be too difficult" I think is accurate, this is not a straightforward problem I don't think!

Question 7

Difficulty 8.25. I should be able to remove the subset from the superset and get the right result but that would require a the data be sorted each time. I'll rewrite it tomorrow using a binary search.

Question 8

@TinMan uh oh the number is rising! I hope you figure it out it's an interesting problem

Question 9

Uncle! I don't have the credentials (or any credentials for that matter) to solve this problem. Difficult ∞. I think that any solution that I come up with would just be a theory because there could always be an edge case to break it. This seems relevant: Set cover problem. Let me know if you solve it.

Question 10

SeparateGroupedMissingDataToNewWorkbookOptimised() - SGMDTNWO

That is a very long and descriptive name (48 characters). I see why its so long, the subroutine is performing many tasks. I try and keep my procedures under 40 lines for readability. Ideally, a method perform a minimum number of task. The fewer tasks a method that are performed the easier it is to debug and modify.

Breaking the code into multiple procedure will also make it more flexible.

Let's start by extracting the core functionality to a function. We are after all primarily looking at how to create a Data Model.

Function GetMissingDataTables(Data As Variant) as Variant()

This function can have a smaller name because it is doing less tasks. The beauty of it is it makes it far easier and flexible to test. Notice, that we're passing in a data array and not a range. This allows us to add more use cases. For instance, we could port the code to Access.

MDTables = GetMissingDataTables(ws.Range("A1").CurrentRegion)

We can now pass our model to multiple views:

Sub CreateMDWorkbookView(wb as Workbook, MDTables() As Variant)

Passing in the a workbook and model adds more flexibility yet.

Sub CreateMDWorksheetView(ws as WorkSheet, MDTables() As Variant)

Maybe you decide that you need all the tables on a single worksheet, just write another method and pass in you data model.

Consider writing a class to manage your code.

MissingDataReport

• Sub LoadData(Data As Variant)
• Sub CreateWorkbookView(wb as Workbook, MDTables() As Variant)
• Sub CreateWorksheetView(ws as WorkSheet, MDTables() As Variant)
• Function TableCount() as Long
• Function GetTable(TableIndex as Long) As Variant()
• Function GetTables() As Variant()
• Function GetTableWithoutMissingData() As Variant()

When tackling a complex problem, I will scaffold out the code. I probably won't use many of the methods but helps me focus on specific tasks. Doing it this way makes me more productive. If I get stuck a one part, I can work on other parts of the code to clear my head.

Dim i As Integer, j As Integer

If I could, I would edit every post about arrays on the internet to use r and c. It's a little thing but it really make the code easier to read.

Dim r As Long, c As Long

By the way, use Long and not Integer. An Integer max value is only 32,767 where as Long is 2,147,483,647.

Consider:

Dim i As Long, j As Long, k As Long, l As Long

Compared to:

Dim r1 As Long, c1 As Long, r2 As Long, c2 As Long

Magic Numbers

Magic numbers are hard-coded values that are not explained. Magic number should be replaced by consts.

Const FirstDataColumn as Long = 3 For currentRow = FirstDataColumn To lastRow

Key Columns

Having hard coded key columns really limits the use case for the code. Consider using key columns and excluded columns to calculate missing data columns. This would take you code to a whole new level.

Add a Reference to the Microsoft Scripting Runtime

This enables Intellisense to do what it does so well; make us more productive.

Scripting.Dictionary Intellisense

Greedo 2,6352 gold badges15 silver badges36 bronze badges · Answer 1 · 2024-09-13 13:13:18Z

Not really an answer but too long for a comment

On the algorithm, I'll ignore the point that this isn't doing what you want (isn't guaranteed to give best answer) since that kind of thing is off topic for this site. It would be easy to wrap your code in a loop to check every permutation of columns and rows to see which produces the fewest tables. But that would not be very efficient.

I would recommend thinking about this as a binary matrix where a 1 is a gap, a 0 is data. Your problem is then to find which ordering of rows and columns in the matrix minimises the number of rectangular areas containing only 1s. I imagine by using a binary matrix you'll find some analogy in graph theory that allows you to use a graph search algorithm to find the optimal shape.

https://stackoverflow.com/q/66021340/6609896 https://stackoverflow.com/q/11481868/6609896

I misread the post. I'm going to have to think about this a bit.
@TinMan Yeah "It should be too difficult" I think is accurate, this is not a straightforward problem I don't think!
Difficulty 8.25. I should be able to remove the subset from the superset and get the right result but that would require a the data be sorted each time. I'll rewrite it tomorrow using a binary search.
@TinMan uh oh the number is rising! I hope you figure it out it's an interesting problem
Uncle! I don't have the credentials (or any credentials for that matter) to solve this problem. Difficult ∞. I think that any solution that I come up with would just be a theory because there could always be an edge case to break it. This seems relevant: Set cover problem. Let me know if you solve it.

TinMan 4,3281 gold badge13 silver badges25 bronze badges · Answer 2 · 2024-09-14 14:57:33Z

SeparateGroupedMissingDataToNewWorkbookOptimised() - SGMDTNWO

That is a very long and descriptive name (48 characters). I see why its so long, the subroutine is performing many tasks. I try and keep my procedures under 40 lines for readability. Ideally, a method perform a minimum number of task. The fewer tasks a method that are performed the easier it is to debug and modify.

Breaking the code into multiple procedure will also make it more flexible.

Let's start by extracting the core functionality to a function. We are after all primarily looking at how to create a Data Model.

Function GetMissingDataTables(Data As Variant) as Variant()

This function can have a smaller name because it is doing less tasks. The beauty of it is it makes it far easier and flexible to test. Notice, that we're passing in a data array and not a range. This allows us to add more use cases. For instance, we could port the code to Access.

MDTables = GetMissingDataTables(ws.Range("A1").CurrentRegion)

We can now pass our model to multiple views:

Sub CreateMDWorkbookView(wb as Workbook, MDTables() As Variant)

Passing in the a workbook and model adds more flexibility yet.

Sub CreateMDWorksheetView(ws as WorkSheet, MDTables() As Variant)

Maybe you decide that you need all the tables on a single worksheet, just write another method and pass in you data model.

Consider writing a class to manage your code.

MissingDataReport

• Sub LoadData(Data As Variant)
• Sub CreateWorkbookView(wb as Workbook, MDTables() As Variant)
• Sub CreateWorksheetView(ws as WorkSheet, MDTables() As Variant)
• Function TableCount() as Long
• Function GetTable(TableIndex as Long) As Variant()
• Function GetTables() As Variant()
• Function GetTableWithoutMissingData() As Variant()

When tackling a complex problem, I will scaffold out the code. I probably won't use many of the methods but helps me focus on specific tasks. Doing it this way makes me more productive. If I get stuck a one part, I can work on other parts of the code to clear my head.

Dim i As Integer, j As Integer

If I could, I would edit every post about arrays on the internet to use r and c. It's a little thing but it really make the code easier to read.

Dim r As Long, c As Long

By the way, use Long and not Integer. An Integer max value is only 32,767 where as Long is 2,147,483,647.

Consider:

Dim i As Long, j As Long, k As Long, l As Long

Compared to:

Dim r1 As Long, c1 As Long, r2 As Long, c2 As Long

Magic Numbers

Magic numbers are hard-coded values that are not explained. Magic number should be replaced by consts.

Const FirstDataColumn as Long = 3 For currentRow = FirstDataColumn To lastRow

Key Columns

Having hard coded key columns really limits the use case for the code. Consider using key columns and excluded columns to calculate missing data columns. This would take you code to a whole new level.

Add a Reference to the Microsoft Scripting Runtime

This enables Intellisense to do what it does so well; make us more productive.

Scripting.Dictionary Intellisense

Stack Exchange Network

VBA: Efficiently Organise Data with Missing Values to Achieve Minimum Number of Tables

2 Answers 2

SeparateGroupedMissingDataToNewWorkbookOptimised() - SGMDTNWO

Dim i As Integer, j As Integer

Magic Numbers

Key Columns

Add a Reference to the Microsoft Scripting Runtime

You must log in to answer this question.

Hot Network Questions

VBA: Efficiently Organise Data with Missing Values to Achieve Minimum Number of Tables

2 Answers 2

SeparateGroupedMissingDataToNewWorkbookOptimised() - SGMDTNWO

Dim i As Integer, j As Integer

Magic Numbers

Key Columns

Add a Reference to the Microsoft Scripting Runtime

You must log in to answer this question.

Related

Hot Network Questions