I was assigned the connected components labeling (CCL) algorithm and in order to implement the CCL algorithm I first had to implement the Union-Find algorithm.
This is my try at it, any review and tip would be greatly appreciated
classdef UnionFind < handle
properties
PARENT = containers.Map('KeyType', 'double', 'ValueType','any');
end
methods
% Constructor Function
% function obj = UnionFind(items)
% for i = 1:5
% obj.PARENT(items(i)) = items(i);
% end
% end
% Add Item Function
function addItem(obj, itemToAdd)
obj.PARENT(itemToAdd) = itemToAdd;
end
% Find Function
function root = Find(obj, itemToFind)
while (obj.PARENT(itemToFind) ~= itemToFind)
obj.PARENT(itemToFind) = obj.PARENT(obj.PARENT(itemToFind));
itemToFind = obj.PARENT(itemToFind);
end
root = itemToFind;
end
% Union Function
function Union(obj, setOne, setTwo)
obj.PARENT(setOne) = setTwo;
end
end
end
1 Answer 1
I made a scrip that executes your data structure to test it. It doesn't verify correctness, it's just for timing purposes.
tic
uf = UnionFind;
uf.addItem(1);
for ii = 2:10000
uf.addItem(ii);
n = randi(4);
if n == 4 % happens in 25% of cases
uf.Union(ii, uf.Find(randi(ii-1)));
end
if n ~= 1 % happens in 75% of cases
uf.Union(uf.Find(ii), uf.Find(randi(ii-1))); % ii is no longer necessarily the root
end
end
toc
Basically, it adds 10,000 elements to the data structure, taking the union of the new element with a randomly selected prior element in 50% of the cases, and with two prior elements in an additional 25% of cases. This does not exactly match how the data structure will be used in a connected components labeling algorithm (Find
would be called much more frequently), but it gets close.
This script takes about 1.7 s to run using MATLAB Online (R2019b). Making the simple change of replacing the containers.Map
with a normal numeric array:
properties
%PARENT = containers.Map('KeyType', 'double', 'ValueType','any');
PARENT = [];
end
changes the execution time to 0.097 s. containers.Map
is not a very efficient data structure (it's a custom class, much like any user can write, which is obviously going to be more expensive to access than a native type). Assuming that the elements are consecutive integers (as should be the case in the connected component labeling algorithm and any other image processing algorithm that uses Union-Find), there is no advantage whatsoever to using a map, a simple array can be indexed using the element's value.
A second improvement that can be made to the code is in usability: the Union operation should always be made on two roots. It is convenient to have the Union
function find the roots of its two input arguments, and it prevents wrong use (taking the union with a non-root element would produce wrong results in any algorithm using this data structure):
function Union(obj, setOne, setTwo)
%obj.PARENT(setOne) = setTwo;
obj.PARENT(obj.Find(setOne)) = obj.Find(setTwo);
end
We can now simplify our test code to not call Find
at all. An additional improvement is to return the new root
. This is quite common as well, and will also improve our test code:
function root = Union(obj, setOne, setTwo)
root = obj.Find(setTwo);
obj.PARENT(obj.Find(setOne)) = root;
end
Typically the Union operation selects the smaller of the two trees to become the child of the other. It is however not directly clear if the additional storage space and logic required to maintain the tree sizes is worth it, one would have to implement this and compare to be sure. A simple alternative is to always make the root with the smaller index the parent of the other tree. This ensures that, more often than not, it is the larger tree that will be the parent. It could look like this:
function root = Union(obj, setOne, setTwo)
roots = sort([obj.Find(setOne), obj.Find(setTwo)]);
obj.PARENT(roots(1)) = roots(2);
root = roots(1);
This increases runtime by about 40%, so it's not a good change.
Our test code is now simplified to:
tic
uf = UnionFind;
uf.addItem(1);
for ii = 2:10000
root = ii;
uf.addItem(root);
n = randi(4);
if n == 4 % happens in 25% of cases
root = uf.Union(root, randi(ii-1));
end
if n ~= 1 % happens in 75% of cases
root = uf.Union(root, randi(ii-1));
end
end
toc
Regarding the use of handle
as a base class: this converts the class into a "handle class", objects of this type cannot be copied, all "copies" refer to the same underlying data. This leads at times to unexpected behavior in MATLAB, where typically b=a, b(1)=0
does not not change a
. However I think in this case it is reasonable to use a handle class for this data structure, for the same reasons that MATLAB's containers.Map
is a handle class.
Path compression. Union-Find data structures do not become truly efficient until one implements path compression. This means that trees remain totally flat, with all leaves pointing directly at the root. Find
in this case is usually an O(1) operation, rather than O(log n), converting the CLL algorithm that uses this data structure from O(n log n) to something that is nearly O(n).
Path compression is typically implemented using recursion. Implementing it without recursion is quite awkward. But MATLAB's function calls are still quite expensive, so it is not directly clear that this would be an improvement. Again, one would need to implement it and compare to know for sure.
The code in the OP uses tree halving instead. This is a reasonable compromise because it can be implemented without recursion.
Path compression would look like this:
function root = Find(obj, item)
root = obj.PARENT(item);
if root ~= item
root = obj.Find(root);
obj.PARENT(item) = root;
end
end
This increases the test code running time by about 15%, so it is not a good change.
Finally:
A variable name in all caps (
PARENT
) looks strange to me, using all lowercase letters would be more in line with customary MATLAB style.The constructor is commented out. You don't need this constructor in most cases, you could either delete it altogether, or uncomment it. A block of commented out code can be confusing.
There is no help text. A comment block at the top of the file should explain the purpose of the class and how to use it. Typing
help UnionFind
in MATLAB will show this comment block. Likewise, each function should have some documentation as a comment block either just above or just below thefunction
line.