I am trying to implement the algorithm described in the image from this link: Graph coloring problem.
Simply put, any two adjacent nodes must not share tuples; if they do, they are colored with the same color. If it is possible to suppress tuples so that two adjacent nodes become disjoint, then we do that (after exploring all possibilities), color the two nodes with different colors, and move on to the next node in the graph. The `coloring` method returns `true` if all nodes were colored, and `false` otherwise.
The initial graph (all nodes of the graph are contained in and passed through a vector) starts with a group of tuples (datasets) for each node:

    // Graph Coloring (using a Vector for the mapping of the nodes)
    public boolean coloring(SparkSession sparksession, Dataset<Row> dF, Graph graph, Iterator<Node> nodeIterator, Vector<Node.Pair> vector) {
        Set<Node> EmptySet = Collections.emptySet();
        if (vector.size() == graph.adjacencyList.size())
            return true;
        // Returns all clusters (the initial tuples (nodes) of graph a) : see attached picture)
        ArrayList<ArrayList<Dataset<Row>>> allClusters = clusters(sparksession, dF, constraints);
        Node nodeIt;
        if (nodeIterator.hasNext())
            nodeIt = nodeIterator.next();
        else {
            return false;
        }
        // Select the cluster
        ArrayList<Dataset<Row>> cluster = allClusters.get(Integer.parseInt(nodeIt.name));
        if (graph.getNeighbors(nodeIt) == null || graph.getNeighbors(nodeIt) == EmptySet) {
            // if node has no neighbors or graph has one node only
            if (!nodeIterator.hasNext()) {
                colorNode(vector, nodeIt, cluster.get(0));
                return false;
            } else {
                colorNode(vector, nodeIt, cluster.get(0));
                nodeIterator.next();
            }
        }
        Iterable<Node> adjNodes = graph.getNeighbors(nodeIt);
        Iterator<Node> adjNodesIt = adjNodes.iterator();
        while (adjNodesIt.hasNext()) {
            Node adjNode = adjNodesIt.next();
            if (!checkNodeColored(vector, adjNode)) {
                ArrayList<Dataset<Row>> adjCluster = allClusters.get(Integer.parseInt(adjNode.name));
                for (Dataset<Row> subCluster : cluster) {
                    for (Dataset<Row> subAdjCluster : adjCluster) {
                        // small datasets (tuples of rows) don't intersect
                        if (datasetIntersect(sparksession, subCluster, subAdjCluster)) {
                            if (coloring(sparksession, dF, graph, nodeIterator, vector, constraints)) {
                                return true;
                            } else {
                                vector.remove(vector.size() - 1);
                            }
                        }
                    }
                }
            } else if (!adjNodesIt.hasNext()) {
                for (Dataset<Row> ss : cluster) {
                    // Color last node anyway
                    colorNode(vector, nodeIt, ss);
                }
                return true;
            }
        }
        return false;
    }

    public static void colorNode(Vector<Node.Pair> vector, Node node, Dataset<Row> subCluster) {
        Random random = new Random();
        int nextInt = random.nextInt(0xffffff + 1);
        String colorCode = String.format("#%06x", nextInt);
        String newColor = new String(colorCode);
        HashMap<String, Dataset<Row>> clusterColor = new HashMap<String, Dataset<Row>>();
        clusterColor.put(newColor, subCluster);
        Pair vectorPair = new Pair(node, clusterColor);
        vector.add(vectorPair);
    }

I want to know whether the `coloring` method makes sense and whether it correctly implements the algorithm in the attached figure, especially in the nested for loops, and whether there is a way to optimize it so that all the nodes are processed without redundant checks. PS: I cannot share all of the code here, so any question about my approach is welcome! Thanks.
1 Answer
coloring
- Without seeing the rest of the class, I have no idea why this needs to not be `static` - maybe one of the functions it calls interacts with some internal state, but it doesn't look like that's the case. If it doesn't need an instance of whatever class it's part of, making it `static` might be better
- That said, maybe making the coloring into a class with more internal state would make the method interfaces a bit easier to understand. There's a lot of information getting passed around which feels like it could be bundled together into an object somehow - but from this code alone it's a bit hard to tell how
- Mixing `Vector` and `ArrayList` is a bit unusual. Their common supertype `List` would seem to be specific enough
- I might find it more natural to have the coloring (`vector`) be the return value of this function, rather than having it be an "out parameter" of sorts - I assume the idea is that people might in fact care about what the coloring looks like, not just whether one exists?
- It might be more convenient to make this function `private` and have a `public` function which just calls this one but passes a new (empty) vector at the start. Or do you expect people will want to call this function with a non-empty `vector`?
- I'm not sure if re-calculating `allClusters` on each recursive call is the best approach - it shouldn't change over the course of the call, right? In that case, might it be better to take that as a method parameter instead of `dF` and `constraints` (and maybe even `sparkSession`)? (These last few points are sketched together after this list.)
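
To make those last points concrete, here is a rough sketch of what a public entry point could look like. It is only an illustration: `findColoring` is a name I made up, the type of `constraints` is a guess since it is not shown in the question, and it assumes a private overload of `coloring` that takes the precomputed clusters and the partial coloring as parameters instead of recomputing and mutating them.

    // Sketch only: findColoring is an illustrative name, and the type of
    // `constraints` is a guess (it is not shown in the posted code).
    public List<Node.Pair> findColoring(SparkSession sparkSession, Dataset<Row> dF,
                                        Graph graph, Iterator<Node> nodes,
                                        Dataset<Row> constraints) {
        // Compute the clusters once, up front, instead of on every recursive call
        ArrayList<ArrayList<Dataset<Row>>> allClusters = clusters(sparkSession, dF, constraints);
        // Start from an empty partial coloring so callers never have to supply one
        List<Node.Pair> partialColoring = new ArrayList<>();
        // Assumed: a private overload of coloring(...) that takes the precomputed
        // clusters and the partial coloring rather than recomputing/recreating them
        if (coloring(sparkSession, graph, nodes, allClusters, partialColoring)) {
            return partialColoring;          // the complete coloring that was found
        }
        return Collections.emptyList();      // no valid coloring exists
    }

That way callers get the coloring itself rather than an out parameter, and the Spark-side work (the `clusters` call) stays out of the recursion.
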
colorNode
- Creating a new `Random` each call is inefficient and could make the randomness more predictable. I'd suggest moving the `Random` out of this function and into a `static` variable or something (see the sketch after this list)
- Is assigning a color code actually important at this point? Might it not be better to first figure out whether a coloring exists, and only assign color codes once we actually have a complete coloring?
Other notes
- Parsing node names as ints and then using those names to index a list feels pretty roundabout - either the name is guaranteed to be `int`-shaped (in which case it should be an `int`, not a `String`), or it isn't (in which case using it as an index into an `ArrayList` is wrong). Wouldn't `Map<String, Collection<Dataset<Row>>>`, or even `Map<Node, Collection<Dataset<Row>>>`, feel more natural? (There is a small sketch of this after the list.)
- What does it mean when `graph.getNeighbors` returns `null`? I can't see how it'd be meaningfully different from having it return an empty `Set` - am I missing something obvious?
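
As a sketch of that first point (the name `clustersByNode` is mine, and this assumes `Node` has sensible `equals`/`hashCode` implementations, which I cannot verify from the posted code):

    // Hypothetical reshaping of allClusters: keyed by the nodes themselves instead
    // of an ArrayList indexed by Integer.parseInt(node.name)
    Map<Node, List<Dataset<Row>>> clustersByNode = new HashMap<>();

    // ... filled wherever clusters(...) currently builds the nested ArrayList ...

    List<Dataset<Row>> cluster = clustersByNode.get(nodeIt);
    List<Dataset<Row>> adjCluster = clustersByNode.get(adjNode);

That removes the name parsing entirely and makes the lookups independent of how the nodes happen to be numbered.
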
Nitpicks
- An identity comparison to `Collections.emptySet()` seems fragile - is `Graph::getNeighbors` really guaranteed to return that particular empty set, or is it just guaranteed to return an empty set? You probably want `graph.getNeighbors(nodeIt).isEmpty()` instead
- In `coloring`, when handling the case where a node has no neighbors, the call to `colorNode` is repeated exactly in two different branches of an `if` - we might want to move that outside (both of these nitpicks are sketched together after this list)
- Commented-out code doesn't benefit humans or computers, it just takes up space and should be deleted
- It feels a bit weird how almost every place refers to `Node.Pair` by its full name, but one line refers to it simply as `Pair` - that seemed deliberate enough that I actually thought they were two different classes at first. Being consistent about how to refer to things, especially within a single function, does make the logic a bit easier to follow
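
Combining the first two nitpicks, the no-neighbour handling could read roughly like this - a sketch only, assuming `getNeighbors` returns a `Set<Node>` (as the comparison against `Collections.emptySet()` suggests), and keeping the original behaviour otherwise, including the `return false` on the last node:

    Set<Node> neighbors = graph.getNeighbors(nodeIt);
    if (neighbors == null || neighbors.isEmpty()) {
        // No neighbors (or a single-node graph): the colorNode call is now shared
        // by both branches instead of being duplicated
        colorNode(vector, nodeIt, cluster.get(0));
        if (!nodeIterator.hasNext()) {
            return false;    // original behaviour for the last node, kept as-is
        }
        nodeIterator.next();
    }
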
Naming
- Variable names shouldn't describe how the data is shaped but rather the data's purpose. A name like `nodeIterator` tells me nothing that the type declaration `Iterator<Node>` doesn't. The `vector` parameter in particular seems to have interesting traits and a clear purpose (if I'm understanding the code correctly, it's a partial coloring, a valid coloring of some subgraph of the graph we're working on), but its name tells readers absolutely none of that - a name like `partialColoring` or `coloringCandidate` or something similar would make that much clearer
- Consistency matters. Having `nodeIt` be a `Node` but `adjNodesIt` be an `Iterator<Node>` is confusing. Calling `nodeIt` something like `currentNode` would make the naming more consistent and make that variable's purpose a bit clearer
- When there's a thing called `x` and a thing called `subX`, I think most readers would usually expect `subX` to have the same shape as `x`, only smaller (like how a subset is-a set). But here `subCluster` is not a smaller `cluster` but an element of it - their shapes are different enough they even have different data types. I think replacing the names `subCluster, cluster` with `cluster, nodeData`, or even `cluster, clusters` (and doing the equivalent for `subAdjCluster, adjCluster`) might communicate the relationship between those variables more clearly (see the short sketch after this list)
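
Purely as an illustration of that last rename (the variable names `currentNodeClusters` and `adjacentNodeClusters` are placeholders of my own), the inner loops might then read:

    // currentNodeClusters = all candidate datasets for the current node,
    // cluster             = one candidate (what the original calls subCluster)
    for (Dataset<Row> cluster : currentNodeClusters) {
        for (Dataset<Row> adjCluster : adjacentNodeClusters) {
            if (datasetIntersect(sparkSession, cluster, adjCluster)) {
                // same recursion and backtracking as in the original
            }
        }
    }
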
- Thanks Sarah for your comments, but what I am looking for is rather comments on the algorithm implementation (basically the `coloring` method): how to color each node based on the tuple intersections, how to move to the next node and when, etc. Am I maybe processing nodes many times? Is there a more optimized approach to browse the graph? What about the limit cases (the processing of the first and last nodes)? That kind of thing. Were you not able to understand the algorithm from what the figure shows and from what I described in my question? Is it not clear enough? - John Campbell, Oct 6, 2021 at 16:12
- `Dataset`, `Row` and `SparkSession` are all from `org.apache.spark.sql`, while `Graph` and `Node` are unrelated to the Spark classes with those names - is that right? Is there any relationship between the `Pair` and `Node.Pair` classes? Are the methods to review part of one of the classes they mention, and if so, which?
- `coloring` is the one to review in this case. Thanks!
- Are `coloring` and `colorNode` part of one of the `Graph`, `Pair` and `Node` classes, or do they exist somewhere else?