I am trying to implement the algorithm described in the image from this link: Graph coloring problem.
Simply put, any two adjacent nodes must not share tuples; if they do, they are colored with the same color. If it is possible to suppress tuples so that two adjacent nodes become disjoint, then we do that (after exploring all possibilities), color the two nodes with different colors, and move on to the next node in the graph. The `coloring` method returns `true` if all nodes were colored, and `false` otherwise.
The initial graph (all nodes of the graph are contained in and passed through a vector) starts with a group of tuples (datasets) for each node:

    // Graph Coloring (using a Vector for the mapping of the nodes)
    public boolean coloring(SparkSession sparksession, Dataset<Row> dF, Graph graph, Iterator<Node> nodeIterator, Vector<Node.Pair> vector) {
        Set<Node> EmptySet = Collections.emptySet();
        if (vector.size() == graph.adjacencyList.size())
            return true;
        // Returns all clusters (the initial tuples (nodes) of graph a) : see attached picture)
        ArrayList<ArrayList<Dataset<Row>>> allClusters = clusters(sparksession, dF, constraints);
        Node nodeIt;
        if (nodeIterator.hasNext())
            nodeIt = nodeIterator.next();
        else {
            return false;
        }
        // Select the cluster
        ArrayList<Dataset<Row>> cluster = allClusters.get(Integer.parseInt(nodeIt.name));
        if (graph.getNeighbors(nodeIt) == null || graph.getNeighbors(nodeIt) == EmptySet) {
            // if node has no neighbors or graph has one node only
            if (!nodeIterator.hasNext()) {
                colorNode(vector, nodeIt, cluster.get(0));
                return false;
            } else {
                colorNode(vector, nodeIt, cluster.get(0));
                nodeIterator.next();
            }
        }
        Iterable<Node> adjNodes = graph.getNeighbors(nodeIt);
        Iterator<Node> adjNodesIt = adjNodes.iterator();
        while (adjNodesIt.hasNext()) {
            Node adjNode = adjNodesIt.next();
            if (!checkNodeColored(vector, adjNode)) {
                ArrayList<Dataset<Row>> adjCluster = allClusters.get(Integer.parseInt(adjNode.name));
                for (Dataset<Row> subCluster : cluster) {
                    for (Dataset<Row> subAdjCluster : adjCluster) {
                        // small datasets (tuples of rows) don't intersect
                        if (datasetIntersect(sparksession, subCluster, subAdjCluster)) {
                            if (coloring(sparksession, dF, graph, nodeIterator, vector, constraints)) {
                                return true;
                            } else {
                                vector.remove(vector.size() - 1);
                            }
                        }
                    }
                }
            } else if (!adjNodesIt.hasNext()) {
                for (Dataset<Row> ss : cluster) {
                    // Color last node anyway
                    colorNode(vector, nodeIt, ss);
                }
                return true;
            }
        }
        return false;
    }

    public static void colorNode(Vector<Node.Pair> vector, Node node, Dataset<Row> subCluster) {
        Random random = new Random();
        int nextInt = random.nextInt(0xffffff + 1);
        String colorCode = String.format("#%06x", nextInt);
        String newColor = new String(colorCode);
        HashMap<String, Dataset<Row>> clusterColor = new HashMap<String, Dataset<Row>>();
        clusterColor.put(newColor, subCluster);
        Pair vectorPair = new Pair(node, clusterColor);
        vector.add(vectorPair);
    }

I want to know whether the `coloring` method makes sense and whether it correctly implements the algorithm in the attached figure, especially in the nested for loops, and whether there is a way to optimize it so that all the nodes are processed without redundant checks. PS: I cannot share all of the code here, so any question about my approach is welcome! Thanks.
1 Answer
coloring
- Without seeing the rest of the class, I have no idea why this needs to not be `static` - maybe one of the functions it calls interacts with some internal state, but it doesn't look like that's the case. If it doesn't need an instance of whatever class it's part of, making it `static` might be better
- That said, maybe making the coloring into a class with more internal state would make the method interfaces a bit easier to understand. There's a lot of information getting passed around which feels like it could be bundled together into an object somehow - but from this code alone it's a bit hard to tell how
- Mixing `Vector` and `ArrayList` is a bit unusual. Their common supertype `List` would seem to be specific enough
- I might find it more natural to have the coloring (`vector`) be the return value of this function, rather than having it be an "out parameter" of sorts - I assume the idea is that people might in fact care about what the coloring looks like, not just whether one exists?
- It might be more convenient to make this function `private` and have a `public` function which just calls this one but passes a new (empty) vector at the start. Or do you expect people will want to call this function with a non-empty `vector`?
- I'm not sure if re-calculating `allClusters` on each recursive call is the best approach - it shouldn't change over the course of the call, right? In that case, might it be better to take that as a method parameter instead of `dF` and `constraints` (and maybe even `sparkSession`)? (These last few points are sketched together after this list.)
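
To make those last points concrete, here is a rough sketch of what a public entry point could look like. It is only an illustration: `findColoring` is a name I made up, the type of `constraints` is a guess since it is not shown in the question, and it assumes a private overload of `coloring` that takes the precomputed clusters and the partial coloring as parameters instead of recomputing and mutating them.

    // Sketch only: findColoring is an illustrative name, and the type of
    // `constraints` is a guess (it is not shown in the posted code).
    public List<Node.Pair> findColoring(SparkSession sparkSession, Dataset<Row> dF,
                                        Graph graph, Iterator<Node> nodes,
                                        Dataset<Row> constraints) {
        // Compute the clusters once, up front, instead of on every recursive call
        ArrayList<ArrayList<Dataset<Row>>> allClusters = clusters(sparkSession, dF, constraints);
        // Start from an empty partial coloring so callers never have to supply one
        List<Node.Pair> partialColoring = new ArrayList<>();
        // Assumed: a private overload of coloring(...) that takes the precomputed
        // clusters and the partial coloring rather than recomputing/recreating them
        if (coloring(sparkSession, graph, nodes, allClusters, partialColoring)) {
            return partialColoring;          // the complete coloring that was found
        }
        return Collections.emptyList();      // no valid coloring exists
    }

That way callers get the coloring itself rather than an out parameter, and the Spark-side work (the `clusters` call) stays out of the recursion.
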
colorNode
- Creating a new `Random` each call is inefficient and could make the randomness more predictable. I'd suggest moving the `Random` out of this function and into a `static` variable or something (see the sketch after this list)
- Is assigning a color code actually important at this point? Might it not be better to first figure out whether a coloring exists, and only assign color codes once we actually have a complete coloring?
Other notes
- Parsing node names as ints and then using those names to index a list feels pretty roundabout - either the name is guaranteed to be `int`-shaped (in which case it should be an `int`, not a `String`), or it isn't (in which case using it as an index into an `ArrayList` is wrong). Wouldn't `Map<String, Collection<Dataset<Row>>>`, or even `Map<Node, Collection<Dataset<Row>>>`, feel more natural? (There is a small sketch of this after the list.)
- What does it mean when `graph.getNeighbors` returns `null`? I can't see how it'd be meaningfully different from having it return an empty `Set` - am I missing something obvious?
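
As a sketch of that first point (the name `clustersByNode` is mine, and this assumes `Node` has sensible `equals`/`hashCode` implementations, which I cannot verify from the posted code):

    // Hypothetical reshaping of allClusters: keyed by the nodes themselves instead
    // of an ArrayList indexed by Integer.parseInt(node.name)
    Map<Node, List<Dataset<Row>>> clustersByNode = new HashMap<>();

    // ... filled wherever clusters(...) currently builds the nested ArrayList ...

    List<Dataset<Row>> cluster = clustersByNode.get(nodeIt);
    List<Dataset<Row>> adjCluster = clustersByNode.get(adjNode);

That removes the name parsing entirely and makes the lookups independent of how the nodes happen to be numbered.
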
Nitpicks
- An identity comparison to `Collections.emptySet()` seems fragile - is `Graph::getNeighbors` really guaranteed to return that particular empty set, or is it just guaranteed to return an empty set? You probably want `graph.getNeighbors(nodeIt).isEmpty()` instead
- In `coloring`, when handling the case where a node has no neighbors, the call to `colorNode` is repeated exactly in two different branches of an `if` - we might want to move that outside (both of these nitpicks are sketched together after this list)
- Commented-out code doesn't benefit humans or computers, it just takes up space and should be deleted
- It feels a bit weird how almost every place refers to `Node.Pair` by its full name, but one line refers to it simply as `Pair` - that seemed deliberate enough that I actually thought they were two different classes at first. Being consistent about how to refer to things, especially within a single function, does make the logic a bit easier to follow
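
Combining the first two nitpicks, the no-neighbour handling could read roughly like this - a sketch only, assuming `getNeighbors` returns a `Set<Node>` (as the comparison against `Collections.emptySet()` suggests), and keeping the original behaviour otherwise, including the `return false` on the last node:

    Set<Node> neighbors = graph.getNeighbors(nodeIt);
    if (neighbors == null || neighbors.isEmpty()) {
        // No neighbors (or a single-node graph): the colorNode call is now shared
        // by both branches instead of being duplicated
        colorNode(vector, nodeIt, cluster.get(0));
        if (!nodeIterator.hasNext()) {
            return false;    // original behaviour for the last node, kept as-is
        }
        nodeIterator.next();
    }
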
Naming
- Variable names shouldn't describe how the data is shaped but rather the data's purpose. A name like `nodeIterator` tells me nothing that the type declaration `Iterator<Node>` doesn't. The `vector` parameter in particular seems to have interesting traits and a clear purpose (if I'm understanding the code correctly, it's a partial coloring, a valid coloring of some subgraph of the graph we're working on), but its name tells readers absolutely none of that - a name like `partialColoring` or `coloringCandidate` or something similar would make that much clearer
- Consistency matters. Having `nodeIt` be a `Node` but `adjNodesIt` be an `Iterator<Node>` is confusing. Calling `nodeIt` something like `currentNode` would make the naming more consistent and make that variable's purpose a bit clearer
- When there's a thing called `x` and a thing called `subX`, I think most readers would usually expect `subX` to have the same shape as `x`, only smaller (like how a subset is-a set). But here `subCluster` is not a smaller `cluster` but an element of it - their shapes are different enough they even have different data types. I think replacing the names `subCluster, cluster` with `cluster, nodeData`, or even `cluster, clusters` (and doing the equivalent for `subAdjCluster, adjCluster`) might communicate the relationship between those variables more clearly (see the short sketch after this list)
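
Purely as an illustration of that last rename (the variable names `currentNodeClusters` and `adjacentNodeClusters` are placeholders of my own), the inner loops might then read:

    // currentNodeClusters = all candidate datasets for the current node,
    // cluster             = one candidate (what the original calls subCluster)
    for (Dataset<Row> cluster : currentNodeClusters) {
        for (Dataset<Row> adjCluster : adjacentNodeClusters) {
            if (datasetIntersect(sparkSession, cluster, adjCluster)) {
                // same recursion and backtracking as in the original
            }
        }
    }
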
- Thanks Sarah for your comments, but what I am looking for is rather comments on the algorithm implementation (basically the `coloring` method): how to color each node based on the tuple intersections, how to move to the next node and when, etc. Am I maybe processing nodes many times? Is there a more optimized approach to browse the graph? What about the limit cases (the processing of the first and last nodes)? That kind of thing. Were you not able to understand the algorithm from what the figure shows and from what I described in my question? Is it not clear enough? - John Campbell, Oct 6, 2021 at 16:12
- `Dataset`, `Row` and `SparkSession` are all from `org.apache.spark.sql`, while `Graph` and `Node` are unrelated to the Spark classes with those names - is that right? Is there any relationship between the `Pair` and `Node.Pair` classes? Are the methods to review part of one of the classes they mention, and if so, which?
- `coloring` is the one to review in this case. Thanks!
- Are `coloring` and `colorNode` part of one of the `Graph`, `Pair` and `Node` classes, or do they exist somewhere else?