I have just started learning graph algorithms and am trying to arrive at an ideal interface. I understand that this code will not be used anywhere else (I will certainly use Boost::Graph) but I just want to make sure that what I am writing is not completely wrong.
Specifically, my question in the implementation below regards the data structure used to store nodes/vertices and edges. All the other students in the MOC I am enrolled in use an std::vector
to represent the array of vertices/nodes and edges in the graph.
However, I was wondering if it might be more prudent to use an std::unordered_map
(i.e. a hashtable) instead. This is primarily because node removal is something that is expected to happen quite frequently in many graph algorithms (eg. Karger's min-cut) and storing nodes in an std::vector
should make these linear time operations. Storing it in an std::unordered_map will allow for removal/insertion/lookup in constant time.
The only drawback of such an implementation vs one that stores the nodes as a contiguous array is that finding a random edge/node is necessarily a linear time operation in the hash table implementation while it is a constant time in the array based implementation.
Is my reasoning above correct?
Any other comments regarding the code is also welcome.
namespace algorithms{
/// \struct Graph
/// \brief Representation of graph with vertices and edges
template <typename ValueType>
struct Graph{
/// \brief Constructs a graph with 0 vertices and 0 edges
Graph() = default;
/// \brief Constructs a graph with n vertices and 0 edges
Graph(const int& num_vertices);
/// \brief Copy constructor
Graph(const Graph& other);
/// \brief Creates a new vertex with specified id and returns it
/// This is ideal for creating graphs based on adjacency lists
/// \note Vertex creation will fail if there is already another vertex
/// with specified id already
/// \return true if vertex creation succeeded, false otherwise
bool create_vertex(const int& vertex_id);
/// \brief Removes a vertex from this graph
/// \note Removal can fail if no vertex exists with specified id
/// \return true if vertex removal succeeded, false otherwise
bool remove_vertex(const int& vertex_id);
/// \brief Adds an edge to the graph
/// \return true if edge addition succeeds, false if it fails
bool add_edge(const int& first_vertex_id, const int& second_vertex_id);
/// \brief Removes an edge from graph
bool remove_edge(const int& first_vertex_id, const int& second_vertex_id);
/// \brief Returns the number of vertices in graph
int get_vertex_count() const;
/// \brief Returns the number of edges in graph
int get_edge_count() const;
/// \brief Fills output vector with the ids of neighbouring vertices to specified one
/// \return true if there is a vertex with specified id, false otherwise
bool get_adjacent_vertices(const int& vertex_id, std::vector<int>& output_adjacent_vertices) const;
/// \brief Fills output vector with adjacent vertices along with number of edges between
/// specified vertex and the corresponding neighbour
/// \param[in] vertex_id id of vertex whose neighbours are sought
/// \param[in] consider_directed boolean indicating if edges are directed or not
/// \param[out] adjacent_vertices output vector of std::pair(adjacent vertex, num edges between specified vertex and this adjacent one)
/// \return true if there is a vertex with specified id, false otherwise
bool get_adjacent_vertices(const int& vertex_id, std::vector<std::pair<int, int>>& output_adjacent_vertices, bool consider_directed) const;
/// \brief Returns a random vertex from the graph
/// \note This is a function with linear time complexity
int get_random_vertex_id();
/// \brief Returns a random edge from the graph, i.e. it sets the two output params
/// with the vertex ids of the endpoints of this randomly chosen edge
/// \note This is a function with linear time complexity
void get_random_edge(int& start_vertex, int& end_vertex);
/// \brief Returns whether an edge exists between the two specified vertex ids
bool has_edge_between(const int& start_vertex, const int& end_vertex, bool consider_directed_edges_only) const;
/// \brief Populates the input vector with the list of edges in it
void get_edge_list(std::vector<std::pair<int, int>>& output_edge_list) const;
/// \brief Returns number of edges between the two specified vertices
int get_num_edges_between(const int& start_vertex, const int& end_vertex, bool consider_directed_edges_only) const;
/// \brief Sets value for a particular vertex
/// \return true if a vertex exists with specified id, false otherwise
bool set_value(const int& vertex_id, const ValueType& value);
/// \brief Returns value of a particular vertex
/// \return true if a vertex exists with specified id, false otherwise
bool get_value(const int& vertex_id, ValueType& output) const;
/// \brief Overloads the stream operator to print this graph out
template<typename VType>
friend std::ostream& operator<<(std::ostream& os, const Graph<VType>& dt);
private:
/// \struct GraphVertex
/// \brief A single vertex in a Graph
struct GraphVertex{
/// \brief Id uniquely identifying this vertex
int vertex_id;
/// \brief Vector holding ids of adjacent vertices to this one
/// \note We use a std::list instead of an std::vector to allow of efficient
/// addition and removal of graph vertices
std::unordered_set<int> adjacent_vertices;
/// \brief Value to be assigned to this vertex
ValueType value;
};
/// \struct PairHash
/// \brief Provides a hash for an std::pair
template <class T, typename U>
struct PairHash{
size_t operator()(const std::pair<T, U> &key) const{
return std::hash<T>()(key.first) ^ std::hash<U>()(key.second);
}
};
/// \struct PairEqual
/// \brief Struct used for equality comparison in unordered maps with
/// a pair as its key
template <class T, typename U>
struct PairEqual{
bool operator()(const std::pair<T, U> &lhs, const std::pair<T, U> &rhs) const{
return lhs.first == rhs.first && lhs.second == rhs.second;
}
};
/// \struct GraphEdge
/// \brief A single edge in a Graph
struct GraphEdge{
/// \brief Array of two vertices making up this edge
int end_points[2];
};
/// \brief A hash map containing the vertices in this graph refrenced against their ids
std::unordered_map<int, std::unique_ptr<GraphVertex>> vertices;
/// \brief A hash multimap of all the edges referenced by the ids of each edge's constituent vertices
/// The reason for using an unordered_multimap instead of an unordered_map is to allow for multiple parallel edges between two vertices
std::unordered_multimap<std::pair<int, int>,
std::unique_ptr<GraphEdge>,
PairHash<int, int>,
PairEqual<int, int>> edges;
/// \brief Random number generator
std::mt19937 mt(std::random_device());
/// \brief Flag keeping track of whether generator has been seeded
bool seeded = false;
};
template <typename ValueType>
std::ostream& operator<<(std::ostream& os, const algorithms::Graph<ValueType>& graph);
}
#include <algorithms/graph/graph.tch>
#endif // ALGORITHMS_GRAPH
-
1\$\begingroup\$ How is finding a random edge/node constant time in an array? Also complexity ignored constant factors and is only meaningful for "a lot" of elements, and sometimes you run out of memory before you get to numbers where it's worth it. \$\endgroup\$nwp– nwp2015年11月17日 08:58:35 +00:00Commented Nov 17, 2015 at 8:58
-
\$\begingroup\$ @nwp Finding an edge is constant time in an array because once you generate a random index 'r' in [0, array_size), you can jump to the r'th index in constant time. I don't understand what you mean by: "Also complexity ignored constant factors and is only meaningful for "a lot" of elements", nor in what context you say that: "sometimes you run out of memory before you get to numbers where it's worth it." \$\endgroup\$balajeerc– balajeerc2015年11月17日 10:03:57 +00:00Commented Nov 17, 2015 at 10:03
-
\$\begingroup\$ Don't assume that O(1) is faster than O(n). You should measure that based on your data set size. There is a lot of hidden complexity in unordered_map. Also remember that vector gains a lot of sped from locality and hardware caching. \$\endgroup\$Loki Astari– Loki Astari2015年11月17日 16:51:50 +00:00Commented Nov 17, 2015 at 16:51
-
\$\begingroup\$ Also the removal of a node does not nesacerily mean moving all the other nodes in the vector. Actually I would think the exact opposite. As the position of the node is probably its identifying feature. Removal of a node would simply be marking it as dead an O(1) operation. \$\endgroup\$Loki Astari– Loki Astari2015年11月17日 16:55:38 +00:00Commented Nov 17, 2015 at 16:55
1 Answer 1
In general your reasoning is ok
I think it is very valid to try to optimize the time complexity for certain operations if you do not know the size of the structures beforehand. That said, of course it helps to have actual use cases for performance comparisons.
Provide the full code
It's hard to provide feedback without the actual implementation of your methods. For instance I would think your implementation of get_adjacent_vertices
is probably inefficient based on your data structures.
Improve your documentation
- Your documentation is out of sync with the code. That is very bad, worse than no documentation. ("Creates a new vertex with specified id and returns it", "We use a std::list instead of an std::vector")
- Your documentation is missing the most important information: What is your actual graph? directed? multiple edges? values at vertices/edges?
- It should always be clear for all methods what failure conditions are. Also make clear when failure doesn't mean exception but return flags.
Vertex "type"
What actually represents a vertex in your graph? Do you really want to have separate ids for vertices? Or is a vertex uniquely identified by it's value? In any case you should use a vertex_type
to identify your vertices in method paramters, similar to how std::vector::size_type
is used.
Return by value or reference instead of by input reference
For instance your get_value
always requires an unnecessary and potentially expensive copy operation (from the internal GraphVertex::value
to the outside variable provided by the caller through the reference). Consider implementing ValueType& operator[](int vertex_id)
and const ValueType& operator[](int vertex_id) const
instead of get_value
and set_value
. This is also more consistent with common interfaces.
For example in your current get_edge_list
- what happens if there are already elements in the provided vector? It complicates things and is more difficult to use. It requires to user to define a variable first and then passing it into your methods rather than defining it directly from a return value.
If you have multiple return values use a std::pair
or GraphEdge
.
Take int
s by value instead of const&
If you pass small types that are cheap to copy, it is preferred to take them by value. Note: if you do not know the type in generic code, a const&
is just fine.
Allow for in-place construction
Currently a user has to first create a vertex and then assign it a value in a second method call. This is complicated and error prone. Instead allow for in-place construction (emplace) or at least copy construction of new vertices.
GraphEdge
There are a number of things wrong with GraphEdge
.
- You don't use it consistently. Sometimes you use
std::pair
instead - Don't use an array there. Seriously. Why would you introduce so many ways to shoot yourself in the foot without need.
PairHash
Your PairHash
implementation is bad, because it means hash({a,b}) == hash({b,a})
(but not {a,b} == {b,a}).
PairEqual
There are already operator==
and friends for std::pair
.
Avoid inconsistent state
Instead of keeping track if your rng is seeded
, seed it in the constructor.
Follow standard naming conventions
I find some of your names to be too verbose and far away from common naming schemes (stl, boost). Especially get_...
.
For instance get_num_edges_between
. I would think count_edges(vertex_type, vertex_type, directed)
is just fine.
Avoid indirection if you don't need to
Why do you manage the GraphEdge
by unique_ptr
inside edges
? This has negative performance impact. Use such an indirection only if you need to move the objects around often but cannot cheaply do so.
The same argument applies to vertices
, but it requires more careful reasoning because the GraphVertex
object is more complex. Is it required to often move around GraphVertex
es in std::unordered_map
operations? (I don't think so) Is it expensive to move a GraphVertex
object? (Depends on ValueType.)