Graph Interface: Storing Vertices in an Array vs HashTable

Question 1

I have just started learning graph algorithms and am trying to arrive at an ideal interface. I understand that this code will not be used anywhere else (I will certainly use Boost::Graph) but I just want to make sure that what I am writing is not completely wrong.

Specifically, my question in the implementation below regards the data structure used to store nodes/vertices and edges. All the other students in the MOC I am enrolled in use an std::vector to represent the array of vertices/nodes and edges in the graph.

However, I was wondering if it might be more prudent to use an std::unordered_map (i.e. a hashtable) instead. This is primarily because node removal is something that is expected to happen quite frequently in many graph algorithms (eg. Karger's min-cut) and storing nodes in an std::vector should make these linear time operations. Storing it in an std::unordered_map will allow for removal/insertion/lookup in constant time.

The only drawback of such an implementation vs one that stores the nodes as a contiguous array is that finding a random edge/node is necessarily a linear time operation in the hash table implementation while it is a constant time in the array based implementation.

Is my reasoning above correct?

Any other comments regarding the code is also welcome.

namespace algorithms{
 /// \struct Graph
 /// \brief Representation of graph with vertices and edges
 template <typename ValueType>
 struct Graph{
 /// \brief Constructs a graph with 0 vertices and 0 edges
 Graph() = default;
 /// \brief Constructs a graph with n vertices and 0 edges
 Graph(const int& num_vertices);
 /// \brief Copy constructor
 Graph(const Graph& other);
 /// \brief Creates a new vertex with specified id and returns it
 /// This is ideal for creating graphs based on adjacency lists
 /// \note Vertex creation will fail if there is already another vertex
 /// with specified id already
 /// \return true if vertex creation succeeded, false otherwise
 bool create_vertex(const int& vertex_id);
 /// \brief Removes a vertex from this graph
 /// \note Removal can fail if no vertex exists with specified id
 /// \return true if vertex removal succeeded, false otherwise
 bool remove_vertex(const int& vertex_id);
 /// \brief Adds an edge to the graph
 /// \return true if edge addition succeeds, false if it fails
 bool add_edge(const int& first_vertex_id, const int& second_vertex_id);
 /// \brief Removes an edge from graph
 bool remove_edge(const int& first_vertex_id, const int& second_vertex_id);
 /// \brief Returns the number of vertices in graph
 int get_vertex_count() const;
 /// \brief Returns the number of edges in graph
 int get_edge_count() const;
 /// \brief Fills output vector with the ids of neighbouring vertices to specified one
 /// \return true if there is a vertex with specified id, false otherwise
 bool get_adjacent_vertices(const int& vertex_id, std::vector<int>& output_adjacent_vertices) const;
 /// \brief Fills output vector with adjacent vertices along with number of edges between
 /// specified vertex and the corresponding neighbour
 /// \param[in] vertex_id id of vertex whose neighbours are sought
 /// \param[in] consider_directed boolean indicating if edges are directed or not
 /// \param[out] adjacent_vertices output vector of std::pair(adjacent vertex, num edges between specified vertex and this adjacent one)
 /// \return true if there is a vertex with specified id, false otherwise
 bool get_adjacent_vertices(const int& vertex_id, std::vector<std::pair<int, int>>& output_adjacent_vertices, bool consider_directed) const;
 /// \brief Returns a random vertex from the graph
 /// \note This is a function with linear time complexity
 int get_random_vertex_id();
 /// \brief Returns a random edge from the graph, i.e. it sets the two output params
 /// with the vertex ids of the endpoints of this randomly chosen edge
 /// \note This is a function with linear time complexity
 void get_random_edge(int& start_vertex, int& end_vertex);
 /// \brief Returns whether an edge exists between the two specified vertex ids
 bool has_edge_between(const int& start_vertex, const int& end_vertex, bool consider_directed_edges_only) const;
 /// \brief Populates the input vector with the list of edges in it
 void get_edge_list(std::vector<std::pair<int, int>>& output_edge_list) const;
 /// \brief Returns number of edges between the two specified vertices
 int get_num_edges_between(const int& start_vertex, const int& end_vertex, bool consider_directed_edges_only) const;
 /// \brief Sets value for a particular vertex
 /// \return true if a vertex exists with specified id, false otherwise
 bool set_value(const int& vertex_id, const ValueType& value);
 /// \brief Returns value of a particular vertex
 /// \return true if a vertex exists with specified id, false otherwise
 bool get_value(const int& vertex_id, ValueType& output) const;
 /// \brief Overloads the stream operator to print this graph out
 template<typename VType>
 friend std::ostream& operator<<(std::ostream& os, const Graph<VType>& dt);
 private:
 /// \struct GraphVertex
 /// \brief A single vertex in a Graph
 struct GraphVertex{
 /// \brief Id uniquely identifying this vertex 
 int vertex_id;
 /// \brief Vector holding ids of adjacent vertices to this one
 /// \note We use a std::list instead of an std::vector to allow of efficient
 /// addition and removal of graph vertices
 std::unordered_set<int> adjacent_vertices;
 /// \brief Value to be assigned to this vertex
 ValueType value;
 };
 /// \struct PairHash
 /// \brief Provides a hash for an std::pair
 template <class T, typename U>
 struct PairHash{
 size_t operator()(const std::pair<T, U> &key) const{
 return std::hash<T>()(key.first) ^ std::hash<U>()(key.second);
 }
 };
 /// \struct PairEqual
 /// \brief Struct used for equality comparison in unordered maps with
 /// a pair as its key
 template <class T, typename U>
 struct PairEqual{
 bool operator()(const std::pair<T, U> &lhs, const std::pair<T, U> &rhs) const{
 return lhs.first == rhs.first && lhs.second == rhs.second;
 }
 };
 /// \struct GraphEdge
 /// \brief A single edge in a Graph 
 struct GraphEdge{
 /// \brief Array of two vertices making up this edge 
 int end_points[2];
 };
 /// \brief A hash map containing the vertices in this graph refrenced against their ids
 std::unordered_map<int, std::unique_ptr<GraphVertex>> vertices;
 /// \brief A hash multimap of all the edges referenced by the ids of each edge's constituent vertices
 /// The reason for using an unordered_multimap instead of an unordered_map is to allow for multiple parallel edges between two vertices
 std::unordered_multimap<std::pair<int, int>, 
 std::unique_ptr<GraphEdge>, 
 PairHash<int, int>,
 PairEqual<int, int>> edges;
 /// \brief Random number generator
 std::mt19937 mt(std::random_device());
 /// \brief Flag keeping track of whether generator has been seeded
 bool seeded = false;
 };
 template <typename ValueType>
 std::ostream& operator<<(std::ostream& os, const algorithms::Graph<ValueType>& graph);
}
#include <algorithms/graph/graph.tch>
#endif // ALGORITHMS_GRAPH

Question 2

How is finding a random edge/node constant time in an array? Also complexity ignored constant factors and is only meaningful for "a lot" of elements, and sometimes you run out of memory before you get to numbers where it's worth it.

Question 3

@nwp Finding an edge is constant time in an array because once you generate a random index 'r' in [0, array_size), you can jump to the r'th index in constant time. I don't understand what you mean by: "Also complexity ignored constant factors and is only meaningful for "a lot" of elements", nor in what context you say that: "sometimes you run out of memory before you get to numbers where it's worth it."

Question 4

Don't assume that O(1) is faster than O(n). You should measure that based on your data set size. There is a lot of hidden complexity in unordered_map. Also remember that vector gains a lot of sped from locality and hardware caching.

Question 5

Also the removal of a node does not nesacerily mean moving all the other nodes in the vector. Actually I would think the exact opposite. As the position of the node is probably its identifying feature. Removal of a node would simply be marking it as dead an O(1) operation.

Question 6

In general your reasoning is ok

I think it is very valid to try to optimize the time complexity for certain operations if you do not know the size of the structures beforehand. That said, of course it helps to have actual use cases for performance comparisons.

Provide the full code

It's hard to provide feedback without the actual implementation of your methods. For instance I would think your implementation of get_adjacent_vertices is probably inefficient based on your data structures.

Improve your documentation

Your documentation is out of sync with the code. That is very bad, worse than no documentation. ("Creates a new vertex with specified id and returns it", "We use a std::list instead of an std::vector")
Your documentation is missing the most important information: What is your actual graph? directed? multiple edges? values at vertices/edges?
It should always be clear for all methods what failure conditions are. Also make clear when failure doesn't mean exception but return flags.

Vertex "type"

What actually represents a vertex in your graph? Do you really want to have separate ids for vertices? Or is a vertex uniquely identified by it's value? In any case you should use a vertex_type to identify your vertices in method paramters, similar to how std::vector::size_type is used.

Return by value or reference instead of by input reference

For instance your get_value always requires an unnecessary and potentially expensive copy operation (from the internal GraphVertex::value to the outside variable provided by the caller through the reference). Consider implementing ValueType& operator[](int vertex_id) and const ValueType& operator[](int vertex_id) const instead of get_value and set_value. This is also more consistent with common interfaces.

For example in your current get_edge_list - what happens if there are already elements in the provided vector? It complicates things and is more difficult to use. It requires to user to define a variable first and then passing it into your methods rather than defining it directly from a return value.

If you have multiple return values use a std::pair or GraphEdge.

Take `int`s by value instead of `const&`

If you pass small types that are cheap to copy, it is preferred to take them by value. Note: if you do not know the type in generic code, a const& is just fine.

Allow for in-place construction

Currently a user has to first create a vertex and then assign it a value in a second method call. This is complicated and error prone. Instead allow for in-place construction (emplace) or at least copy construction of new vertices.

`GraphEdge`

There are a number of things wrong with GraphEdge.

You don't use it consistently. Sometimes you use std::pair instead
Don't use an array there. Seriously. Why would you introduce so many ways to shoot yourself in the foot without need.

`PairHash`

Your PairHash implementation is bad, because it means hash({a,b}) == hash({b,a}) (but not {a,b} == {b,a}).

`PairEqual`

There are already operator== and friends for std::pair.

Avoid inconsistent state

Instead of keeping track if your rng is seeded, seed it in the constructor.

Follow standard naming conventions

I find some of your names to be too verbose and far away from common naming schemes (stl, boost). Especially get_....

For instance get_num_edges_between. I would think count_edges(vertex_type, vertex_type, directed) is just fine.

Avoid indirection if you don't need to

Why do you manage the GraphEdge by unique_ptr inside edges? This has negative performance impact. Use such an indirection only if you need to move the objects around often but cannot cheaply do so.

The same argument applies to vertices, but it requires more careful reasoning because the GraphVertex object is more complex. Is it required to often move around GraphVertexes in std::unordered_map operations? (I don't think so) Is it expensive to move a GraphVertex object? (Depends on ValueType.)

Zulan ZulanZulan 7823 silver badges12 bronze badges · Answer 1 · 2015-11-18 09:12:47Z

In general your reasoning is ok

I think it is very valid to try to optimize the time complexity for certain operations if you do not know the size of the structures beforehand. That said, of course it helps to have actual use cases for performance comparisons.

Provide the full code

It's hard to provide feedback without the actual implementation of your methods. For instance I would think your implementation of get_adjacent_vertices is probably inefficient based on your data structures.

Improve your documentation

Your documentation is out of sync with the code. That is very bad, worse than no documentation. ("Creates a new vertex with specified id and returns it", "We use a std::list instead of an std::vector")
Your documentation is missing the most important information: What is your actual graph? directed? multiple edges? values at vertices/edges?
It should always be clear for all methods what failure conditions are. Also make clear when failure doesn't mean exception but return flags.

Vertex "type"

What actually represents a vertex in your graph? Do you really want to have separate ids for vertices? Or is a vertex uniquely identified by it's value? In any case you should use a vertex_type to identify your vertices in method paramters, similar to how std::vector::size_type is used.

Return by value or reference instead of by input reference

For instance your get_value always requires an unnecessary and potentially expensive copy operation (from the internal GraphVertex::value to the outside variable provided by the caller through the reference). Consider implementing ValueType& operator[](int vertex_id) and const ValueType& operator[](int vertex_id) const instead of get_value and set_value. This is also more consistent with common interfaces.

For example in your current get_edge_list - what happens if there are already elements in the provided vector? It complicates things and is more difficult to use. It requires to user to define a variable first and then passing it into your methods rather than defining it directly from a return value.

If you have multiple return values use a std::pair or GraphEdge.

Take `int`s by value instead of `const&`

If you pass small types that are cheap to copy, it is preferred to take them by value. Note: if you do not know the type in generic code, a const& is just fine.

Allow for in-place construction

Currently a user has to first create a vertex and then assign it a value in a second method call. This is complicated and error prone. Instead allow for in-place construction (emplace) or at least copy construction of new vertices.

`GraphEdge`

There are a number of things wrong with GraphEdge.

You don't use it consistently. Sometimes you use std::pair instead
Don't use an array there. Seriously. Why would you introduce so many ways to shoot yourself in the foot without need.

`PairHash`

Your PairHash implementation is bad, because it means hash({a,b}) == hash({b,a}) (but not {a,b} == {b,a}).

`PairEqual`

There are already operator== and friends for std::pair.

Avoid inconsistent state

Instead of keeping track if your rng is seeded, seed it in the constructor.

Follow standard naming conventions

I find some of your names to be too verbose and far away from common naming schemes (stl, boost). Especially get_....

For instance get_num_edges_between. I would think count_edges(vertex_type, vertex_type, directed) is just fine.

Avoid indirection if you don't need to

Why do you manage the GraphEdge by unique_ptr inside edges? This has negative performance impact. Use such an indirection only if you need to move the objects around often but cannot cheaply do so.

The same argument applies to vertices, but it requires more careful reasoning because the GraphVertex object is more complex. Is it required to often move around GraphVertexes in std::unordered_map operations? (I don't think so) Is it expensive to move a GraphVertex object? (Depends on ValueType.)

Stack Exchange Network

Graph Interface: Storing Vertices in an Array vs HashTable

1 Answer 1

In general your reasoning is ok

Provide the full code

Improve your documentation

Vertex "type"

Return by value or reference instead of by input reference

Take `int`s by value instead of `const&`

Allow for in-place construction

`GraphEdge`

`PairHash`

`PairEqual`

Avoid inconsistent state

Follow standard naming conventions

Avoid indirection if you don't need to

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Graph Interface: Storing Vertices in an Array vs HashTable

1 Answer 1

In general your reasoning is ok

Provide the full code

Improve your documentation

Vertex "type"

Return by value or reference instead of by input reference

Take ints by value instead of const&

Allow for in-place construction

GraphEdge

PairHash

PairEqual

Avoid inconsistent state

Follow standard naming conventions

Avoid indirection if you don't need to

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

Take `int`s by value instead of `const&`

`GraphEdge`

`PairHash`

`PairEqual`