Find the most-connected nodes in a graph

Question 1

I feel like this is a simple problem that I have overcomplicated. Any tips or different approach would be greatly appreciated.

Problem and goal

Given a list of coordinates (each line represents an 'edge' containing 'nodes') my goal is to obtain (half of the amount of the total) nodes that occur the most. I hope my example would clarify this problem:

Input file

Each line represents a 'coordinate' (in reality an 'Edge'), that consists of 2 nodes (integers).

Desired output

Preferably a vector that contains: [7,20,1]

Why 7,20,1? Because 7, 20 and 1 occur the most if you look at the input file. This is also half (3) of the total amount (6) of coordinates. Note: the output vector doesn't have to be sorted.

My goal is to increase the performance of my current approach (any other approach however is very appreciated (but please check my constraints in the next section). The current performance seems to be slow for 1 million edges (lines). An example of such a large coordinates file can be obtained from here (8 MB file).

Constraints

The input file is being read and the results are added as a vector of struct with type Coordinate (as my code will show). Please note that I can not change this (due to constraints that are beyond the scope of my question). In other words, the input parameter to my function is a vector<Coordinate>.
I would like to obtain an array/vector as an output of the highest occurrences.

My current approach and code

Essentially, I am 'looping' through my coordinates in the select_coordinates_highest_occurrence function and add each coordinate (along with its first and second element) to a map (function add_node_to_node_degree_map) that keeps track of the amount of the occurrences. If I do a simple benchmark, this seems to also be the main bottleneck of the code.

I then proceed by converting the map to a vector, since I am unable to sort the vector by the value.

After I have done that, I sort the vector (ascending) and 'cut' it in half so that I obtain the highest amount of coordinates.

#include <iostream>
#include <sstream>
#include <vector>
#include <fstream>
#include <string>
#include <unordered_map>
#include <algorithm>
typedef struct Coordinate {
 int x;
 int y;
} Coordinate;
void add_node_to_node_degree_map(std::unordered_map<int, int>& node_degree, int node) {
 if (!node_degree.count(node)) {
 node_degree[node] = 1;
 }
 else {
 node_degree[node] = node_degree[node] + 1;
 }
}
void select_coordinates_highest_occurrence(std::vector<Coordinate>& coordinates) {
 // Map all vertices onto a map along with their degree
 std::unordered_map<int, int> node_degree;
 for (auto &edge : coordinates) {
 add_node_to_node_degree_map(node_degree, edge.x);
 add_node_to_node_degree_map(node_degree, edge.y);
 }
 // Convert the map to a vector
 std::vector<std::pair<int, int>> node_degree_vect;
 for (auto it = node_degree.cbegin(); it != node_degree.cend(); ++it)
 {
 node_degree_vect.push_back(std::make_pair(it->first, it->second));
 }
 // Sort the vector (ascending, high degree nodes are on top)
 std::sort(node_degree_vect.begin(), node_degree_vect.end(), [](const std::pair<int, int> &left, const std::pair<int, int> &right) {
 return left.second > right.second;
 });
 // Only collect the high occurrences (delete the 'other' half of the vector with lower occurrences)
 std::vector<int> high_occurrences;
 for (int i = 0; i < node_degree_vect.size() / 2; i++) {
 high_occurrences.push_back(node_degree_vect[i].first);
 }
 for (int i = 0; i < high_occurrences.size(); i++) {
 printf("\n%d", high_occurrences[i]);
 }
}
/*
This method can be ignored and is only added to load a graph in case you want to test it on your system.
*/
void read_coordinates_into_vector(std::vector<Coordinate>& coordinates, std::string file_path) {
 std::ifstream infile(file_path);
 std::cout << file_path;
 int vertex_source;
 int vertex_destination;
 while (infile >> vertex_source >> vertex_destination) {
 Coordinate edge;
 edge.x = vertex_source;
 edge.y = vertex_destination;
 coordinates.push_back(edge);
 }
 infile.close();
}
int main() {
 std::vector<Coordinate> coordinates;
 std::string file_path = "C:\\Users\\USERNAMEHERE\\Desktop\\coordinates_list.txt";
 read_coordinates_into_vector(coordinates,file_path);
 select_coordinates_highest_occurrence(coordinates);
 return 0;
}

I really feel like I overcomplicated this solution.

Question 2

When I run this against the small example provided in your question it prints 2 1 and not 2 0 1. Is this desired or did I break something?

Question 3

@yuri You probably broke something. It prints 2 0 1 for me too.

Question 4

@yuri hi yuri, it prints 2 0 1 for me, but the 1 might be a little bit harder to see since it is next to the 'Press any key to continue' message to exit the command line terminal (on Windows at least)

Question 5

Oh yeah indeed, there was no trailing newline and I did not notice the 0. Apologies.

Question 6

Are the node-numbers guaranteed to be consecutive, or at least a reasonable percentage of all node-ids from 0 to last guaranteed used?

Question 7

Here are some things that may help you improve your program.

Clarify your specification

In the text, you describe the desired quantity of nodes as "half the amount of coordinates" but what the code actually does is choose half the number of unique edges. There could well be a difference if, say, we had a thousand coordinate pairs that each referred to the same two edges.

Understand standard structures

There is a much simpler way to create your unordered_map that doesn't require any helper function. Because each value is initialized to zero, we can rely on this and simplify the code considerably:

std::unordered_map<int, int> node_degree;
for (const auto &coord : coordinates) {
 ++node_degree[coord.x];
 ++node_degree[coord.y];
}

Use appropriate data structures

Your use of an unordered_map is a sound idea, but I'm not so sure about the conversion to a std::vector. It seems to me that a std::priority_queue more directly matches what you're trying to accomplish. Here's the equivalent to your select_coordinates_highest_occurrence (which I find a bit wordy):

void popularEdges(const std::vector<Coordinate> & coordinates) {
 std::unordered_map<int, int> m;
 for (const auto &coord : coordinates) {
 ++m[coord.x];
 ++m[coord.y];
 }
 std::priority_queue<std::pair<int, int>> pq;
 for (const auto &node : m) {
 pq.emplace(node.second, node.first);
 }
 for (auto n=pq.size()/2; n; --n) {
 std::cout << pq.top().second << '\n';
 pq.pop();
 }
}

In my testing on my computer, this version is 25% faster than the original.

Separate I/O from processing

It may depend on your particular needs, but I'd suggest that passing back a vector would be much more generally useful than simply printing from within the function. Similarly, the read_coordinates_into_vector prints the file name back to the user. I'd expect instead that it would be silent. The caller can just as easily print that string if needed.

Make return variables return variables

The read_coordinates_into_vector currently requires the user to pass in a vector, but what would likely make more sense would be to have the code return a newly created vector instead.

Don't use spurious `typedef`

C++ is not C. The typedef here is not needed:

typedef struct Coordinate { /*...*/ };

because in C++, Coordinate is a proper type without the additional keyword.

Prefer `std::istream` to file names

The current code for read_coordinates_into_vector is only capable of handing a named file and not, say, a std::stringstream. It could be made more generic and, I think, more durable, if the function prototype were this:

std::vector<Coordinate> read_coordinates_into_vector(std::istream &in);

Use a `friend` function and custom inserter

Use a friend function and custom inserter to simplify reading coordinates from a file:

struct Coordinate {
 int x;
 int y;
 friend std::istream &operator>>(std::istream &in, Coordinate &c) {
 return in >> c.x >> c.y;
 }
};
std::vector<Coordinate> read(std::istream &in) {
 std::vector<Coordinate> v;
 Coordinate pair;
 while (in >> pair) {
 v.push_back(edge);
 }
 return v;
}

Question 8

Node: If a key doesn't exist in an std::unordered_map, it is not zero initialized, but value initialized. Which is technically not the same thing.

Question 9

I've changed the wording to simply say that it is "initialized to zero" without specifying the particular mechanism.

Question 10

I do believe typedef struct Coordinate { /*...*/ }; is an error. You dropped a second Coordinate.

Question 11

It's not exactly an error; it's just weird. The syntax in the original is legal but strange C++ and the version I show above is legal C++ and not strange. Dropping the second Coordinate is required or it would be a syntax error when Coordinate is used later.

Question 12

add_node_to_node_degree_map always does double the work needed, specifically two lookups (the second might be inserting) where one would suffice.
Easily fixed:

void add_node_to_node_degree_map(std::unordered_map<int, int>& node_degree, int node) {
 ++node_degree[node];
}

You can directly initialize your node_degree_vect from node_degree:

std::vector<std::pair<int, int>> node_degree_vect(node_degree.begin(), node_degree.end());

Anyway, I wonder why you didn't at least use a for-range-loop there.

If you only collect the result in a std::vector to immediately output it, I wonder why you didn't just output it and skipped the intermediate step.

I wonder why you explicitly close a file manually just for RAII to kick in and do that in the destructor...

Question 13

On my machine (i7 6700HQ running gcc 7.1.1 and compiled with -O3) your code runs in the following time:

./main 0.57s user 0.18s system 89% cpu 0.834 total

That's pretty good. No matter what I tried, I always got a value around 0.5s, so I can't provide anything that can speed up your code, sorry.

Compile with every warning turned on. In my case, g++ reports 2 things:

main.cpp: In function 'void select_coordinates_highest_occurrence(std::vector<Coordinate>&)':
main.cpp:48:23: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
 for (int i = 0; i < node_degree_vect.size() / 2; i++) {
 ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~
main.cpp:52:23: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
 for (int i = 0; i < high_occurrences.size(); i++) {
 ~~^~~~~~~~~~~~~~~~~~~~~~~~~

Please fix them.

I don't know if this goes against your requirement, but typedefing structs is only a thing in C. In C++, there is no difference, so just leave it out.
Mark structs that shouldn't be used as base classes final, to prevent any potential bugs about a non-virtual destructor.
My opinion: Using integers as booleans is confusing. So, I would change
```
if (!node_degree.count(node))
```
to
```
if (node_degree.count(node) == 0)
```
If you read std::unordered_map::operator[], you'll see that it inserts a default constructed element if it doesn't exist. So, add_node_to_node_degree_map can be simplified to:
```
++node_degree[node];
```
Which also means that you don't need a function for it.
Every binary arithmetic operator (let's call one %%) can be rewritten if it has the form
```
num = num %% a; // 'a' is another number
```
Using %%=:
```
num %%= a;
```
Example, node_degree[node] = node_degree[node] + 1; can become node_degree[node] += 1;, or in this case, even ++node_degree[node]; because you're only adding one.
Mark functions that do not throw as noexcept to help the compiler optimize (if you compile without exceptions, ignore this.)
Your naming is a bit too specific, and as a result, they are long, especially function names. Try to find shorter names, but not too short :)
I can only see one printf, so I'm not sure on this one:
- If you always use printf, even if you are outputting an integer, then don't, because you are using the C function, which is not guaranteed to be included. In this case, #include <cstdio> and change it to std::printf.
- If you removed it because you know that ADL will find printf anyways in std in this case, good job :) Still, you need to #include <cstdio>.

Use more range-based for loops:

for (const auto &node : node_degree)
 node_degree_vect.push_back(std::make_pair(node.first, node.second));

You are inconsistent with braces, but only in one case. The one replaced above is the only one with opening braces being on a separate line.
int is not guaranteed to cover the whole range of std::size_t, which is returned by most .size() calls. So, you'll need to use std::size_t instead of int in loops:
```
for (std::size_t i = 0; i < node_degree_vect.size() / 2; ++i) {
 high_occurrences.push_back(node_degree_vect[i].first);
}
```
Prefer pre-increment to post-increment. Even though for ints this is negligible and the compiler will optimize the temporary away, this might not be the case for other heavier types (like iterators.)

Use auto.

[](const std::pair<int, int> &left, const std::pair<int, int> &right) {
 return left.second > right.second;
}

can become

[](const auto &left, const auto &right) {
 return left.second > right.second;
}

I think you didn't write this, but if you did: std::ifstream's destructor closes the file, so you don't need to at the end.
A function returning void and taking a reference to a type is not very nice. Consider returning the variable instead of taking it by reference. In this case, you can initialize coordinates in 1 line instead of 2.
Print a trailing new line please.
select_coordinates_highest_occurrence doesn't modify coordinates. So please mark it const to prevent any accidental modifications.
You don't need return 0; for main, you can safely remove it.

Edward Edward 67.2k4 gold badges120 silver badges284 bronze badges · Accepted Answer · 2017-06-07 15:25:00Z

Here are some things that may help you improve your program.

Clarify your specification

In the text, you describe the desired quantity of nodes as "half the amount of coordinates" but what the code actually does is choose half the number of unique edges. There could well be a difference if, say, we had a thousand coordinate pairs that each referred to the same two edges.

Understand standard structures

There is a much simpler way to create your unordered_map that doesn't require any helper function. Because each value is initialized to zero, we can rely on this and simplify the code considerably:

std::unordered_map<int, int> node_degree;
for (const auto &coord : coordinates) {
 ++node_degree[coord.x];
 ++node_degree[coord.y];
}

Use appropriate data structures

Your use of an unordered_map is a sound idea, but I'm not so sure about the conversion to a std::vector. It seems to me that a std::priority_queue more directly matches what you're trying to accomplish. Here's the equivalent to your select_coordinates_highest_occurrence (which I find a bit wordy):

void popularEdges(const std::vector<Coordinate> & coordinates) {
 std::unordered_map<int, int> m;
 for (const auto &coord : coordinates) {
 ++m[coord.x];
 ++m[coord.y];
 }
 std::priority_queue<std::pair<int, int>> pq;
 for (const auto &node : m) {
 pq.emplace(node.second, node.first);
 }
 for (auto n=pq.size()/2; n; --n) {
 std::cout << pq.top().second << '\n';
 pq.pop();
 }
}

In my testing on my computer, this version is 25% faster than the original.

Separate I/O from processing

It may depend on your particular needs, but I'd suggest that passing back a vector would be much more generally useful than simply printing from within the function. Similarly, the read_coordinates_into_vector prints the file name back to the user. I'd expect instead that it would be silent. The caller can just as easily print that string if needed.

Make return variables return variables

The read_coordinates_into_vector currently requires the user to pass in a vector, but what would likely make more sense would be to have the code return a newly created vector instead.

Don't use spurious `typedef`

C++ is not C. The typedef here is not needed:

typedef struct Coordinate { /*...*/ };

because in C++, Coordinate is a proper type without the additional keyword.

Prefer `std::istream` to file names

The current code for read_coordinates_into_vector is only capable of handing a named file and not, say, a std::stringstream. It could be made more generic and, I think, more durable, if the function prototype were this:

std::vector<Coordinate> read_coordinates_into_vector(std::istream &in);

Use a `friend` function and custom inserter

Use a friend function and custom inserter to simplify reading coordinates from a file:

struct Coordinate {
 int x;
 int y;
 friend std::istream &operator>>(std::istream &in, Coordinate &c) {
 return in >> c.x >> c.y;
 }
};
std::vector<Coordinate> read(std::istream &in) {
 std::vector<Coordinate> v;
 Coordinate pair;
 while (in >> pair) {
 v.push_back(edge);
 }
 return v;
}

Node: If a key doesn't exist in an std::unordered_map, it is not zero initialized, but value initialized. Which is technically not the same thing.
I've changed the wording to simply say that it is "initialized to zero" without specifying the particular mechanism.
I do believe typedef struct Coordinate { /*...*/ }; is an error. You dropped a second Coordinate.
It's not exactly an error; it's just weird. The syntax in the original is legal but strange C++ and the version I show above is legal C++ and not strange. Dropping the second Coordinate is required or it would be a syntax error when Coordinate is used later.

Stack Exchange Network

Find the most-connected nodes in a graph

Problem and goal

Input file

Desired output

Constraints

My current approach and code

3 Answers 3

Clarify your specification

Understand standard structures

Use appropriate data structures

Separate I/O from processing

Make return variables return variables

Don't use spurious `typedef`

Prefer `std::istream` to file names

Use a `friend` function and custom inserter

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Find the most-connected nodes in a graph

Problem and goal

Input file

Desired output

Constraints

My current approach and code

3 Answers 3

Clarify your specification

Understand standard structures

Use appropriate data structures

Separate I/O from processing

Make return variables return variables

Don't use spurious typedef

Prefer std::istream to file names

Use a friend function and custom inserter

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

Don't use spurious `typedef`

Prefer `std::istream` to file names

Use a `friend` function and custom inserter