I feel like this is a simple problem that I have overcomplicated. Any tips or different approach would be greatly appreciated.
Problem and goal
Given a list of coordinates (each line represents an 'edge' containing 'nodes') my goal is to obtain (half of the amount of the total) nodes that occur the most. I hope my example would clarify this problem:
Input file
Each line represents a 'coordinate' (in reality an 'Edge'), that consists of 2 nodes (integers).
20 1
7 15
7 1
7 20
7 4
7 19
Desired output
Preferably a vector that contains: [7,20,1]
Why 7,20,1? Because 7, 20 and 1 occur the most if you look at the input file. This is also half (3) of the total amount (6) of coordinates. Note: the output vector doesn't have to be sorted.
My goal is to increase the performance of my current approach (any other approach however is very appreciated (but please check my constraints in the next section). The current performance seems to be slow for 1 million edges (lines). An example of such a large coordinates file can be obtained from here (8 MB file).
Constraints
- The input file is being read and the results are added as a vector of struct with type
Coordinate
(as my code will show). Please note that I can not change this (due to constraints that are beyond the scope of my question). In other words, the input parameter to my function is avector<Coordinate>
. - I would like to obtain an array/vector as an output of the highest occurrences.
My current approach and code
Essentially, I am 'looping' through my coordinates in the select_coordinates_highest_occurrence
function and add each coordinate (along with its first and second element) to a map (function add_node_to_node_degree_map
) that keeps track of the amount of the occurrences. If I do a simple benchmark, this seems to also be the main bottleneck of the code.
I then proceed by converting the map to a vector, since I am unable to sort the vector by the value.
After I have done that, I sort the vector (ascending) and 'cut' it in half so that I obtain the highest amount of coordinates.
#include <iostream>
#include <sstream>
#include <vector>
#include <fstream>
#include <string>
#include <unordered_map>
#include <algorithm>
typedef struct Coordinate {
int x;
int y;
} Coordinate;
void add_node_to_node_degree_map(std::unordered_map<int, int>& node_degree, int node) {
if (!node_degree.count(node)) {
node_degree[node] = 1;
}
else {
node_degree[node] = node_degree[node] + 1;
}
}
void select_coordinates_highest_occurrence(std::vector<Coordinate>& coordinates) {
// Map all vertices onto a map along with their degree
std::unordered_map<int, int> node_degree;
for (auto &edge : coordinates) {
add_node_to_node_degree_map(node_degree, edge.x);
add_node_to_node_degree_map(node_degree, edge.y);
}
// Convert the map to a vector
std::vector<std::pair<int, int>> node_degree_vect;
for (auto it = node_degree.cbegin(); it != node_degree.cend(); ++it)
{
node_degree_vect.push_back(std::make_pair(it->first, it->second));
}
// Sort the vector (ascending, high degree nodes are on top)
std::sort(node_degree_vect.begin(), node_degree_vect.end(), [](const std::pair<int, int> &left, const std::pair<int, int> &right) {
return left.second > right.second;
});
// Only collect the high occurrences (delete the 'other' half of the vector with lower occurrences)
std::vector<int> high_occurrences;
for (int i = 0; i < node_degree_vect.size() / 2; i++) {
high_occurrences.push_back(node_degree_vect[i].first);
}
for (int i = 0; i < high_occurrences.size(); i++) {
printf("\n%d", high_occurrences[i]);
}
}
/*
This method can be ignored and is only added to load a graph in case you want to test it on your system.
*/
void read_coordinates_into_vector(std::vector<Coordinate>& coordinates, std::string file_path) {
std::ifstream infile(file_path);
std::cout << file_path;
int vertex_source;
int vertex_destination;
while (infile >> vertex_source >> vertex_destination) {
Coordinate edge;
edge.x = vertex_source;
edge.y = vertex_destination;
coordinates.push_back(edge);
}
infile.close();
}
int main() {
std::vector<Coordinate> coordinates;
std::string file_path = "C:\\Users\\USERNAMEHERE\\Desktop\\coordinates_list.txt";
read_coordinates_into_vector(coordinates,file_path);
select_coordinates_highest_occurrence(coordinates);
return 0;
}
I really feel like I overcomplicated this solution.
3 Answers 3
Here are some things that may help you improve your program.
Clarify your specification
In the text, you describe the desired quantity of nodes as "half the amount of coordinates" but what the code actually does is choose half the number of unique edges. There could well be a difference if, say, we had a thousand coordinate pairs that each referred to the same two edges.
Understand standard structures
There is a much simpler way to create your unordered_map
that doesn't require any helper function. Because each value is initialized to zero, we can rely on this and simplify the code considerably:
std::unordered_map<int, int> node_degree;
for (const auto &coord : coordinates) {
++node_degree[coord.x];
++node_degree[coord.y];
}
Use appropriate data structures
Your use of an unordered_map
is a sound idea, but I'm not so sure about the conversion to a std::vector
. It seems to me that a std::priority_queue
more directly matches what you're trying to accomplish. Here's the equivalent to your select_coordinates_highest_occurrence
(which I find a bit wordy):
void popularEdges(const std::vector<Coordinate> & coordinates) {
std::unordered_map<int, int> m;
for (const auto &coord : coordinates) {
++m[coord.x];
++m[coord.y];
}
std::priority_queue<std::pair<int, int>> pq;
for (const auto &node : m) {
pq.emplace(node.second, node.first);
}
for (auto n=pq.size()/2; n; --n) {
std::cout << pq.top().second << '\n';
pq.pop();
}
}
In my testing on my computer, this version is 25% faster than the original.
Separate I/O from processing
It may depend on your particular needs, but I'd suggest that passing back a vector would be much more generally useful than simply printing from within the function. Similarly, the read_coordinates_into_vector
prints the file name back to the user. I'd expect instead that it would be silent. The caller can just as easily print that string if needed.
Make return variables return variables
The read_coordinates_into_vector
currently requires the user to pass in a vector, but what would likely make more sense would be to have the code return a newly created vector instead.
Don't use spurious typedef
C++ is not C. The typedef
here is not needed:
typedef struct Coordinate { /*...*/ };
because in C++, Coordinate
is a proper type without the additional keyword.
Prefer std::istream
to file names
The current code for read_coordinates_into_vector
is only capable of handing a named file and not, say, a std::stringstream
. It could be made more generic and, I think, more durable, if the function prototype were this:
std::vector<Coordinate> read_coordinates_into_vector(std::istream &in);
Use a friend
function and custom inserter
Use a friend function and custom inserter to simplify reading coordinates from a file:
struct Coordinate {
int x;
int y;
friend std::istream &operator>>(std::istream &in, Coordinate &c) {
return in >> c.x >> c.y;
}
};
std::vector<Coordinate> read(std::istream &in) {
std::vector<Coordinate> v;
Coordinate pair;
while (in >> pair) {
v.push_back(edge);
}
return v;
}
-
1\$\begingroup\$ Node: If a key doesn't exist in an
std::unordered_map
, it is not zero initialized, but value initialized. Which is technically not the same thing. \$\endgroup\$Rakete1111– Rakete11112017年06月07日 15:32:48 +00:00Commented Jun 7, 2017 at 15:32 -
\$\begingroup\$ I've changed the wording to simply say that it is "initialized to zero" without specifying the particular mechanism. \$\endgroup\$Edward– Edward2017年06月07日 15:44:21 +00:00Commented Jun 7, 2017 at 15:44
-
\$\begingroup\$ I do believe
typedef struct Coordinate { /*...*/ };
is an error. You dropped a secondCoordinate
. \$\endgroup\$Deduplicator– Deduplicator2017年06月07日 21:47:16 +00:00Commented Jun 7, 2017 at 21:47 -
\$\begingroup\$ It's not exactly an error; it's just weird. The syntax in the original is legal but strange C++ and the version I show above is legal C++ and not strange. Dropping the second
Coordinate
is required or it would be a syntax error whenCoordinate
is used later. \$\endgroup\$Edward– Edward2017年06月07日 21:50:45 +00:00Commented Jun 7, 2017 at 21:50
add_node_to_node_degree_map
always does double the work needed, specifically two lookups (the second might be inserting) where one would suffice.
Easily fixed:
void add_node_to_node_degree_map(std::unordered_map<int, int>& node_degree, int node) {
++node_degree[node];
}
You can directly initialize your node_degree_vect
from node_degree
:
std::vector<std::pair<int, int>> node_degree_vect(node_degree.begin(), node_degree.end());
Anyway, I wonder why you didn't at least use a for-range-loop there.
If you only collect the result in a std::vector
to immediately output it, I wonder why you didn't just output it and skipped the intermediate step.
I wonder why you explicitly close a file manually just for RAII to kick in and do that in the destructor...
On my machine (i7 6700HQ running gcc 7.1.1
and compiled with -O3
) your code runs in the following time:
./main 0.57s user 0.18s system 89% cpu 0.834 total
That's pretty good. No matter what I tried, I always got a value around 0.5s
, so I can't provide anything that can speed up your code, sorry.
Compile with every warning turned on. In my case,
g++
reports 2 things:main.cpp: In function 'void select_coordinates_highest_occurrence(std::vector<Coordinate>&)': main.cpp:48:23: warning: comparison between signed and unsigned integer expressions [-Wsign-compare] for (int i = 0; i < node_degree_vect.size() / 2; i++) { ~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ main.cpp:52:23: warning: comparison between signed and unsigned integer expressions [-Wsign-compare] for (int i = 0; i < high_occurrences.size(); i++) { ~~^~~~~~~~~~~~~~~~~~~~~~~~~
Please fix them.
I don't know if this goes against your requirement, but
typedef
ing structs is only a thing in C. In C++, there is no difference, so just leave it out.Mark structs that shouldn't be used as base classes
final
, to prevent any potential bugs about a non-virtual destructor.My opinion: Using integers as booleans is confusing. So, I would change
if (!node_degree.count(node))
to
if (node_degree.count(node) == 0)
If you read
std::unordered_map::operator[]
, you'll see that it inserts a default constructed element if it doesn't exist. So,add_node_to_node_degree_map
can be simplified to:++node_degree[node];
Which also means that you don't need a function for it.
Every binary arithmetic operator (let's call one
%%
) can be rewritten if it has the formnum = num %% a; // 'a' is another number
Using
%%=
:num %%= a;
Example,
node_degree[node] = node_degree[node] + 1;
can becomenode_degree[node] += 1;
, or in this case, even++node_degree[node];
because you're only adding one.Mark functions that do not throw as
noexcept
to help the compiler optimize (if you compile without exceptions, ignore this.)Your naming is a bit too specific, and as a result, they are long, especially function names. Try to find shorter names, but not too short :)
I can only see one
printf
, so I'm not sure on this one:If you always use
printf
, even if you are outputting an integer, then don't, because you are using the C function, which is not guaranteed to be included. In this case,#include <cstdio>
and change it tostd::printf
.If you removed it because you know that ADL will find
printf
anyways instd
in this case, good job :) Still, you need to#include <cstdio>
.
Use more range-based for loops:
for (const auto &node : node_degree) node_degree_vect.push_back(std::make_pair(node.first, node.second));
You are inconsistent with braces, but only in one case. The one replaced above is the only one with opening braces being on a separate line.
int
is not guaranteed to cover the whole range ofstd::size_t
, which is returned by most.size()
calls. So, you'll need to usestd::size_t
instead ofint
in loops:for (std::size_t i = 0; i < node_degree_vect.size() / 2; ++i) { high_occurrences.push_back(node_degree_vect[i].first); }
Prefer pre-increment to post-increment. Even though for
int
s this is negligible and the compiler will optimize the temporary away, this might not be the case for other heavier types (like iterators.)Use
auto
.[](const std::pair<int, int> &left, const std::pair<int, int> &right) { return left.second > right.second; }
can become
[](const auto &left, const auto &right) { return left.second > right.second; }
I think you didn't write this, but if you did:
std::ifstream
's destructor closes the file, so you don't need to at the end.A function returning
void
and taking a reference to a type is not very nice. Consider returning the variable instead of taking it by reference. In this case, you can initializecoordinates
in 1 line instead of 2.Print a trailing new line please.
select_coordinates_highest_occurrence
doesn't modifycoordinates
. So please mark itconst
to prevent any accidental modifications.You don't need
return 0;
formain
, you can safely remove it.
2 1
and not2 0 1
. Is this desired or did I break something? \$\endgroup\$2 0 1
for me too. \$\endgroup\$2 0 1
for me, but the 1 might be a little bit harder to see since it is next to the 'Press any key to continue' message to exit the command line terminal (on Windows at least) \$\endgroup\$0
. Apologies. \$\endgroup\$