I am solving this problem that requires returning the most frequent K elements, and even though I have a solution, I am wondering if there's an improvement that can be made.
It runs O(N) to store the frequency of each element in a std::map. The second loop runs another O(N) over std::map to populate the std::priority_queue, followed by removing the top K elements from the std::priority_queue.
Ideally, I'd have a container that orders the elements based on the frequency (not sure how I can get std::map to do so, since it orders based on the keys). I just feel there's unnecessary work being done, especially if all one needs is the top 1 element but the array itself is massive: you end up adding all those elements that won't serve any purpose in determining the answer.
vector<int> TopKFreqElem(vector<int> nums, int k)
{
    std::map<int, int> counter;
    std::vector<int> freqElems;

    for (auto& elem : nums)
    {
        counter[elem]++;
    }

    std::priority_queue<pair<int,int>> pq;
    for (auto e : counter)
    {
        pq.push({e.second, e.first});
    }

    while (!pq.empty())
    {
        auto [freq, value] = pq.top();
        freqElems.push_back(value);
        pq.pop();
        if (!--k) break;
    }
    return freqElems;
}
Algorithmic complexity
It runs O(N) to store the frequency of each element in a std::map.
No, it runs in \$O(N \log N)\$ time to populate the map of frequencies, since lookups into std::map are \$O(\log N)\$.
The second loop runs another O(N) over std::map to populate the std::priority_queue
Here also you forgot to account for the complexity of pushing into std::priority_queue, which is \$O(\log N)\$ for each push (amortized, since the underlying std::vector might have to resize from time to time).
followed by removing the top K elements from the std::priority_queue.
Which is \$O(K \log N)\$, since each pop from std::priority_queue is \$O(\log N)\$, and inserting into the vector of results is \$O(1)\$ (amortized).
Together your whole function runs in \$O(N \log N)\$ time.
Improving performance
The first thing I would do is change the std::map to a std::unordered_map, as the latter one has \$O(1)\$ insertions and lookups. That will speed up the frequency counting at least.
I just feel there's unnecessary work being done, especially if all one needs is the top 1 element
Indeed, if you only need the top 1 element you would not bother with the priority queue at all. Instead, while counting, you could keep track of which counter is the largest.
If \$K\$ is small, you could think of keeping track of just the \$K\$ most frequent elements while counting. That's similar to what is described in this StackOverflow post. But Toby's second answer shows another way to do it.
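For illustration, here is a minimal sketch of the \$K = 1\$ case, tracking the running maximum while counting (the function name MostFreqElem and the use of std::unordered_map are my own, not part of the original code):

#include <cstddef>
#include <optional>
#include <unordered_map>
#include <vector>

// Returns the most frequent element, or an empty optional for an empty input.
std::optional<int> MostFreqElem(const std::vector<int>& nums)
{
    std::unordered_map<int, std::size_t> counts;
    std::optional<int> best;
    std::size_t best_count = 0;
    for (int num : nums) {
        const std::size_t c = ++counts[num];
        if (c > best_count) {  // keep track of the largest counter as we go
            best_count = c;
            best = num;
        }
    }
    return best;
}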
Ideally, I'd have a container that orders the elements based on the frequency (not sure how I can get std::map to do so, since it orders based on the keys).
You can do that, but every time the frequency of an element is updated, you'd have to remove that element from the map and re-insert it to maintain the correct order. That's an \$O(\log N)\$ operation each time you increase a counter, so it's not going to be faster than the algorithm mentioned above.
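To make the trade-off concrete, here is a rough sketch of that erase-and-reinsert idea, keeping a std::set ordered by (count, value) next to the count map (the name TopKByReordering is purely illustrative):

#include <cstddef>
#include <functional>
#include <map>
#include <set>
#include <utility>
#include <vector>

std::vector<int> TopKByReordering(const std::vector<int>& nums, std::size_t k)
{
    std::map<int, std::size_t> counts;
    // Ordered by (count, value), largest count first.
    std::set<std::pair<std::size_t, int>, std::greater<>> by_count;

    for (int num : nums) {
        std::size_t& c = counts[num];
        if (c > 0) {
            by_count.erase({c, num});  // O(log N): remove the stale entry
        }
        ++c;
        by_count.insert({c, num});     // O(log N): re-insert with the new count
    }

    std::vector<int> result;
    for (const auto& [count, value] : by_count) {
        if (result.size() == k) { break; }
        result.push_back(value);
    }
    return result;
}

As noted above, each increment now pays for both an erase and an insert, so this does not beat the count-then-sort approach.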
- A container that reorders itself would just have to swap one element with another when its count is updated enough to advance its rank. But you'd somehow need to find the element by value, which is going to have its own overhead. It might be possible if we maintain a map of references, and update that map (in O(log N)) whenever we have to swap elements. – Toby Speight, Dec 14, 2022 at 10:40
- @TobySpeight True, I did not consider that incrementing by one is kind of a special case that can be implemented more optimally in a priority queue, but you still have the lookup. If updating that map of references is \$O(\log N)\$ it doesn't buy you anything complexity-wise, but maybe it does improve the constant factor? – G. Sliepen, Dec 14, 2022 at 10:47
- That O(N log K) algorithm is what's implemented in this sample implementation of std::partial_sort_copy. (The N on that page is our K; L1 is our N.) – Toby Speight, Dec 14, 2022 at 11:56
- I worked out how to reorder the container without losing track of its elements (using std::list), after a long battle with reverse iterators. I posted that as an additional answer for anyone interested. – Toby Speight, Dec 14, 2022 at 17:37
- 1) "It runs O(N) to store the frequency of each element in a std::map." Sorry, I misworded. I meant O(N) to run over the vector as we append to std::map. 2) "Which is O(K log N), since each pop from std::priority_queue is O(log N)." When could it be O(K log N) vs O(K log N)? 3) Regarding the SO link, I am able to understand a bit of what's going on but not the whole picture: is max referring to the topmost element in the heap here or just a local variable? Sample code would help a lot. – xyf, Dec 14, 2022 at 18:51
You're missing some necessary headers:
#include <map>
#include <queue>
#include <utility>
#include <vector>
using std::pair;
using std::vector;
I recommend ditching those using statements and just writing those types in full (not only does that make the code shorter overall, since the two using lines go away, it's also much clearer).
The function accepts nums by value, but I don't think we need to copy this vector:
std::vector<int> TopKFreqElem(const std::vector<int>& nums, int k)
It doesn't make sense for k to be negative. It's probably best to make it consistent with the size of a vector (i.e. std::size_t).
We could make the function more general, as a template that accepts any range type and writes to an output iterator:
template<std::ranges::range R, std::output_iterator<typename R::value_type> It>
void TopKFreqElem(const R& nums, std::size_t k, It out)
Counting using std::map is a good technique, but we should use std::size_t for the counts.
The implementation using std::priority_queue is okay, except that we should test k at the beginning of the loop, in case 0 is passed in: while (k-- && !pq.empty()).
One simple performance improvement would be to reserve() space for k elements in the vector before we start inserting.
However, the standard library gives us std::partial_sort(), which we can use to get the k lowest (or, with a suitable comparator, the k highest) elements. Even better for our purposes is std::partial_sort_copy(), which doesn't require a copy of the whole counter.
Modified code
#include <algorithm>
#include <map>
#include <iterator>
#include <ranges>
#include <utility>
#include <vector>
template<std::ranges::range R, std::output_iterator<typename R::value_type> It>
void copy_most_frequent(const R& nums, It out, std::size_t k)
{
    using Counter = std::map<typename R::value_type, std::size_t>;
    Counter counter;
    for (auto& elem : nums) {
        ++counter[elem];
    }

    using result_type = std::pair<typename R::value_type, std::size_t>;
    std::vector<result_type> result(k);
    std::ranges::partial_sort_copy(counter, result, std::ranges::greater{},
                                   &Counter::value_type::second, &result_type::second);

    std::ranges::transform(result, out, &result_type::first);
}
And a small test program:
#include <array>
#include <iostream>
#include <string>
#include <string_view>
int main()
{
    std::vector v{1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4};
    copy_most_frequent(v, std::ostream_iterator<int>(std::cout, "\n"), 2);

    const std::array<std::string, 5> a{"one", "two", "two", "three", "four"};
    copy_most_frequent(a, std::ostream_iterator<std::string_view>(std::cout, "\n"), 1);
}
Algorithmic complexity is still O(N log N), but we have simpler implementation and greater flexibility.
For value types that can be hashed, we could improve the count phase using std::unordered_map (adding <unordered_map> and <type_traits> to the includes):
template<typename T>
concept is_hashable = requires { std::hash<T>{}; };
using value_type = typename R::value_type;
using Counter =
std::conditional_t<is_hashable<value_type>,
std::unordered_map<value_type, std::size_t>,
std::map<value_type, std::size_t>>;
Counter counter;
That reduces the count phase to O(N), so the whole function is now dominated by the sort, which is O(N log K).
General points
- Use a reference std::vector<int>& nums rather than copy all the values (std::vector<int> nums). Best to make nums a constant as well.
- Argument k can be a constant. (See rewrite replacing k with size_t n.)
- Argument k represents a size / count and should not be an int, as a negative count makes no sense in this context. Best to use the type std::size_t. Same applies to all the various counters you have.
- It's a bad habit to not delimit code blocks. E.g. if (!--k) break; is better as if (!--k) { break; }
- I am not a fan of auto; others will insist it's better than sliced bread. My opinion is that it obscures type and is bad for code readability. Don't use auto.
- Why use a reference in for (auto& elem : nums) {? It could be as simple as for (int num : nums). Also the name elem is a little odd; maybe num would be better?
- You can use the simpler unordered_map to count instances of each num.
- The name counter is singular? Maybe it is better named counters?
- Populating the final result (the while loop at the end) can be simplified. Though the compiler will optimise, there is no point letting code that never makes it into the running program get in the way of readability. (See rewrite for a simpler version.)
Rewrite
- Removes use of auto. (Up to you.)
- Rearranges declarations.
- Uses size_t for all counting.
- Some renaming.
- Open blocks on the same line (keeping up with modern C-like language layouts).
#include <algorithm>
#include <queue>
#include <unordered_map>
#include <utility>
#include <vector>

std::vector<int> TopKFreqElem(const std::vector<int>& nums, const std::size_t k) {
    std::unordered_map<int, std::size_t> counts;
    for (int num : nums) { counts[num]++; }

    std::priority_queue<std::pair<std::size_t, int>> frequent;
    for (const std::pair<const int, std::size_t>& p : counts) {
        frequent.push({p.second, p.first});
    }

    std::size_t n = std::min(frequent.size(), k);
    std::vector<int> topKNums;
    while (n--) {
        topKNums.push_back(frequent.top().second);
        frequent.pop();
    }
    return topKNums;
}
- It's probably worth topKNums.reserve(n) in there before you start appending elements. – Toby Speight, Dec 14, 2022 at 21:59
Since you said
Ideally, I'd have a container that orders the elements based on the frequency
It is actually possible to do that, and it might be useful if you want to track the most frequent values as each element is added. The trick is to keep the counts in a list, and maintain the order of the list so that it's in descending order of count.
To find the list elements, we use a map (or unordered map) from value to an iterator into the list - this is safe because our list operations will not invalidate the iterators. It's cheap - O(1) - to reorder nodes within a list using splice(), so we have a container where adding elements is O(1) if they are hashable, else O(log N) if sortable.
The implementation looks like this (just implementing the two methods needed here - adding clear(), remove() etc. is left as an exercise for the interested reader):
#include <algorithm>
#include <list>
#include <map>
#include <iterator>
#include <ranges>
#include <unordered_map>
#include <vector>
template<typename T>
concept is_hashable = requires { std::hash<T>{}; };
template<typename T>
class frequency_ordered_multiset
{
    struct item { const T val; std::size_t count; };

    // We use a list because its iterators stay valid after modifications.
    // The list is always in descending order of count.
    std::list<item> items = {};

    using items_iter = std::list<item>::iterator;

    std::conditional_t<is_hashable<T>,
                       std::unordered_map<T, items_iter>,
                       std::map<T, items_iter>>
        by_value = {};

public:
    void insert(const T& val)
    {
        auto map_it = by_value.find(val);
        if (map_it == by_value.end()) {
            // first appearance - put it at end of list
            items.emplace_back(val, 1);
            auto list_it = items.end();
            by_value.emplace(val, --list_it);
            return;
        }

        // update existing item
        auto const list_it = map_it->second;
        auto const new_count = ++list_it->count;

        // search backwards for higher or equal count
        auto const other_it =
            std::ranges::lower_bound(std::reverse_iterator{list_it}, items.rend(),
                                     new_count, std::less{}, &item::count).base();
        if (other_it != list_it) {
            // move the item (without invalidating iterators)
            items.splice(other_it, items, list_it);
        }
    }

    auto most_frequent(std::size_t k)
    {
        return items
             | std::views::take(k)
             | std::views::transform(&item::val);
    }
};
#include <array>
#include <iostream>
int main()
{
    static const std::array v{1, 1, 3, 3, 2, 2, 2, 3, 3, 3, 3, 4, 4};
    frequency_ordered_multiset<int> fom;
    for (auto i: v) {
        fom.insert(i);
        std::ranges::copy(fom.most_frequent(2),
                          std::ostream_iterator<int>(std::cout, ", "));
        std::cout << '\n';
    }
}
- I posted this code for review as Container that orders its elements based on their frequency. – Toby Speight, Dec 19, 2022 at 16:47
Consider simplicity:
- Get a copy of the range in a vector.
- Sort it.
- For each chunk of equivalent elements, push a (count, element) pair onto a limited-size heap.
- Return the elements in the heap.
The big advantages are nearly eliminated allocation-overhead, and much improved locality of reference.
Though if there are few unique elements which are oft-repeated, using a hashmap probably wins.
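A rough sketch of that sort-then-scan approach, keeping the heap capped at k entries (the name TopKBySorting is purely illustrative):

#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

std::vector<int> TopKBySorting(std::vector<int> nums, std::size_t k)  // deliberate copy of the range
{
    std::sort(nums.begin(), nums.end());

    // Min-heap of (count, value) pairs, never holding more than k entries.
    using entry = std::pair<std::size_t, int>;
    std::priority_queue<entry, std::vector<entry>, std::greater<>> heap;

    for (std::size_t i = 0; i < nums.size();) {
        std::size_t j = i;
        while (j < nums.size() && nums[j] == nums[i]) { ++j; }  // one chunk of equal elements
        heap.push({j - i, nums[i]});
        if (heap.size() > k) { heap.pop(); }  // drop the least frequent seen so far
        i = j;
    }

    std::vector<int> result;
    result.reserve(heap.size());
    while (!heap.empty()) {
        result.push_back(heap.top().second);
        heap.pop();
    }
    return result;  // least frequent of the top K first; reverse if needed
}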