I am solving this problem that requires returning the most frequent K elements, and even though I have a solution, I am wondering if there's an improvement that can be made.
It runs O(N) to store the frequency of each element in a std::map. The second loop runs another O(N) over std::map to populate the std::priority_queue, followed by removing the top K elements from the std::priority_queue.
Ideally, I'd have a container that orders the elements based on the frequency (not sure how I can get std::map to do so, since it orders based on the keys). I just feel there's unnecessary work being done, especially if all one needs is the top 1 element but the array itself is massive: you end up adding all those elements that won't serve any purpose in determining the answer.
vector<int> TopKFreqElem(vector<int> nums, int k)
{
    std::map<int, int> counter;
    std::vector<int> freqElems;

    for (auto& elem : nums)
    {
        counter[elem]++;
    }

    std::priority_queue<pair<int,int>> pq;
    for (auto e : counter)
    {
        pq.push({e.second, e.first});
    }

    while (!pq.empty())
    {
        auto [freq, value] = pq.top();
        freqElems.push_back(value);
        pq.pop();
        if (!--k) break;
    }
    return freqElems;
}
Algorithmic complexity
It runs O(N) to store the frequency of each element in a std::map.
No, it runs in \$O(N \log N)\$ time to populate the map of frequencies, since lookups into std::map are \$O(\log N)\$.
The second loop runs another O(N) over std::map to populate the std::priority_queue
Here also you forgot to account for the complexity of pushing into std::priority_queue, which is \$O(\log N)\$ for each push (amortized, since the underlying std::vector might have to resize from time to time).
followed by removing the top K elements from the std::priority_queue.
Which is \$O(K \log N)\$, since each pop from std::priority_queue is \$O(\log N)\$, and inserting into the vector of results is \$O(1)\$ (amortized).
Together your whole function runs in \$O(N \log N)\$ time.
Improving performance
The first thing I would do is change the std::map to a std::unordered_map, as the latter one has \$O(1)\$ insertions and lookups. That will speed up the frequency counting at least.
I just feel there's unnecessary work being done, especially if all one needs is the top 1 element
Indeed, if you only need the top 1 element you would not bother with the priority queue at all. Instead, while counting, you could keep track of which counter is the largest.
If \$K\$ is small, you could think of keeping track of just the \$K\$ most frequent elements while counting. That's similar to what is described in this StackOverflow post. But Toby's second answer shows another way to do it.
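For illustration, here is a minimal sketch of the \$K = 1\$ case, tracking the running maximum while counting (the function name MostFreqElem and the use of std::unordered_map are my own, not part of the original code):

#include <cstddef>
#include <optional>
#include <unordered_map>
#include <vector>

// Returns the most frequent element, or an empty optional for an empty input.
std::optional<int> MostFreqElem(const std::vector<int>& nums)
{
    std::unordered_map<int, std::size_t> counts;
    std::optional<int> best;
    std::size_t best_count = 0;
    for (int num : nums) {
        const std::size_t c = ++counts[num];
        if (c > best_count) {  // keep track of the largest counter as we go
            best_count = c;
            best = num;
        }
    }
    return best;
}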
Ideally, I'd have a container that orders the elements based on the frequency (not sure how I can get std::map to do so, since it orders based on the keys).
You can do that, but every time the frequency of an element is updated, you'd have to remove that element from the map and re-insert it to maintain the correct order. That's an \$O(\log N)\$ operation each time you increase a counter, so it's not going to be faster than the algorithm mentioned above.
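To make the trade-off concrete, here is a rough sketch of that erase-and-reinsert idea, keeping a std::set ordered by (count, value) next to the count map (the name TopKByReordering is purely illustrative):

#include <cstddef>
#include <functional>
#include <map>
#include <set>
#include <utility>
#include <vector>

std::vector<int> TopKByReordering(const std::vector<int>& nums, std::size_t k)
{
    std::map<int, std::size_t> counts;
    // Ordered by (count, value), largest count first.
    std::set<std::pair<std::size_t, int>, std::greater<>> by_count;

    for (int num : nums) {
        std::size_t& c = counts[num];
        if (c > 0) {
            by_count.erase({c, num});  // O(log N): remove the stale entry
        }
        ++c;
        by_count.insert({c, num});     // O(log N): re-insert with the new count
    }

    std::vector<int> result;
    for (const auto& [count, value] : by_count) {
        if (result.size() == k) { break; }
        result.push_back(value);
    }
    return result;
}

As noted above, each increment now pays for both an erase and an insert, so this does not beat the count-then-sort approach.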
- A container that reorders itself would just have to swap one element with another when its count is updated enough to advance its rank. But you'd somehow need to find the element by value, which is going to have its own overhead. It might be possible if we maintain a map of references, and update that map (in O(log N)) whenever we have to swap elements. – Toby Speight, Dec 14, 2022 at 10:40
- @TobySpeight True, I did not consider that incrementing by one is kind of a special case that can be implemented more optimally in a priority queue, but you still have the lookup. If updating that map of references is \$O(\log N)\$ it doesn't buy you anything complexity-wise, but maybe it does improve the constant factor? – G. Sliepen, Dec 14, 2022 at 10:47
- That O(N log K) algorithm is what's implemented in this sample implementation of std::partial_sort_copy. (The N on that page is our K; L1 is our N.) – Toby Speight, Dec 14, 2022 at 11:56
- I worked out how to reorder the container without losing track of its elements (using std::list), after a long battle with reverse iterators. I posted that as an additional answer for anyone interested. – Toby Speight, Dec 14, 2022 at 17:37
- 1) "It runs O(N) to store the frequency of each element in a std::map." Sorry, I misworded. I meant O(N) to run over the vector as we append to std::map. 2) "Which is O(K log N), since each pop from std::priority_queue is O(log N)." When could it be O(K log N) vs O(K log N)? 3) Regarding the SO link, I am able to understand a bit of what's going on but not the whole picture: is max referring to the topmost element in the heap here or just a local variable? Sample code would help a lot. – xyf, Dec 14, 2022 at 18:51
You're missing some necessary headers:
#include <map>
#include <queue>
#include <utility>
#include <vector>
using std::pair;
using std::vector;
I recommend ditching those using statements and just writing those types in full (not only does that make the code shorter overall, since the two using lines go away, it's also much clearer).
The function accepts nums by value, but I don't think we need to copy this vector:
std::vector<int> TopKFreqElem(const std::vector<int>& nums, int k)
It doesn't make sense for k to be negative. It's probably best to make it consistent with the size of a vector (i.e. std::size_t).
We could make the function more general, as a template that accepts any range type and writes to an output iterator:
template<std::ranges::range R, std::output_iterator<typename R::value_type> It>
void TopKFreqElem(const R& nums, std::size_t k, It out)
Counting using std::map is a good technique, but we should use std::size_t for the counts.
The implementation using std::priority_queue is okay, except that we should test k at the beginning of the loop, in case 0 is passed in: while (k-- && !pq.empty()).
One simple performance improvement would be to reserve() space for k elements in the vector before we start inserting.
However, the standard library gives us std::partial_sort(), which we can use to get the k lowest (or, with a suitable comparator, the k highest) elements. Even better for our purposes is std::partial_sort_copy(), which doesn't require a copy of the whole counter.
Modified code
#include <algorithm>
#include <map>
#include <iterator>
#include <ranges>
#include <utility>
#include <vector>
template<std::ranges::range R, std::output_iterator<typename R::value_type> It>
void copy_most_frequent(const R& nums, It out, std::size_t k)
{
    using Counter = std::map<typename R::value_type, std::size_t>;
    Counter counter;
    for (auto& elem : nums) {
        ++counter[elem];
    }

    using result_type = std::pair<typename R::value_type, std::size_t>;
    std::vector<result_type> result(k);
    std::ranges::partial_sort_copy(counter, result, std::ranges::greater{},
                                   &Counter::value_type::second, &result_type::second);

    std::ranges::transform(result, out, &result_type::first);
}
And a small test program:
#include <array>
#include <iostream>
#include <string>
#include <string_view>
int main()
{
    std::vector v{1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4};
    copy_most_frequent(v, std::ostream_iterator<int>(std::cout, "\n"), 2);

    const std::array<std::string, 5> a{"one", "two", "two", "three", "four"};
    copy_most_frequent(a, std::ostream_iterator<std::string_view>(std::cout, "\n"), 1);
}
Algorithmic complexity is still O(N log N), but we have simpler implementation and greater flexibility.
For value types that can be hashed, we could improve the count phase using std::unordered_map (adding <unordered_map> and <type_traits> to the includes):
template<typename T>
concept is_hashable = requires { std::hash<T>{}; };
using value_type = typename R::value_type;
using Counter =
std::conditional_t<is_hashable<value_type>,
std::unordered_map<value_type, std::size_t>,
std::map<value_type, std::size_t>>;
Counter counter;
That reduces the count phase to O(N), so the whole function is now dominated by the sort, which is O(N log K).
General points
- Use a reference std::vector<int>& nums rather than copy all the values (std::vector<int> nums). Best to make nums a constant as well.
- Argument k can be a constant. (See rewrite replacing k with size_t n.)
- Argument k represents a size / count and should not be an int, as a negative count makes no sense in this context. Best to use the type std::size_t. Same applies to all the various counters you have.
- It's a bad habit to not delimit code blocks. E.g. if (!--k) break; is better as if (!--k) { break; }
- I am not a fan of auto; others will insist it's better than sliced bread. My opinion is that it obscures type and is bad for code readability. Don't use auto.
- Why use a reference in for (auto& elem : nums) {? It could be as simple as for (int num : nums). Also the name elem is a little odd; maybe num would be better?
- You can use the simpler unordered_map to count instances of each num.
- The name counter is singular? Maybe it is better named counters?
- Populating the final result (the while loop at the end) can be simplified. Though the compiler will optimise, there is no point letting code that never makes it into the running program get in the way of readability. (See rewrite for a simpler version.)
Rewrite
- Removes use of auto. (Up to you.)
- Rearranges declarations.
- Uses size_t for all counting.
- Some renaming.
- Open blocks on the same line (keeping up with modern C-like language layouts).
#include <algorithm>
#include <queue>
#include <unordered_map>
#include <utility>
#include <vector>

std::vector<int> TopKFreqElem(const std::vector<int>& nums, const std::size_t k) {
    std::unordered_map<int, std::size_t> counts;
    for (int num : nums) { counts[num]++; }

    std::priority_queue<std::pair<std::size_t, int>> frequent;
    for (const std::pair<const int, std::size_t>& p : counts) {
        frequent.push({p.second, p.first});
    }

    std::size_t n = std::min(frequent.size(), k);
    std::vector<int> topKNums;
    while (n--) {
        topKNums.push_back(frequent.top().second);
        frequent.pop();
    }
    return topKNums;
}
- It's probably worth topKNums.reserve(n) in there before you start appending elements. – Toby Speight, Dec 14, 2022 at 21:59
Since you said
Ideally, I'd have a container that orders the elements based on the frequency
It is actually possible to do that, and it might be useful if you want to track the most frequent values as each element is added. The trick is to keep the counts in a list, and maintain the order of the list so that it's in descending order of count.
To find the list elements, we use a map (or unordered map) from value to an iterator into the list - this is safe because our list operations will not invalidate the iterators. It's cheap - O(1) - to reorder nodes within a list using splice(), so we have a container where adding elements is O(1) if they are hashable, else O(log N) if sortable.
The implementation looks like this (just implementing the two methods needed here - adding clear(), remove() etc. is left as an exercise for the interested reader):
#include <algorithm>
#include <list>
#include <map>
#include <iterator>
#include <ranges>
#include <unordered_map>
#include <vector>
template<typename T>
concept is_hashable = requires { std::hash<T>{}; };
template<typename T>
class frequency_ordered_multiset
{
    struct item { const T val; std::size_t count; };

    // We use a list because its iterators stay valid after modifications.
    // The list is always in descending order of count.
    std::list<item> items = {};

    using items_iter = std::list<item>::iterator;

    std::conditional_t<is_hashable<T>,
                       std::unordered_map<T, items_iter>,
                       std::map<T, items_iter>>
        by_value = {};

public:
    void insert(const T& val)
    {
        auto map_it = by_value.find(val);
        if (map_it == by_value.end()) {
            // first appearance - put it at end of list
            items.emplace_back(val, 1);
            auto list_it = items.end();
            by_value.emplace(val, --list_it);
            return;
        }

        // update existing item
        auto const list_it = map_it->second;
        auto const new_count = ++list_it->count;

        // search backwards for higher or equal count
        auto const other_it =
            std::ranges::lower_bound(std::reverse_iterator{list_it}, items.rend(),
                                     new_count, std::less{}, &item::count).base();
        if (other_it != list_it) {
            // move the item (without invalidating iterators)
            items.splice(other_it, items, list_it);
        }
    }

    auto most_frequent(std::size_t k)
    {
        return items
             | std::views::take(k)
             | std::views::transform(&item::val);
    }
};
#include <array>
#include <iostream>
int main()
{
    static const std::array v{1, 1, 3, 3, 2, 2, 2, 3, 3, 3, 3, 4, 4};
    frequency_ordered_multiset<int> fom;
    for (auto i: v) {
        fom.insert(i);
        std::ranges::copy(fom.most_frequent(2),
                          std::ostream_iterator<int>(std::cout, ", "));
        std::cout << '\n';
    }
}
- I posted this code for review as Container that orders its elements based on their frequency. – Toby Speight, Dec 19, 2022 at 16:47
Consider simplicity:
- Get a copy of the range in a vector.
- Sort it.
- For each chunk of equivalent elements, push a (count, element) pair onto a limited-size heap.
- Return the elements in the heap.
The big advantages are nearly eliminated allocation-overhead, and much improved locality of reference.
Though if there are few unique elements which are oft-repeated, using a hashmap probably wins.
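A rough sketch of that sort-then-scan approach, keeping the heap capped at k entries (the name TopKBySorting is purely illustrative):

#include <algorithm>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

std::vector<int> TopKBySorting(std::vector<int> nums, std::size_t k)  // deliberate copy of the range
{
    std::sort(nums.begin(), nums.end());

    // Min-heap of (count, value) pairs, never holding more than k entries.
    using entry = std::pair<std::size_t, int>;
    std::priority_queue<entry, std::vector<entry>, std::greater<>> heap;

    for (std::size_t i = 0; i < nums.size();) {
        std::size_t j = i;
        while (j < nums.size() && nums[j] == nums[i]) { ++j; }  // one chunk of equal elements
        heap.push({j - i, nums[i]});
        if (heap.size() > k) { heap.pop(); }  // drop the least frequent seen so far
        i = j;
    }

    std::vector<int> result;
    result.reserve(heap.size());
    while (!heap.empty()) {
        result.push_back(heap.top().second);
        heap.pop();
    }
    return result;  // least frequent of the top K first; reverse if needed
}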