Count word frequencies, and print them most-frequent first

Question 1

I'm writing a comparison of a simple word-counting program in various languages. The task is to count the frequencies of unique, space-separated words in the input, and output the results most frequent first. Case should be normalized to lowercase (ASCII is okay).

But it's been a long time since I've written C++ (before the C++11 days). Is this idiomatic modern C++? Are there any improvements I could make to make it more idiomatic or simpler? (I'm not looking for efficiency improvements here.) I want to stick to the standard library.

#include <algorithm>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;
int main() {
    string word;
    unordered_map<string, int> counts;
    while (cin >> word) {
        transform(word.begin(), word.end(), word.begin(),
                  [](unsigned char c){ return tolower(c); });
        counts[word]++;
    }
    vector<pair<string, int>> ordered(counts.begin(), counts.end());
    sort(ordered.begin(), ordered.end(), [](auto &a, auto &b) {
        return a.second > b.second;
    });
    for (auto count : ordered) {
        cout << count.first << " " << count.second << "\n";
    }
    return 0;
}

Question 2

When the article is ready, it would be great if you were to link to it from here (assuming it will be on the Web).

Question 3

@TobySpeight Good idea -- will do!

Question 4

I spent more time than I should have trying to answer this question using std::make_heap, since it brings the largest value to the top automatically. It did not go well...
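
For what it's worth, here is a rough sketch of how a heap-based version might look. The helper name by_frequency_via_heap and the word_count alias are made up for illustration. Note that std::make_heap only guarantees the largest element is at the front, so getting the full order still means popping every element, which is effectively a heap sort.

```cpp
#include <algorithm>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using word_count = std::pair<std::string, int>;

// Hypothetical helper: returns the entries of `counts`, most frequent first,
// by building a max-heap and popping it until it is empty.
std::vector<word_count> by_frequency_via_heap(
        const std::unordered_map<std::string, int>& counts) {
    std::vector<word_count> heap(counts.begin(), counts.end());

    // "Less than" on the count, so the largest count ends up at the front.
    auto less_frequent = [](const word_count& a, const word_count& b) {
        return a.second < b.second;
    };
    std::make_heap(heap.begin(), heap.end(), less_frequent);

    std::vector<word_count> ordered;
    ordered.reserve(heap.size());
    while (!heap.empty()) {
        // pop_heap moves the current maximum to the back of the range.
        std::pop_heap(heap.begin(), heap.end(), less_frequent);
        ordered.push_back(std::move(heap.back()));
        heap.pop_back();
    }
    return ordered;
}
```

This ends up longer than copying into a vector and calling std::sort with a comparator, which may be why the heap route feels awkward for this task.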

Question 5

Published article: benhoyt.com/writings/count-words

Question 6

Mainly this looks fine.

using namespace std; is a bad habit to get into because of potential name collisions. Prefer to write std:: where necessary.
counts[word]++; We don't need the un-incremented copy of the count, so we should use pre-increment here instead: ++counts[word];
sort(ordered.begin(), ordered.end(), [](auto &a, auto &b): we don't alter a or b in the lambda, so these should be const&.
for (auto count : ordered) copies the pairs. Again, we should be using auto const& here.
return 0; We don't need this, as main() implicitly returns 0 when control reaches its end. If we keep an explicit return, we might use a named constant instead of 0, specifically EXIT_SUCCESS from <cstdlib>.

Question 7

You missed the failure to include <cctype>, needed for std::tolower().

Question 8

Thank you -- appreciate these tweaks!

Question 9

user673679's answer deals with low-level review; I'll look at the algorithm.

First, top marks for avoiding the most common error with functions from <cctype> - it's vitally important to pass the value as unsigned char promoted to int, rather than plain char.
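
To make that concrete, here is a minimal sketch of a wrapper that always performs the conversion. The names to_lower_char and to_lower_copy are made up for illustration; they are not from the original code.

```cpp
#include <cctype>
#include <string>
#include <utility>

// Hypothetical helper: lowercases one char via std::tolower.
// The cast to unsigned char guarantees that the int passed to std::tolower
// is non-negative, even on platforms where plain char is signed.
char to_lower_char(char c) {
    return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
}

// Lowercases a whole string using the helper above.
std::string to_lower_copy(std::string s) {
    for (char& c : s) {
        c = to_lower_char(c);
    }
    return s;
}
```

The lambda in the reviewed code does the same thing more compactly: declaring its parameter as unsigned char performs the conversion when the lambda is called.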

Rather than writing my own loop for reading the words, I would consider transform() from an input-stream iterator to a map inserter. That said, we'd need to construct a custom iterator for the inserter we want. It would be very straightforward if we were using a std::multiset, but that would use more memory (as it stores all the added objects). We could create a "counter" class with appropriate push_back(), if we were likely to use it again.

After the while loop, we should test std::cin.bad() (or set the stream to throw on error). At present, any stream error is ignored, and we proceed as if we'd read the entire input, giving misleading results.

Instead of populating a vector, and subsequently sorting it, we might prefer to insert directly into an ordered container (std::set perhaps).

Something we can do in modern C++ is to write nested functions, by assigning a lambda expression to a variable. This may be clearer than writing the lambda inline. Or we can use an anonymous namespace for functions with internal linkage.

Here's my modification of the code to have no explicit loops, or indeed any flow-control statements:

#include <algorithm>
#include <cctype>
#include <iostream>
#include <iterator>
#include <set>
#include <string>
#include <unordered_map>
#include <utility>

namespace {
    auto downcase(std::string s) {
        std::transform(s.begin(), s.end(), s.begin(),
                       [](unsigned char c){ return std::tolower(c); });
        return s;
    }
}

int main()
{
    using counter = std::unordered_map<std::string, unsigned>;
    using in_it = std::istream_iterator<std::string>;
    using out_it = std::ostream_iterator<std::string>;
    counter counts;
    auto insert = [&](std::string s) { ++counts[downcase(std::move(s))]; };
    // read words into counter
    std::cin.exceptions(std::istream::badbit);
    std::for_each(in_it{std::cin}, in_it{}, insert);
    // sort by frequency, then alphabetical
    auto by_freq_then_alpha = [](const auto &a, const auto &b) {
        return std::pair{ b.second, a.first } < std::pair{ a.second, b.first};
    };
    const std::set ordered{counts.begin(), counts.end(), by_freq_then_alpha};
    // write the output
    auto format = [](const auto& count) {
        return count.first + ' ' + std::to_string(count.second) + '\n';
    };
    std::transform(ordered.begin(), ordered.end(), out_it{std::cout}, format);
}

For a more modern take, C++20 includes the Ranges library, which lets us use views to transform collections:

// Assumes the includes and downcase() from the listing above, plus <ranges>.
int main()
{
    using counter = std::unordered_map<std::string, unsigned>;
    counter counts;
    auto insert = [&counts](std::string s) { ++counts[downcase(std::move(s))]; };
    // read words into counter
    std::cin.exceptions(std::istream::badbit);
    auto input_words = std::ranges::istream_view<std::string>(std::cin);
    std::ranges::for_each(input_words, insert);
    // sort by frequency, then alphabetical
    auto by_freq_then_alpha = [](const auto &a, const auto &b) {
        return std::pair{ b.second, a.first } < std::pair{ a.second, b.first};
    };
    const std::set ordered{counts.begin(), counts.end(), by_freq_then_alpha};
    // write the output
    auto write_out = [](const auto& count) {
        std::cout << count.first << ' ' << count.second << '\n';
    };
    std::ranges::for_each(ordered, write_out);
}

I'm not saying that either of these is necessarily what you should write; at least for now, range-based for loops are more familiar to most C++ programmers, and so are probably clearest. However, these versions showcase some of the options we have in modern C++.

Question 10

While this is correct and very idiomatic C++, I actually slightly prefer the OP's code for readability: it's slightly shorter and easier to scan. Then again, I have 20 years of habit and experience reading for loops, and they just read clearer to me than transform, for_each and company from <algorithm>. They do have their place, though.

Question 11

Well, this has been an education, thank you! Definitely a very different, more functional approach. I do prefer the plainer, for-loop approach (perhaps because this reminds me of Scala, which I don't have warm fuzzy feelings for). And thanks for the tip about error handling.

Question 12

Added: structured bindings

Changed: switched to range-based algorithms (C++20). This removes the need for begin()/end(), and you can use a projection in the sort call.

Removed: lambdas - not needed

#include <algorithm>
#include <cctype>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>
using namespace std;
int main()
{
    unordered_map<string, int> counts;
    string word;
    while (cin >> word) {
        ranges::transform(word, word.begin(), ::tolower);
        counts[word]++;
    }
    using word_count = pair<string, int>;
    vector<word_count> ordered(counts.begin(), counts.end());
    ranges::sort(ordered, greater{}, &word_count::second);
    for (auto [word, count] : ordered) {
        cout << word << " " << count << "\n";
    }
}

Question 13

The lambda wrapping std::tolower is very definitely needed: passing a plain char directly to the <cctype> functions, so that it is promoted straight to int, is UB for negative values, which can occur on platforms where char is signed. Always convert to unsigned char before converting to int.

Question 14

The range stuff is great - thanks for helping me get more proficient with that part of the Library!

user673679 · Accepted Answer · 2021-03-09 08:40:07Z
