Count the frequencies of each word in a file

Question 1

I have tried to implement a code that does the following:

store words in an array (unordered)
use function order_array() to place repetitions for the same word near (so the result, given a sample text file, could be: mom mom mom mom lamb lamb eye eye pain tears)
use function count_frequency to store every word one single time + its frequency (I store them in a linked list)
print the linked list to obtain given result

The code works, but it appears (to my eye) cumbersone and unefficient. How can I improve it?

Note that I am not familiar with concepts like vectors, hash maps or dictionaries yet, but I have a good grasp of basic data structures (eg linked lists, stacks, queues, binary search trees). Still, any hint, help or advice would be greatly appreciated

using namespace std;
#include<iostream>
#include<fstream>
#include<string>
const int DIM = 50;
int count = 0;
string list[DIM];
string ordered_list[DIM];
struct node
{
 string word;
 int frequency;
 node* next;
};
node* head = NULL;
node* remove_node(node* &h)
{
 node* s = head;
 head = s->next;
 return s; // and then remove everything
}
void add_node(string &w, int &f, node* &h)
{
 node* t = new node;
 t->word = w;
 t-> frequency = f;
 t->next = h;
 h = t;
}
void print_highestf(node* &h)
{
 node* s = h;
 int higherf = 0;
 while(s!=NULL)
 {
 if((s->frequency) >= higherf)
 {
 higherf = s->frequency;
 }
 s = s->next;
 }
 cout << "The highest frequency for a word is: " << higherf << endl;
}
void print_res(node* &h)
{
 for(node* s = head; s != NULL; s = s->next)
 {
 cout << "Frequency for word " << s->word << ": " << s->frequency << endl;
 cout << "\n";
 }
}
void upload_array(string &w, string a[])
{
 
 a[count] = w;
 count ++;
 
}
bool is_new_string(string &word, int position) //check better
{
 bool repeated = false;
 for(int i = 0; i < position; i++)
 {
 if(word == list[i] && repeated == true)
 {
 return false;
 }
 else if(word == list[i] && repeated == false)
 {
 repeated = true;
 }
 }
 return true;
}
void order_array(string a[], string oa[])
{
 int i = 0;
 int times = 0;
 while(i!= count)
 {
 int j = i;
 while(j != count)
 {
 if((a[i] == a[j]) && is_new_string((a[i]), i+1))
 {
 //new node 
 oa[times]=a[i]; 
 times++;
 }
 
 j++;
 }
 i++;
 }
}
void count_frequency(string oa[])
{
 int frequency = 1;
 for(int i = 0; i<count-1; i++)
 {
 if(oa[i] != oa[i+1])
 {
 add_node(oa[i], frequency, head);
 frequency = 1;
 }
 else if(oa[i] == oa[i+1])
 {
 frequency++;
 }
 }
}
int main()
{
 fstream mystream;
 mystream.open("words.txt", ios::in);
 string word;
 
 //upload array of strings
 while(!mystream.eof())
 {
 mystream >> word;
 upload_array(word, list);
 
 }
 
 order_array(list, ordered_list);
 count_frequency(ordered_list);
 
 print_res(head);
 print_highestf(head);
 
 
 for(node* s= head; s!=NULL; s = s->next)
 {
 node* r = remove_node(head);
 delete r;
 }
 cout << "\nDeinitialization executed. " << endl;
 mystream.close();
 return 0;
}

Question 2

Kudos on writing a working program! You already sense that things can be improved, which is also great. Toby Speight already pointed out some issues in your code. You are a beginning C++ programmer, so it's normal that you don't write optimal code yet; you have just learned the basic tools. To show you how a more experienced C++ programmer would tackle the problem of counting frequencies, let me show you this implementation:

#include <cstdlib>
#include <algorithm>
#include <format>
#include <fstream>
#include <iostream>
#include <ranges>
#include <unordered_map>
#include <vector>
// A struct to hold a value and its frequency
template<typename T>
struct counted_value {
 std::size_t count;
 T value;
 // Add a default comparison operator so we can sort on these structs
 friend auto operator<=>(const counted_value&, const counted_value&) = default;
};
// A templated function that can take any kind of input
template<std::ranges::input_range Range>
auto count_frequencies(Range&& range) {
 // Deduce the type of values stored in the range
 using value_type = decltype(std::ranges::begin(range))::value_type;
 // Count frequencies in a hash map
 std::unordered_map<value_type, std::size_t> counters;
 for (const auto& value: range) {
 counters[value]++;
 }
 // Copy the result into a vector
 std::vector<counted_value<value_type>> result;
 result.reserve(counters.size());
 for (const auto& [value, count]: counters) {
 result.emplace_back(count, value);
 }
 // Sort the vector on the frequency count, largest count first
 std::ranges::sort(result, std::greater{});
 return result;
}
int main() {
 std::ifstream input("words.txt");
 // Create a view of the words in the input stream
 auto words = std::ranges::istream_view<std::string>(input);
 auto frequencies = count_frequencies(words);
 // We should have reached the end, if we didn't something went wrong
 if (!input.eof()) {
 std::cerr << "Error reading input\n";
 return EXIT_FAILURE;
 }
 for (const auto& item: frequencies) {
 std::cout << std::format("Frequency for word {}: {}\n",
 item.value, item.count);
 }
}

I've used a lot of features from the standard library and from the language itself, in particular features from C++20. I'll list them here with links to cppreference.com:

template so you can create a frequency counting function that can count any kind of input, not just an array of strings.
A default comparison operator to make a struct comparable, which helps it sort later.
The concept std::ranges::input_range to restrict the input of count_frequencies() to ranges that can be iterated over. It's not really necessary here, but it results in nicer error messages if you call the function with the wrong type of input.
using to declare a type alias, this is just to save typing.
decltype() to deduce the type of values stored in the input range.
std::unordered_map is basically a dictionary, we use it to count how often a given value appears in the input.
Range-based for loops, this is often simpler and safer than the old for loops where you manually iterate.
std::vector is basically a dynamically sized array. It takes care of memory allocation for you. We use it to store the final result.
std::ranges::sort() provides an easy way to sort containers. Normally it sorts so the smallest value comes first, but we pass it std::greater so we sort it largest value first.
The auto type lets the compiler deduce the type for you. This saves typing and reduces the chance of making mistakes.
std::ranges::istream_view() is a helper that allows you to convert an input stream (like the file you open) into a view that can be iterated over using range-based for loops and with other algorithms.
std::format() makes it much easier to format strings. C++23 also introduced std::print() which makes it even easier to print things, but not all compilers support it yet.

Don't worry about all of this, just continue your programming journey, and eventually you learn most or all of the above.

Question 3

Thank you so much for your detailed answer. There is still a lot to go but your precise explanation helped me seeing what I need to learn in order to improve!

Question 4

Please don't do this:

using namespace std;

That immediately takes away the benefit of namespaces, making it harder to be sure which namespace identifiers come from. It's particularly harmful when you declare names of your own (e.g. list and next) that are the same as those in std.

This is an anti-pattern:

while(!mystream.eof())
{
 mystream >> word;
 upload_array(word, list);
 
}

The EOF flag is set on a stream when input has failed due to reaching the end. It does not guarantee that the next read will succeed. What we want is

 while(mystream >> word)
 {
 upload_array(word, list); 
 }

Whilst implementing your own linked list may be an instructive experience, I think it's more useful to familiarise yourself with the standard library containers. As a C++ programmer, you'll be using these on a daily basis, whereas you may never need to implement one of your own.

This program would be much, much simpler if we used one of the standard map classes (std::map<std::string, std::size_t> if we want the results to be in a repeatable order, or the std::unordered_map equivalent if not). And we wouldn't have any raw pointers, so no need to verify that every new is matched with a delete.

The program would be more useful if it read from std::cin rather than requiring a specifically-named file for its input. That allows the user to redirect from any file, or from a pipe, or even interactively.

The end of the program has some unnecessary code:

cout << "\nDeinitialization executed. " << endl;
mystream.close();
return 0;

There's no need to flush the output stream, given that happens automatically at program termination, so we can remove the std::endl and simply add \n to the end of the string literal.

The destructor of mystream will automatically close the file, so that's unnecessary unless we're actually going to take any notice of the result.

Executing off the end of main() (but no other function) is defined as returning 0 (success), so the return can be omitted.

Question 5

Thank you very much for your answer, I have learnt something new and useful. I will treasure what you have said !

G. Sliepen G. Sliepen 68.9k3 gold badges74 silver badges179 bronze badges · Accepted Answer · 2024-02-03 20:14:13Z

Kudos on writing a working program! You already sense that things can be improved, which is also great. Toby Speight already pointed out some issues in your code. You are a beginning C++ programmer, so it's normal that you don't write optimal code yet; you have just learned the basic tools. To show you how a more experienced C++ programmer would tackle the problem of counting frequencies, let me show you this implementation:

#include <cstdlib>
#include <algorithm>
#include <format>
#include <fstream>
#include <iostream>
#include <ranges>
#include <unordered_map>
#include <vector>
// A struct to hold a value and its frequency
template<typename T>
struct counted_value {
 std::size_t count;
 T value;
 // Add a default comparison operator so we can sort on these structs
 friend auto operator<=>(const counted_value&, const counted_value&) = default;
};
// A templated function that can take any kind of input
template<std::ranges::input_range Range>
auto count_frequencies(Range&& range) {
 // Deduce the type of values stored in the range
 using value_type = decltype(std::ranges::begin(range))::value_type;
 // Count frequencies in a hash map
 std::unordered_map<value_type, std::size_t> counters;
 for (const auto& value: range) {
 counters[value]++;
 }
 // Copy the result into a vector
 std::vector<counted_value<value_type>> result;
 result.reserve(counters.size());
 for (const auto& [value, count]: counters) {
 result.emplace_back(count, value);
 }
 // Sort the vector on the frequency count, largest count first
 std::ranges::sort(result, std::greater{});
 return result;
}
int main() {
 std::ifstream input("words.txt");
 // Create a view of the words in the input stream
 auto words = std::ranges::istream_view<std::string>(input);
 auto frequencies = count_frequencies(words);
 // We should have reached the end, if we didn't something went wrong
 if (!input.eof()) {
 std::cerr << "Error reading input\n";
 return EXIT_FAILURE;
 }
 for (const auto& item: frequencies) {
 std::cout << std::format("Frequency for word {}: {}\n",
 item.value, item.count);
 }
}

I've used a lot of features from the standard library and from the language itself, in particular features from C++20. I'll list them here with links to cppreference.com:

template so you can create a frequency counting function that can count any kind of input, not just an array of strings.
A default comparison operator to make a struct comparable, which helps it sort later.
The concept std::ranges::input_range to restrict the input of count_frequencies() to ranges that can be iterated over. It's not really necessary here, but it results in nicer error messages if you call the function with the wrong type of input.
using to declare a type alias, this is just to save typing.
decltype() to deduce the type of values stored in the input range.
std::unordered_map is basically a dictionary, we use it to count how often a given value appears in the input.
Range-based for loops, this is often simpler and safer than the old for loops where you manually iterate.
std::vector is basically a dynamically sized array. It takes care of memory allocation for you. We use it to store the final result.
std::ranges::sort() provides an easy way to sort containers. Normally it sorts so the smallest value comes first, but we pass it std::greater so we sort it largest value first.
The auto type lets the compiler deduce the type for you. This saves typing and reduces the chance of making mistakes.
std::ranges::istream_view() is a helper that allows you to convert an input stream (like the file you open) into a view that can be iterated over using range-based for loops and with other algorithms.
std::format() makes it much easier to format strings. C++23 also introduced std::print() which makes it even easier to print things, but not all compilers support it yet.

Don't worry about all of this, just continue your programming journey, and eventually you learn most or all of the above.

Thank you so much for your detailed answer. There is still a lot to go but your precise explanation helped me seeing what I need to learn in order to improve!

Stack Exchange Network

Count the frequencies of each word in a file

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Count the frequencies of each word in a file

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions