Counting the number of distinct words in the input

Question 1

Below is a working program I wrote. Please provide comments to help me improve my coding and problem solving skills. I am learning C++ myself by reading Accelerated C++ by Andrew Koenig.

#include <iostream>
#include <vector>
using namespace std;
int main()
{
 vector<string> words;
 cout<<"Please Enter words(Press Ctrl+Z in the end)"<<endl;
 string x; //Word Input
 cin>>x;
 words.push_back(x); //The first word
 int ndw=1; //Number of distinct words
 while(cin>>x) //Input new word
 {
 for(unsigned int counter = 0; counter!=words.size(); ++counter)
 {
 //Check if we already have this word in our list
 if(x!=words[counter])
 {
 if(counter==words.size()-1)//We have reached the end of list
 {
 words.push_back(x);
 ndw+=1;
 }
 }
 else
 {
 //If there is a match, leave this word
 break;
 }
 }
 }
 cout<<"number of distinct words are: "<<ndw;
 return 0;
}

Here is a sample output

Sample Output 1

Another Sample output

enter image description here

Question 2

Here are some things that may help you improve your program.

Don't abuse `using namespace std`

Putting using namespace std at the top of every program is a bad habit that you'd do well to avoid.

Use all the required `#include` files

The code uses std::string but doesn't include the corresponding file. The code should have this line added:

#include <string>

Decompose the program into smaller parts

Right now, all of the code is in main which isn't necessarily wrong, but it means that it's not only hard to reuse but also hard to troubleshoot. Better is to separate the code into small chunks. It makes it both easier to understand and easier to fix or improve.

Use an appropriate data structure

The current code uses a std::vector to hold the words and linearly searches for each new word. Far better would be to use a std::unordered_map for this. Here's a version which not only counts each unique word, but also counts the number of occurrences of each word:

#include <iostream>
#include <string>
#include <unordered_map>
int main() {
 std::unordered_map<std::string, unsigned> dict;
 for (std::string word; std::cin >> word; ) {
 ++dict[word];
 }
 std::cout << "number of distinct words are: " << dict.size() << "\n";
}

There is a significant performance benefit. I used both the original and this version to count all the words in the Project Gutenberg eBook of Dracula, by Bram Stoker. Both correctly reported 19027 distinct words, but the original took 1.011 s and the version above took 0.057 s (17 times faster).

Don't use std::endl unless you really need to flush the stream

The difference between std::endl and '\n' is that std::endl actually flushes the stream. This can be a costly operation in terms of processing time, so it's best to get in the habit of only using it when flushing the stream is actually required. It's not for this code.

Use a newer book

The book you have was fine in its day, but it is now woefully out of date. I'd recommend Stroustrup's book "A Tour of C++" instead, since the current edition covers C++11 which is a very much improved and much different language than previous versions of C++.

Omit `return 0`

When a C or C++ program reaches the end of main the compiler will automatically generate code to return 0, so there is no need to put return 0; explicitly at the end of main.

Note: when I make this suggestion, it's almost invariably followed by one of two kinds of comments: "I didn't know that." or "That's bad advice!" My rationale is that it's safe and useful to rely on compiler behavior explicitly supported by the standard. For C, since C99; see ISO/IEC 9899:1999 section 5.1.2.2.3:

[...] a return from the initial call to the main function is equivalent to calling the exit function with the value returned by the main function as its argument; reaching the } that terminates the main function returns a value of 0.

For C++, since the first standard in 1998; see ISO/IEC 14882:1998 section 3.6.1:

If control reaches the end of main without encountering a return statement, the effect is that of executing return 0;

All versions of both standards since then (C99 and C++98) have maintained the same idea. We rely on automatically generated member functions in C++, and few people write explicit return; statements at the end of a void function. Reasons against omitting seem to boil down to "it looks weird". If, like me, you're curious about the rationale for the change to the C standard read this question. Also note that in the early 1990s this was considered "sloppy practice" because it was undefined behavior (although widely supported) at the time.

So I advocate omitting it; others disagree (often vehemently!) In any case, if you encounter code that omits it, you'll know that it's explicitly supported by the standard and you'll know what it means.

Question 3

I'm happy to be of service.

Question 4

Your advice on return 0 is identical to mine. But it get boring typing it out all the time (so I am sometimes lazy and more abrupt). Is there a way we can start a common review answers section and just link to that each time?

Question 5

I think that's a great idea. Let's figure out how to make that happen. Have you time to chat in 2nd Monitor?

Question 6

You can try using a set instead vector. Set keeps distinct elements

It's like this

#include<set>// include this library
---------------------------------------------------------
set<string> words;
while(cin >> x)
 words.insert(x);
cout << words.size() << endl;

You can use this reference http://www.cplusplus.com/reference/set/set/?kw=set

Question 7

You can even count each word using std::map:

std::map< std::string, std::size_t > m;
std::string s;
while (std::cin >> s) {
 ++m[s];
}

Edward Edward 67.2k4 gold badges120 silver badges284 bronze badges · Accepted Answer · 2017-03-06 16:49:23Z

Here are some things that may help you improve your program.

Don't abuse `using namespace std`

Putting using namespace std at the top of every program is a bad habit that you'd do well to avoid.

Use all the required `#include` files

The code uses std::string but doesn't include the corresponding file. The code should have this line added:

#include <string>

Decompose the program into smaller parts

Right now, all of the code is in main which isn't necessarily wrong, but it means that it's not only hard to reuse but also hard to troubleshoot. Better is to separate the code into small chunks. It makes it both easier to understand and easier to fix or improve.

Use an appropriate data structure

The current code uses a std::vector to hold the words and linearly searches for each new word. Far better would be to use a std::unordered_map for this. Here's a version which not only counts each unique word, but also counts the number of occurrences of each word:

#include <iostream>
#include <string>
#include <unordered_map>
int main() {
 std::unordered_map<std::string, unsigned> dict;
 for (std::string word; std::cin >> word; ) {
 ++dict[word];
 }
 std::cout << "number of distinct words are: " << dict.size() << "\n";
}

There is a significant performance benefit. I used both the original and this version to count all the words in the Project Gutenberg eBook of Dracula, by Bram Stoker. Both correctly reported 19027 distinct words, but the original took 1.011 s and the version above took 0.057 s (17 times faster).

Don't use std::endl unless you really need to flush the stream

The difference between std::endl and '\n' is that std::endl actually flushes the stream. This can be a costly operation in terms of processing time, so it's best to get in the habit of only using it when flushing the stream is actually required. It's not for this code.

Use a newer book

The book you have was fine in its day, but it is now woefully out of date. I'd recommend Stroustrup's book "A Tour of C++" instead, since the current edition covers C++11 which is a very much improved and much different language than previous versions of C++.

Omit `return 0`

When a C or C++ program reaches the end of main the compiler will automatically generate code to return 0, so there is no need to put return 0; explicitly at the end of main.

Note: when I make this suggestion, it's almost invariably followed by one of two kinds of comments: "I didn't know that." or "That's bad advice!" My rationale is that it's safe and useful to rely on compiler behavior explicitly supported by the standard. For C, since C99; see ISO/IEC 9899:1999 section 5.1.2.2.3:

[...] a return from the initial call to the main function is equivalent to calling the exit function with the value returned by the main function as its argument; reaching the } that terminates the main function returns a value of 0.

For C++, since the first standard in 1998; see ISO/IEC 14882:1998 section 3.6.1:

If control reaches the end of main without encountering a return statement, the effect is that of executing return 0;

All versions of both standards since then (C99 and C++98) have maintained the same idea. We rely on automatically generated member functions in C++, and few people write explicit return; statements at the end of a void function. Reasons against omitting seem to boil down to "it looks weird". If, like me, you're curious about the rationale for the change to the C standard read this question. Also note that in the early 1990s this was considered "sloppy practice" because it was undefined behavior (although widely supported) at the time.

So I advocate omitting it; others disagree (often vehemently!) In any case, if you encounter code that omits it, you'll know that it's explicitly supported by the standard and you'll know what it means.

Your advice on return 0 is identical to mine. But it get boring typing it out all the time (so I am sometimes lazy and more abrupt). Is there a way we can start a common review answers section and just link to that each time?
I think that's a great idea. Let's figure out how to make that happen. Have you time to chat in 2nd Monitor?

Stack Exchange Network

Counting the number of distinct words in the input

3 Answers 3

Don't abuse `using namespace std`

Use all the required `#include` files

Decompose the program into smaller parts

Use an appropriate data structure

Don't use std::endl unless you really need to flush the stream

Use a newer book

Omit `return 0`

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Counting the number of distinct words in the input

3 Answers 3

Don't abuse using namespace std

Use all the required #include files

Decompose the program into smaller parts

Use an appropriate data structure

Don't use std::endl unless you really need to flush the stream

Use a newer book

Omit return 0

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

Don't abuse `using namespace std`

Use all the required `#include` files

Omit `return 0`