Max repeated word in a string

Question 1

I was trying to do a very common interview problem "Finding the max repeated word in a string " and could not find much resources in net for c/c++ implementation. So I coded it myself here. I have tried to do most of the coding from scratch for better understanding. Could you review my code and provide comments on my algorithm. Some people have suggested using hashtables for storing the count, but am not using hashtables here.

#include<stdafx.h>
#include<stdlib.h>
#include<stdio.h>
#include<string>
#include<iostream>
using namespace std;
string word[10];
//splitting string into words
int parsestr(string str)
{ 
 int index = 0;
 int i = 0;
 int maxlength = str.length();
 int wordcnt = 0;
 while(i < maxlength)
 {
 if(str[i] != ' ')
 {
 word[index] = word[index] + str[i];
 }
 else
 {
 index++; //new word
 wordcnt = index;
 }
 i++;
 }
 return wordcnt;
}
//find the max word count out of the array and return the word corresponding to that index.
string maxrepeatedWord(int wordcntArr[],int count)
{
 int max = 0;
 int index = 0;
 for(int i = 0; i <= count; i++)
 {
 if(wordcntArr[i] > max)
 {
 max = wordcntArr[i];
 index = i;
 }
 }
 return word[index];
}
void countwords(int count)
{
 int wordcnt = 0;
 int wordcntArr[10];
 string maxrepeatedword;
 for(int i = 0; i <= count; i++)
 {
 for(int j = 0; j <= count; j++)
 {
 if(word[i] == word[j])
 {
 wordcnt++;
 //word[j] = "";
 }
 else
 {}
 }
 cout << " word " << word[i] << " occurs " << wordcnt << " times " << endl;
 wordcntArr[i] = wordcnt;
 wordcnt = 0;
 }
 maxrepeatedword = maxrepeatedWord(wordcntArr,count);
 cout << " Max Repeated Word is " << maxrepeatedword;
}
int main()
{
 string str = "I am am am good good";
 int wordcount = 0;
 wordcount = parsestr(str);
 countwords(wordcount);
}

Question 2

Here's a long list of feedback. Sorry for the blunt tone, but I'm trying to be brief:

parsestr mutates a global variable but returns a value that's relevant to that variable (the count). That's inconsistent. Either have it return both the count and the array (or make things easier on yourself and use a vector that knows its size), or make the count global too.
You don't handle punctuation characters. Is that a requirement? Should
"a(b)" be one word? Should "a. a" be two unique words?
Do you need to split on characters other than space? What about \t or \n? Other Unicode whitespace characters?
Your while(i < maxlength) loop could more simply be a for loop.
Building up the word by incrementally contatenating characters onto a string is slow. It will periodically require dynamic allocation. A more efficient solution would be to remember the start index of the word. When you reach the end, create a string from the substring (start, end) in one step.

Going further from there, there's no reason to even store those words separately since that start, end pair is enough to reconstruct it as needed.
Your program will mysteriously crash if you pass a sentence with more than ten words. :( Since you're using dynamic allocation everywhere else (string does that under the hood) there's no reason to use a fixed array for words. At the very least, you should have a constant for that size and check that you don't overflow the array.
index isn't a helpful variable name. Call it wordIndex.
Likewise, wordcnt would be better as wordCount.
maxrepeatedWord is a mixture of camelCase and all lowercase. Be consistent (and camelCase is generally better since it makes it easier for the reader to split it into words).
wordcntArr would be better as wordCounts. The fact that it's an array is self-evident.
i <= count should be just <. Arrays are zero-based, so wordcntArr[count] is past the last valid element.
index -> indexOfMax.
You're passing count into countwords even though that count is relevant to a global variable that it accesses directly. Why not just make the count global too?
The else {} accomplishes nothing. Remove it.
If you declare wordcnt inside the for(int i... loop, you won't have to reinitialize it each iteration.
int wordcntArr[10] again duplicates the magical literal 10. Use a constant or, better, a dynamically-sized container.
You're redundantly recounting each word every time it occurs. With "I am am am good good", you'll get a count array like 1 3 3 3 2 2. If you had a collection of unique words instead, your \$O(n^2)\$ complexity would go down to \$O(mn)\$ where \$m\$ is the number of unique words.

A hash table would be a good way to create the set of unique words.
Your test string assumes repeated words will be contiguous (as opposed to, say "I am good am good am."). Is that intentional? Desirable?

At a high level, your algorithm is also not optimal. You've got \$O(n^2)\$ performance, ignoring the dynamic allocation as you incrementally build up those strings. With that, it gets worse.

I believe the canonical solution to this is (roughly):

create a map of words -> counts
set wordStart to 0
iterate through the string
 if the current character is a word delimiter
 set word to the substring from wordStart to here
 set wordStart to here
 if map contains word
 increment the count for that word
 else
 add the word to the map with a count of 1
 end
 end
end
set maxWord to null
set max to 0
iterate through the map
 if count for this word > max
 set maxWord to this word
 set max to count
 end
end
return maxWord

That gives you \$O(n)\$: you only walk the entire string once.

Question 3

Some general comments first:

Comment, comment, comment. Although it seems pretty clear now, it doesn't hurt to make it really clear for the guy who has to maintain it when you've moved on.
You might want to have separate functions for doing the counting of the words and for displaying the counts of the words.

C++ Comments:

You are using C++, so you might want to put this in a Class.
The Standard Library is your friend. In particular, you might want to see if std::map<...> could make your life a little easier when counting words.
The use of string word[10] makes an assumption of how many different words there will be. It would be better to allow an arbitrary number of different words by using something like a std::vector. Alternatively, a different approach could be used (see #4).
string.find(..) and string.sub_string(..) might prove helpful.

Question 4

Clarify the requirements

There are two aspects of the problem statement that are unclear (that's quite probably intentional, given that this is an interview question).

What constitutes a "word" here? Splitting the string on spaces is one possibility, but means that a word bounded by punctuation will be considered different to the same word standing alone. We can't completely ignore punctuation, though - cant and can't are completely different words, for example.
What if there isn't a single most frequent word? There might be a draw, or there might be no words in the input at all. What should we return in those cases? (My recommendation: always return a container (e.g. std::vector) of results, and let the caller choose what to do if it's length is not 1.)

If you're not in a position to obtain answers to these questions, it's important to include comments indicating the assumptions you've made (i.e. your guesses at the answers).

munificent munificent 1,33411 silver badges9 bronze badges · Answer 1 · 2011-03-23 18:40:46Z

Here's a long list of feedback. Sorry for the blunt tone, but I'm trying to be brief:

parsestr mutates a global variable but returns a value that's relevant to that variable (the count). That's inconsistent. Either have it return both the count and the array (or make things easier on yourself and use a vector that knows its size), or make the count global too.
You don't handle punctuation characters. Is that a requirement? Should
"a(b)" be one word? Should "a. a" be two unique words?
Do you need to split on characters other than space? What about \t or \n? Other Unicode whitespace characters?
Your while(i < maxlength) loop could more simply be a for loop.
Building up the word by incrementally contatenating characters onto a string is slow. It will periodically require dynamic allocation. A more efficient solution would be to remember the start index of the word. When you reach the end, create a string from the substring (start, end) in one step.

Going further from there, there's no reason to even store those words separately since that start, end pair is enough to reconstruct it as needed.
Your program will mysteriously crash if you pass a sentence with more than ten words. :( Since you're using dynamic allocation everywhere else (string does that under the hood) there's no reason to use a fixed array for words. At the very least, you should have a constant for that size and check that you don't overflow the array.
index isn't a helpful variable name. Call it wordIndex.
Likewise, wordcnt would be better as wordCount.
maxrepeatedWord is a mixture of camelCase and all lowercase. Be consistent (and camelCase is generally better since it makes it easier for the reader to split it into words).
wordcntArr would be better as wordCounts. The fact that it's an array is self-evident.
i <= count should be just <. Arrays are zero-based, so wordcntArr[count] is past the last valid element.
index -> indexOfMax.
You're passing count into countwords even though that count is relevant to a global variable that it accesses directly. Why not just make the count global too?
The else {} accomplishes nothing. Remove it.
If you declare wordcnt inside the for(int i... loop, you won't have to reinitialize it each iteration.
int wordcntArr[10] again duplicates the magical literal 10. Use a constant or, better, a dynamically-sized container.
You're redundantly recounting each word every time it occurs. With "I am am am good good", you'll get a count array like 1 3 3 3 2 2. If you had a collection of unique words instead, your \$O(n^2)\$ complexity would go down to \$O(mn)\$ where \$m\$ is the number of unique words.

A hash table would be a good way to create the set of unique words.
Your test string assumes repeated words will be contiguous (as opposed to, say "I am good am good am."). Is that intentional? Desirable?

At a high level, your algorithm is also not optimal. You've got \$O(n^2)\$ performance, ignoring the dynamic allocation as you incrementally build up those strings. With that, it gets worse.

I believe the canonical solution to this is (roughly):

create a map of words -> counts
set wordStart to 0
iterate through the string
 if the current character is a word delimiter
 set word to the substring from wordStart to here
 set wordStart to here
 if map contains word
 increment the count for that word
 else
 add the word to the map with a count of 1
 end
 end
end
set maxWord to null
set max to 0
iterate through the map
 if count for this word > max
 set maxWord to this word
 set max to count
 end
end
return maxWord

That gives you \$O(n)\$: you only walk the entire string once.

jwernerny jwernerny 3011 silver badge3 bronze badges · Answer 2 · 2011-03-23 17:57:38Z

Some general comments first:

Comment, comment, comment. Although it seems pretty clear now, it doesn't hurt to make it really clear for the guy who has to maintain it when you've moved on.
You might want to have separate functions for doing the counting of the words and for displaying the counts of the words.

C++ Comments:

You are using C++, so you might want to put this in a Class.
The Standard Library is your friend. In particular, you might want to see if std::map<...> could make your life a little easier when counting words.
The use of string word[10] makes an assumption of how many different words there will be. It would be better to allow an arbitrary number of different words by using something like a std::vector. Alternatively, a different approach could be used (see #4).
string.find(..) and string.sub_string(..) might prove helpful.

Toby Speight Toby Speight 87.9k14 gold badges104 silver badges325 bronze badges · Answer 3 · 2019-04-12 09:59:00Z

Clarify the requirements

There are two aspects of the problem statement that are unclear (that's quite probably intentional, given that this is an interview question).

What constitutes a "word" here? Splitting the string on spaces is one possibility, but means that a word bounded by punctuation will be considered different to the same word standing alone. We can't completely ignore punctuation, though - cant and can't are completely different words, for example.
What if there isn't a single most frequent word? There might be a draw, or there might be no words in the input at all. What should we return in those cases? (My recommendation: always return a container (e.g. std::vector) of results, and let the caller choose what to do if it's length is not 1.)

If you're not in a position to obtain answers to these questions, it's important to include comments indicating the assumptions you've made (i.e. your guesses at the answers).

Stack Exchange Network

Max repeated word in a string

3 Answers 3

Clarify the requirements

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Max repeated word in a string

3 Answers 3

Clarify the requirements

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions