I was trying to do a very common interview problem "Finding the max repeated word in a string " and could not find much resources in net for c/c++ implementation. So I coded it myself here. I have tried to do most of the coding from scratch for better understanding. Could you review my code and provide comments on my algorithm. Some people have suggested using hashtables for storing the count, but am not using hashtables here.
#include<stdafx.h>
#include<stdlib.h>
#include<stdio.h>
#include<string>
#include<iostream>
using namespace std;
string word[10];
//splitting string into words
int parsestr(string str)
{
int index = 0;
int i = 0;
int maxlength = str.length();
int wordcnt = 0;
while(i < maxlength)
{
if(str[i] != ' ')
{
word[index] = word[index] + str[i];
}
else
{
index++; //new word
wordcnt = index;
}
i++;
}
return wordcnt;
}
//find the max word count out of the array and return the word corresponding to that index.
string maxrepeatedWord(int wordcntArr[],int count)
{
int max = 0;
int index = 0;
for(int i = 0; i <= count; i++)
{
if(wordcntArr[i] > max)
{
max = wordcntArr[i];
index = i;
}
}
return word[index];
}
void countwords(int count)
{
int wordcnt = 0;
int wordcntArr[10];
string maxrepeatedword;
for(int i = 0; i <= count; i++)
{
for(int j = 0; j <= count; j++)
{
if(word[i] == word[j])
{
wordcnt++;
//word[j] = "";
}
else
{}
}
cout << " word " << word[i] << " occurs " << wordcnt << " times " << endl;
wordcntArr[i] = wordcnt;
wordcnt = 0;
}
maxrepeatedword = maxrepeatedWord(wordcntArr,count);
cout << " Max Repeated Word is " << maxrepeatedword;
}
int main()
{
string str = "I am am am good good";
int wordcount = 0;
wordcount = parsestr(str);
countwords(wordcount);
}
3 Answers 3
Here's a long list of feedback. Sorry for the blunt tone, but I'm trying to be brief:
parsestr
mutates a global variable but returns a value that's relevant to that variable (the count). That's inconsistent. Either have it return both the count and the array (or make things easier on yourself and use avector
that knows its size), or make the count global too.You don't handle punctuation characters. Is that a requirement? Should
"a(b)" be one word? Should "a. a" be two unique words?Do you need to split on characters other than space? What about
\t
or\n
? Other Unicode whitespace characters?Your
while(i < maxlength)
loop could more simply be afor
loop.Building up the word by incrementally contatenating characters onto a
string
is slow. It will periodically require dynamic allocation. A more efficient solution would be to remember the start index of the word. When you reach the end, create a string from the substring(start, end)
in one step.Going further from there, there's no reason to even store those words separately since that start, end pair is enough to reconstruct it as needed.
Your program will mysteriously crash if you pass a sentence with more than ten words. :( Since you're using dynamic allocation everywhere else (
string
does that under the hood) there's no reason to use a fixed array for words. At the very least, you should have a constant for that size and check that you don't overflow the array.index
isn't a helpful variable name. Call itwordIndex
.Likewise,
wordcnt
would be better aswordCount
.maxrepeatedWord
is a mixture of camelCase and all lowercase. Be consistent (and camelCase is generally better since it makes it easier for the reader to split it into words).wordcntArr
would be better aswordCounts
. The fact that it's an array is self-evident.i <= count
should be just<
. Arrays are zero-based, sowordcntArr[count]
is past the last valid element.index
->indexOfMax
.You're passing
count
intocountwords
even though that count is relevant to a global variable that it accesses directly. Why not just make the count global too?The
else {}
accomplishes nothing. Remove it.If you declare
wordcnt
inside thefor(int i...
loop, you won't have to reinitialize it each iteration.int wordcntArr[10]
again duplicates the magical literal10
. Use a constant or, better, a dynamically-sized container.You're redundantly recounting each word every time it occurs. With
"I am am am good good"
, you'll get a count array like1 3 3 3 2 2
. If you had a collection of unique words instead, your \$O(n^2)\$ complexity would go down to \$O(mn)\$ where \$m\$ is the number of unique words.A hash table would be a good way to create the set of unique words.
Your test string assumes repeated words will be contiguous (as opposed to, say
"I am good am good am."
). Is that intentional? Desirable?
At a high level, your algorithm is also not optimal. You've got \$O(n^2)\$ performance, ignoring the dynamic allocation as you incrementally build up those strings. With that, it gets worse.
I believe the canonical solution to this is (roughly):
create a map of words -> counts
set wordStart to 0
iterate through the string
if the current character is a word delimiter
set word to the substring from wordStart to here
set wordStart to here
if map contains word
increment the count for that word
else
add the word to the map with a count of 1
end
end
end
set maxWord to null
set max to 0
iterate through the map
if count for this word > max
set maxWord to this word
set max to count
end
end
return maxWord
That gives you \$O(n)\$: you only walk the entire string once.
Some general comments first:
Comment, comment, comment. Although it seems pretty clear now, it doesn't hurt to make it really clear for the guy who has to maintain it when you've moved on.
You might want to have separate functions for doing the counting of the words and for displaying the counts of the words.
C++ Comments:
You are using C++, so you might want to put this in a Class.
The Standard Library is your friend. In particular, you might want to see if
std::map<...>
could make your life a little easier when counting words.The use of
string word[10]
makes an assumption of how many different words there will be. It would be better to allow an arbitrary number of different words by using something like astd::vector
. Alternatively, a different approach could be used (see #4).string.find(..)
andstring.sub_string(..)
might prove helpful.
Clarify the requirements
There are two aspects of the problem statement that are unclear (that's quite probably intentional, given that this is an interview question).
What constitutes a "word" here? Splitting the string on spaces is one possibility, but means that a word bounded by punctuation will be considered different to the same word standing alone. We can't completely ignore punctuation, though -
cant
andcan't
are completely different words, for example.What if there isn't a single most frequent word? There might be a draw, or there might be no words in the input at all. What should we return in those cases? (My recommendation: always return a container (e.g.
std::vector
) of results, and let the caller choose what to do if it's length is not 1.)
If you're not in a position to obtain answers to these questions, it's important to include comments indicating the assumptions you've made (i.e. your guesses at the answers).