Write a function that takes two parameters:
- a
String
representing a text document- an integer providing the number of items to return
Implement the function such that it returns a list of Strings ordered by word frequency, the most frequently occurring word first.
Runtime not to exceed \$O(n)\$
I decided to use a map to count the frequency of the words and then do the count sort to achieve sorting in linear time. Is this the best approach or there is some other solution that I am not aware of?
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
interface IFrequencyCounter{
public String[] getMostFrequenctWords(String content, int numberOfwords );
}
public class FrequencyCounter implements IFrequencyCounter{
/**
* This is the Frequency Class
* which has only one property count.
* This is used to save autoboxing/unboxing
* overhead when using maps.
*/
private class Frequency{
private Frequency(){
this.count = 1;
}
private int count;
public int getFrequency(){
return this.count;
}
public void incrementFrequency(){
this.count++;
}
public void setCount(int count){
this.count = count;
}
}
/**
* Token class extends Frequency. This is
* used to maintain word frequency relationship.
* (Just like a pair.)
*/
private class Token extends Frequency{
private Token(String word, int count){
super();
this.word = word;
setCount(count);
}
private String word;
}
/**
* This method returns String array sorted by most frequent word. If null/empty string
* is provided as input then empty array is returned. If numberOfWords is greater then
* unique words then all the words are returned.
*
* @see com.evernote.util.IFrequencyCounter#getMostFrequenctWords(java.lang.String, int)
*/
@Override
public String[] getMostFrequenctWords(String content, int numberOfwords) {
// basic validations
if( null == content) content = "";
if(numberOfwords <= 0) return new String[0];
int maxSofar = 0;
HashMap<String,Frequency> wordMap = new HashMap<String,Frequency>();
String[] contentArray = content.split("\\s+");
addWordsToMap(contentArray, wordMap);
return getSortedArray(numberOfwords, maxSofar, wordMap);
}
/**
* returns sorted array of words(String) in decreasing order of frequency.
* @param numberOfwords
* @param i
* @param maxSofar
* @param wordMap
* @param wordsArr
* @return String[]
*/
private String[] getSortedArray(int numberOfwords, int maxSofar, Map<String, Frequency> wordMap) {
String[] mostFreqWordsArr;
int i =0;
Token[] wordsArr = new Token[wordMap.keySet().size()];
for (String key : wordMap.keySet()) {
wordsArr[i++] = new Token(key, wordMap.get(key).getFrequency());
if(maxSofar < wordMap.get(key).getFrequency()){
maxSofar = wordMap.get(key).getFrequency();
}
}
wordMap = null; // just to free memory in case input is a very large string
int[] frequencyArr = new int[maxSofar+1];
String[] stringArr = new String[wordsArr.length];
for(i =0; i<wordsArr.length; i++) frequencyArr[wordsArr[i].getFrequency()] += 1;
for(i =1; i<frequencyArr.length; i++) frequencyArr[i] += frequencyArr[i-1];
for(i= 0; i<wordsArr.length; i++) {
stringArr[frequencyArr[wordsArr[i].getFrequency()]-1] = wordsArr[i].word;
frequencyArr[wordsArr[i].getFrequency()] -=1;
}
if(stringArr.length-numberOfwords >= 0)
mostFreqWordsArr = Arrays.copyOfRange(stringArr, stringArr.length-numberOfwords, stringArr.length);
else
mostFreqWordsArr = Arrays.copyOfRange(stringArr, 0, stringArr.length);
// reverse the string so most frequent words come first
for(i = 0; i < mostFreqWordsArr.length / 2; i++){
String temp = mostFreqWordsArr[i];
mostFreqWordsArr[i] = mostFreqWordsArr[mostFreqWordsArr.length-1 - i];
mostFreqWordsArr[mostFreqWordsArr.length-1 - i] = temp;
}
return mostFreqWordsArr;
}
/**
* @param contentArray
* @param wordMap
*/
private void addWordsToMap(String[] contentArray, Map<String, Frequency> wordMap) {
for (String string : contentArray) {
if(wordMap.containsKey(string)){
wordMap.get(string).incrementFrequency();
}else {
Frequency token = new Frequency();
wordMap.put(string,token);
}
}
}
}
-
\$\begingroup\$ Nicely done for homework. \$\endgroup\$chillworld– chillworld2014年04月26日 06:46:34 +00:00Commented Apr 26, 2014 at 6:46
-
\$\begingroup\$ Lets say if I want to use this code in production then are there any changes which I should make ? \$\endgroup\$wayfare– wayfare2014年04月27日 01:06:02 +00:00Commented Apr 27, 2014 at 1:06
-
\$\begingroup\$ That depends on you, if it works you can, if you want to go for more perfection, listen to the answers and refactor your code to it. \$\endgroup\$chillworld– chillworld2014年04月27日 04:54:54 +00:00Commented Apr 27, 2014 at 4:54
-
\$\begingroup\$ I believe the suggested solution runs in o(nlogn) complexity at the worst case due to the sorting of the map entries so seems to me it fails to answer the question. \$\endgroup\$user87619– user876192015年10月22日 14:18:38 +00:00Commented Oct 22, 2015 at 14:18
2 Answers 2
There are a number of things in here you have done well, and your algorithm is good. I like that you have created an object to contain the frequency. Using the HashMap is the right solution too.
Overall, the OOP, and algorithm is good, but.... the presentation is messy, and there's a couple of things you can do better.
The indentation and code style in general are very inconsistent. I imagine you are using an IDE to write your code (based on the boilerplate JavaDoc comments), so, there is no real excuse for the inconsistent indentation. Maybe you had a problem pasting it in to Code Review... (I see this is your first question). There's a trick in Code Review (and all Stack Exchange sites). When editing your question, you can select the code, and click the "code" button, or press Ctrl-k to indent things 4 spaces to get the formatting right.
Code Style
The indentation is a problem. Part of it is probably the pasting and manual fix up you may have done. But, even then, there are some lines indented one space more than I expect.
The bigger issue is the consistency and conformance-to-standards of where you declare your variables.
It is standard in Java for the class to have the class declaration, static fields, static methods, instance fields, the constructor, then public methods, finally private methods. There are some variations, but, the private variables should always be declared before the constructor....
While we are looking at this class, the variable should be final, and there is no need for the call to super()
in the constructor.
The code:
private class Token extends Frequency{ private Token(String word, int count){ super(); this.word = word; setCount(count); } private String word; }
should be:
private class Token extends Frequency{
private String word;
private Token(String word, int count){
this.word = word;
setCount(count);
}
}
Frequency
The good thing about Frequency is that you are using it. You are correct when you say it is better than the autoboxing system (keeping an Integer in the HashMap). I have to question why you felt it was necessary to add the intermediate class Token though. There is no need for them both. Choose one, and merge the other in to it. For example, there is no reason why you can't add the String value to the Frequency class. As it stands, the logic is a little bit too abstracted.
Map efficiency
The code
for (String string : contentArray) { if(wordMap.containsKey(string)){ wordMap.get(string).incrementFrequency(); }else { Frequency token = new Frequency(); wordMap.put(string,token); } } }
is not doing efficient HashMap lookups. Consider rewriting it like (only one lookup to the map):
for (String string : contentArray) {
Frequency token = wordMap.get(string);
if(token == null) {
wordMap.put(string,new Frequency());
}else {
token.incrementFrequency();
}
}
Other....
- The variable
int maxSofar = 0;
and the maxSoFar parameter is useless, just initialize maxSoFar inside thegetSortedArray
method. - Using a variable name
string
is a bad idea.
-
\$\begingroup\$ Woha.. awesome review. I liked it. Yes the formatting is bad because I did it after pasting and pressing space bar 4 times for most of the lines. Sorry for that. Also why is the other look up in may is better than the previous one ? Also I made two classes because I wanted proper abstraction. String class wasn't needed in map so I created two classes. \$\endgroup\$wayfare– wayfare2014年04月26日 00:32:05 +00:00Commented Apr 26, 2014 at 0:32
Naming
You should give variables and methods proper names:
Consistency
Your Frequency
class contains a getFrequency
method and an incrementFrequency
method, but also a count
variable and a setCount
method.
Either rename count
to frequency
or the other way around (not including the class name of course).
Typos
It is a very awkward "oopsy" moment when you find a typo in your code, even more so if it is a method name which is already in use or overridden - so re-read your code and try to find them before they become harder to replace... (getMostFrequenctWords
should be getMostFrequentWords
)
Convey meaning
Your variable/parameter names should convey the meaning of what do they hold, and how they should be used.
numberOfWords
is maybe technically accurate, but does not help the reader know what it is for. Perhaps maxResults
would do a better job.
getSortedArray
is also a little too generic, and sortWordsByFrequency
might be more appropriate.
addWordsToMap
is also very technical, and I think that Map countWords(String[])
is more readable, and gives a better flow to the calling code.
Don't be lazy
In most cases it is a bad idea to shorten words in variable names, as they make the name less readable - don't use Freq
- say Frequency
.
Don't repeat type in the name
There is no need in repeating the type of a variable in its name - mostFrequentWords
does not reduce anything by omitting the word Arr
from it.
Comments
When you have good naming to your classes, methods and parameters, you will find that most method comments are simply re-iterations of the method signatures. Likewise, when you find that a method/parameter needs extra explanation in a comment, it might be because it has a bad name.
So, String[] getMostFrequentWords(String content, int maxResults)
is very self-descriptive, and doesn't actually need any documentation.
Comments have a nasty habit of rotting - when you change the code, you tend not the maintain the comments, which after awhile will stop describing the code, and sometimes even totally mislead the reader. (see @param wordsArr
, for example)
Encapsulation
Keep internal data of a method inside it - don't expose it in your signature. maxSofar
and wordMap
have no business in the method signatures. One is totally internal, and the other should be instantiated within the method and returned rather than passed by reference.
Declare variables only when they are needed
There is no need to declare variables at the beginning of a method if they are only used at its end.
-
\$\begingroup\$ Very nice review. Thanks a lot Uri Agassi. I never thought it that way you mentioned and gave me a good insight to enhance my way of writing code. A great thanks to you again . \$\endgroup\$wayfare– wayfare2014年04月27日 01:10:08 +00:00Commented Apr 27, 2014 at 1:10