I am trying to come up with a better solution to the following problem:
Statement
You are supposed to write a function which takes the contents of a document in string format and also a whole number, n, providing how many results to return. (So that's two arguments for the function.) The function should return the n most frequently occurring words in descending order. The algorithm should be \$O(N)\,ドル not \$O(N \cdot N)\$.
My Current Solution
function freq(str, n) {
var results = [];
var strArray = str.split(/\s+/), countObj = {}, wordArray = [], wordObj = {};
strArray.forEach(function(el) {
countObj[el] ? countObj[el] += 1 : countObj[el] = 1;
});
for(var el in countObj) {
wordObj[countObj[el]] ? wordObj[countObj[el]].push(el) : wordObj[countObj[el]] = [el];
}
for(var el in wordObj){
wordArray.push(wordObj[el].join(' '));
}
for(var i = 0; i < n; i++) {
results.push(wordArray[wordArray.length - 1 - i]);
}
return results.join('\n');
};
Time Complexity
The solution seems to be \$O(N)\$. Are there any improvements I could make?
Best Practices
I am using ternary operators a lot for the if-else construct. Am I overusing them?
General
Any suggestions regarding how I can make this solution 'production grade' quality?
4 Answers 4
First of all, there's a bug: You're relying on the properties in wordObj
being iterated in ascending order when creating wordArray
. I.e. wordObj['1']
is handled before wordObj['3']
and so on. This behavior is not guaranteed!
Objects are unordered in JS. Some engines do order them, but you cannot rely on it. Even engines that do order properties are under no obligation to sort them in any particular way. They might be in descending order, or sorted lexicographically (which would make sense since keys are strings, but it'd sort 10
before 2
), or something else entirely.
Secondly, it just seems like you're working too hard here.
If there's a "tie" between to words (say "foo" and "bar" both appear 3 times), you return all of them on one line. The problem description does not explicitly require you to handle ties in any particular way. However, your solution basically means that you're not actually returning N words. A line of words separated by spaces is not "a word", but if it was, there'd be more than N words in total.
So I'd simply leave that out.
Besides, it's simpler to use reduce
, sort
and slice
. You can also use a ||
"failover" instead of ternaries.
function freq(str, n) {
var words = str.split(/\s+/);
// Count occurrences
var counts = words.reduce(function (memo, word) {
memo[word] = (memo[word] || 0) + 1;
return memo;
}, {});
// Sort by count (descending)
var sorted = Object.keys(counts).sort(function (a, b) {
return counts[b] - counts[a];
});
// Return the first n words, separated by linebreaks
return sorted.slice(0, n).join("\n");
}
Edit: As rolfl points out, the input should probably be trimmed before anything else, or you might get empty strings in your words array. However, to avoid empty strings being counted as words, without changing the code above too much, you can do this:
var counts = words.reduce(function (memo, word) {
if(word) { // ignore empty strings
memo[word] = (memo[word] || 0) + 1;
}
return memo;
}, {});
Edit: As was pointed out, the above isn't actually O(n) due to its use of sort
. Bummer. This one should be closer, although not as elegant. It's closer to the original code in some ways, but still avoids assuming that object properties are ordered.
function freq(str, n) {
var words = str.split(/\s+/);
// Count words (same as before)
var wordCounts = words.reduce(function (memo, word) {
if(word) {
memo[word] = (memo[word] || 0) + 1;
}
return memo;
}, {});
// Group words into nested arrays that are, in turn,
// indexed by the word count. I.e. wordsIndexedByCount[3]
// will be a nested array of all the words that occurred
// 3 times.
// Also keep track of the highest index we use (maxCount).
var wordsIndexedByCount = [],
maxCount = 0;
for(var word in wordCounts) {
var count = wordCounts[word];
maxCount = count > maxCount ? count : maxCount;
wordsIndexedByCount[count] = wordsIndexedByCount[count] || [];
wordsIndexedByCount[count].push(word);
}
// Flatten the array from above, starting with the most
// often occurring word(s) - i.e. the highest index.
var sorted = [];
for(var i = maxCount; i > 0; i--) {
// Skip "holes" in the array (not that it's necessary; the
// code works without this check, since push.apply ignores
// an undefined argument)
if(!wordsIndexedByCount[i]) {
continue;
}
[].push.apply(sorted, wordsIndexedByCount[i]);
if(sorted.length >= n) {
break; // if we've got enough to return, there's no reason to keep going
}
}
return sorted.slice(0, n).join("\n");
}
-
3\$\begingroup\$ Doesn't the sort() make it O(n log n) though? \$\endgroup\$oals– oals2014年11月02日 21:04:50 +00:00Commented Nov 2, 2014 at 21:04
-
\$\begingroup\$ Would the time complexity still be O(N) after using the sort function? \$\endgroup\$ng-hacker-319– ng-hacker-3192014年11月02日 21:06:16 +00:00Commented Nov 2, 2014 at 21:06
-
\$\begingroup\$ @ng-hacker-319 As oals points out in the comment before yours: Probably not, no. However, given that objects are unordered (and they are - no way around that), sorting seems like the best option. But yes, we'd be getting away from O(n). I might try coming up with a solution that doesn't use
sort
, just for fun. \$\endgroup\$Flambino– Flambino2014年11月02日 21:23:17 +00:00Commented Nov 2, 2014 at 21:23 -
\$\begingroup\$ @ng-hacker-319 Added alternate code \$\endgroup\$Flambino– Flambino2014年11月02日 21:51:43 +00:00Commented Nov 2, 2014 at 21:51
I am using ternary operators a lot for the if-else construct. Am I overusing them?
No, you're not overusing them. You're abusing them.
Ternary operators are not a 'short-hand if', as you seem to believe they are.
You're using them as
expression ? statement : statement
Their correct use is
lvalue = expression ? expression : expression
If you're using a ternary expression without an assignment operator (or function call, or return
) on their left side, it's outright bad style. It's even more wrong to call a data-modifying function such as push()
inside a ternary operator.
// good style
var status = (age >= 18) ? "adult" : "minor";
// awful style
(age >= 18) ? status = "adult" : status = "minor";
// you write
countObj[el] ? countObj[el] += 1 : countObj[el] = 1;
// better
countObj[el] = countObj[el] || 0;
countObj[el]++;
// you write
wordObj[countObj[el]] ? wordObj[countObj[el]].push(el) : wordObj[countObj[el]] = [el];
// better
word = countObj[el];
wordObj[word] = wordObj[word] || [];
wordObj[word].push(el)
-
\$\begingroup\$ I realise JS lacks an
||=
operator for some reason; I hope the intent is still clear. \$\endgroup\$oals– oals2014年11月02日 18:06:22 +00:00Commented Nov 2, 2014 at 18:06 -
1\$\begingroup\$ You can just edit your answer. Intent may be clear if you already know the correct syntax; may not be so clear otherwise. \$\endgroup\$Flambino– Flambino2014年11月02日 19:03:02 +00:00Commented Nov 2, 2014 at 19:03
Your var
declaration should put each variable on its own line, and there's no need for the separate declaration for results
:
var results = [],
strArray = str.split(/\s+/),
countObj = {},
wordArray = [],
wordObj = {};
These variables mostly have misleading names. There is no need to add the data type to the variable name. strArray
could be inputWords
, countObj
could be wordCounts
, wordObj
could be wordsByCount
and wordArray
could be wordsInCountOrder
.
Names like I suggest make the code self-documenting.
function freq(str, n) {
var results = [],
inputWords = str.trim().split(/\s+/),
wordCounts = {},
wordsByCount = {},
wordsInOrder = [];
inputWords.forEach(function(word) {
wordCounts[word] ? wordCounts[word] += 1 : wordCounts[word] = 1;
});
for (var word in wordCounts) {
wordsByCount[wordCounts[word]] ? wordsByCount[wordCounts[word]].push(word) : wordsByCount[wordCounts[word]] = [word];
}
for (var count in wordsByCount) {
wordsInOrder.push(wordsByCount[count]);
}
for (var i = wordsInOrder.length - 1; i >= 0; i--) {
results = results.concat(wordsInOrder[i]);
if (results.length >= n) {
break;
}
}
return results;
};
function textUpdated(source) {
var count = parseInt(document.getElementById("topcount").value)
document.getElementById("outwords").value = freq(source.value, count).join(" -> ");
document.getElementById("countrep").value = count;
document.getElementById("last").value = "text";
}
function countUpdated(source) {
var data = document.getElementById("data").value
document.getElementById("outwords").value = freq(data.value, parseInt(source.value)).join(" -> ");
document.getElementById("countrep").value = source.value;
document.getElementById("last").value = "count";
}
Enter text:
<input type="Text" name="Input" id="data" size=100 onInput="textUpdated(this);" />
<p>
<input id="topcount" type="number" min="1" max="30" step="1" value="5" onInput="countUpdated(this);" />
<hr>Common Words:
<input type="Text" id="outwords" name="OutWord" value="none" size=100 rows="10" cols="30" />
<br>Count:
<input type="Text" id="countrep" name="OutCount" value="none" />Last:
<input type="Text" id="last" name="OutLast" value="none" />
<input type="Text" id="debug" name="Debug" value="none" size=100 rows="10" cols="100" />
There are a few other items here to note.
- Firstly, I trim the input string. Spaces at the beginning or end of the string introduce the empty-string word otherwise.
- I don't list same-count words as one string with spaces joining them. Instead, I return the op N words or more, if there's a tie for last place.
As for the complexity, yes, you are right, the complexity is \$O(n)\$
-
\$\begingroup\$ Love this answer, it pushed me to build my own ;) However, you might return more than
n
entries. For example with eerste eerste eerste derde derde derde zesde zesde zesde 12 12 12 23 23 23 45 45 45 and n = 5 \$\endgroup\$konijn– konijn2014年11月04日 19:32:12 +00:00Commented Nov 4, 2014 at 19:32 -
\$\begingroup\$ @konijn - Yeah, I was aware of that, I return the top N, or more if there's a tie for last place (I say so in the answer). I could not think of a good way to make the cut-off 'fair' otherwise. \$\endgroup\$rolfl– rolfl2014年11月04日 19:57:11 +00:00Commented Nov 4, 2014 at 19:57
-
\$\begingroup\$
return results.slice(0,n);
You are also counting on ordered objects infor (var count in wordsByCount) {
\$\endgroup\$konijn– konijn2014年11月04日 19:58:11 +00:00Commented Nov 4, 2014 at 19:58
A late answer,
I want to start with comparing your two requirements:
- 'production grade' quality
- should be \$O(n)\$
The first solution of @Flambino is so neat, that the volume of data would have to be humongous before writing an \$O(n)\$ solution. In fact, I am not even sure that the \$O(n)\$ solution is guaranteed to be faster with large datasets.
I wanted to write my own version, which really has 2 major differences:
- It doesnt require tracking the most common words
- It has an eye stabbing syntax shortcut for setting
result
, you should really use anif
with curly braces there ;)
function findFrequentWords(text, n) {
//Variables
var words = text.trim().split(/\s+/),
frequencies = {},
orderedFrequencies = [],
word, frequency, result = [];
//Derive word frequencies, key is word
words.reduce( updateFrequency, frequencies );
//Convert word frequencies to an array, key is the occurence count
for( word in frequencies ){
frequency = frequencies[word];
( orderedFrequencies[frequency] = orderedFrequencies[frequency] || [] ).push( word );
}
//Keep popping most common words until n is met
while( result.length < n && orderedFrequencies.length ){
//Only assign actual words, not `undefined` which can happen due to JS filling up holes
( words = orderedFrequencies.pop() ) && ( result = result.concat( words ) );
}
return result.slice(0,n);
}
//Helper function for words frequencies determination
function updateFrequency( frequencies, word ){
frequencies[ word ] = (frequencies[ word ] || 0) + 1;
return frequencies;
}
var text = document.getElementById("text").value;
document.getElementById("output").innerText = findFrequentWords( text, 5 );
textarea { width:100%; height: 30% }
n: 5
Common Words:
<span id="output"></span>
<textarea id="text" rows="15">
Not like the brazen giant of Greek fame,
With conquering limbs astride from land to land;
Here at our sea-washed, sunset gates shall stand
A mighty woman with a torch, whose flame
Is the imprisoned lightning, and her name
Mother of Exiles. From her beacon-hand
Glows world-wide welcome; her mild eyes command
The air-bridged harbor that twin cities frame.
"Keep, ancient lands, your storied pomp!" cries she
With silent lips. "Give me your tired, your poor,
Your huddled masses yearning to breathe free,
The wretched refuse of your teeming shore.
Send these, the homeless, tempest-tost to me,
I lift my lamp beside the golden door!"
</textarea><br>
Explore related questions
See similar questions with these tags.