I wrote this script for a code challenge on TreeHouse. I already submitted my code, but I am looking for an after the fact feedback.
The challenge was to open a file of text, remove specific words (provided in another file) and do the following:
Output to the screen in a readable fashion
1. total word count after filtering
2. highest occurring word
3. longest word(s) and its / their length
4. sort the list from most occurring to least occurring then output that data to the screen as an unordered list
5. Bonus: Return the word list in JSON.
I decided to split the solution into two classes:
- One class to sanitise the string into words only
- Another class to do the analysis
This is the sanitiser class. Given text it should return a string of words only. It should remove anything that is not a letter or an apostrophe, and compress all white space into a single space character.
I accept that this has problems. For example, I found an instance of an apostrophe as a word. Perhaps it was a leading or a trailing one. But defining the boundaries of a word is a tricky proposition.
class WordsOnlySanitizer
# Allow for diacritics, hence p{Alpha} and not \w
# We should not split words on apostrophes either
WORDS_ONLY_REGEX = /[^\p{Alpha}']/i
# We want to reduce all white space into a single space
SPACE_ONLY_REGEX = /\s+/
def self.to_words(text)
text.gsub(WORDS_ONLY_REGEX, ' ').gsub(SPACE_ONLY_REGEX, ' ')
end
end
This is the analyser class. It's self explanatory. There is duplication in the longest_words
and highest_occuring_words
methods. But I'm not sure how to remove this duplication without making the code less readable.
The html_list
method also looks a little suspect, but I can't tell why.
require 'json'
class Analyser
def initialize(text, filter)
@words = text.split
@filter = filter.split
end
def word_count
filtered_words.size
end
def word_occurrences
@word_occurrences ||= filtered_words.inject(Hash.new(0)) do |result, word|
result[word] += 1
result
end
end
def highest_occurring_words
word_occurrences.group_by { |key, value| value }.max_by { |key, value| key }.last
end
def longest_words
filtered_words.inject({}) do |result, word|
result[word] = word.length
result
end.group_by { |key, value| value }.max_by { |key, value| key }.last
end
def html_list
list = ""
word_occurrences.sort_by { |key, value| value }.reverse.each do |key, value|
list << " <li>#{key}: #{value}</li>\n"
end
"<ul>\n" + list + "</ul>"
end
def json_list
JSON.parse(word_occurrences.to_json)
end
private
def filtered_words
@filtered_words ||= @words.reject do |word|
# Downcase so that Hello and hello count as two occurrences
word.downcase!
@filter.include?(word)
end
end
end
Usage
Here's how you would use this:
text = WordsOnlySanitizer.to_words(File.read('words.txt'))
filter = WordsOnlySanitizer.to_words(File.read('filter_words.txt'))
analyser = Analyser.new(text, filter)
puts "Word count after filtering is: #{analyser.word_count}"
puts "\n"
puts "The most frequent words are:"
analyser.highest_occurring_words.each do |key, value|
puts " - #{key}: #{value} occurences"
end
puts "\n"
puts "The longest words are:"
analyser.longest_words.each do |word|
puts " - #{word.first}: #{word.last} characters"
end
puts "\n"
puts "Word list:"
puts analyser.html_list
puts "JSON object:"
puts analyser.json_list
Here's a gist with all the files. Warning: There are large text files.
1 Answer 1
The first thing I noticed is that you've defined methods such as word_occurrences
, which do nothing but set instance variables (@word_occurrences
, etc.). This is a bit redundant, and it would be better to do one of two things:
- Define all of your instance variables under the
initialize
method. or - Take out the
@word_occurrences ||=
part of the method definition.
I think from a performance standpoint, it might be better to take the first approach, since you don't want to have to calculate word_occurrences
every time you call it as a method. I also noticed that you only use @words
and @filter
once, and that's to define filtered_words
, so you can make things simple by just defining @words
to be what filtered_words
is currently. See below:
def initialize(text, filter)
@words = text.split.reject {|word| filter.split.include? word.downcase}
@occurrences = @words.each_with_object(Hash.new(0)) {|word, result| result[word] += 1}
@lengths = @words.each_with_object({}) {|word, result| result[word] ||= word.length}
end
I've made @words
the new filtered_words
, shortened word_occurrences
to just @occurrences
(and made it an instance variable so that we're only evaluating it once, upon initialization of the instance), and added another instance variable @lengths
, which represents a part of your longest_words
function, a hash of each word and how long it is.
Also, notice that I simplified your use of inject
to create a hash of word occurrences and word lengths, by using the handy each_with_object
function instead.
I also changed =
to ||=
for the part where you're building up a hash of word lengths. I believe this will be better for performance, as you're not re-calculating the word length for words you already have.
You mentioned duplication between longest_words
and highest_occurring_words
-- you could solve this by pulling out the part you reused into a private function, which I would call highest_ranking
:
private
def highest_ranking(entries)
# Takes a hashmap of the form {"foo" => 100, "bar" => 45} and returns an array
# containing the entries (array-ified) with the highest number as a value.
entries.group_by{|word, occs| occs}.sort.last.last
end
I simplified max_by{|key, value| key}.last
to just sort.last.last
-- it turns out that Enumerable#sort
knows how to properly handle sorting by keys, so getting the highest one is just a matter of sorting the hash and grabbing the last item in the resulting array. (The extra last
is there for the same reason the last
is there in your function -- to drop the number of occurrences and just return the key/value entries in the result array)
Now that you have this useful helper function, finding the highest occurring words and longest words is simple:
public
def highest_occurring_words
highest_ranking @occurrences
end
def longest_words
highest_ranking @lengths
end
Finally, I would refactor your html_list
function like this:
def html_list
list = word_occurrences.sort_by {|word, occs| occs}.reverse.map do |word, occs|
" <li>#{word}: #{occs}</li>"
end.join("\n")
"<ul>\n" + list + "\n</ul>"
end
Instead of starting with list as ""
and iterating through each word/occurrences pair to add onto it a little at a time, you can take a functional approach and use map
to turn each word/occurrences pair into a string like " <li>antibubbles: 4</li>"
, then use join
to turn the resulting array into one long string with "\n" inserted in between each list item. I also inserted two spaces before each list item so that they appear indented in the HTML. :)
-
\$\begingroup\$ Thank you for the insightful answer. At further glance, do you think that the
initialize
method is doing too much? How would you test something like that? I feels hard to test, to me at least. \$\endgroup\$Mohamad– Mohamad2014年04月29日 03:10:38 +00:00Commented Apr 29, 2014 at 3:10 -
\$\begingroup\$ I've bundled this into a gem, taking into account some of your improvements. github.com/abitdodgy/words_counted Thanks for your input. \$\endgroup\$Mohamad– Mohamad2014年04月30日 15:34:14 +00:00Commented Apr 30, 2014 at 15:34
-
\$\begingroup\$ Your
initialize
method should set up your instance variables -- in this case,@words
,@lengths
and@occurrences
-- and very little else, if anything. A lot of the time you're just passing in arguments, like@name = name; @age = age
, etc. I think this is a good example of a case where you can do something more complex within theinitialize
method, while still keeping your code concise and readable. \$\endgroup\$Dave Yarwood– Dave Yarwood2014年04月30日 15:34:50 +00:00Commented Apr 30, 2014 at 15:34 -
\$\begingroup\$ Happy to help! :) \$\endgroup\$Dave Yarwood– Dave Yarwood2014年04月30日 15:35:13 +00:00Commented Apr 30, 2014 at 15:35
WORDS_ONLY_REGEX
? If so, it's because things like"!'
andi.
andii.
get included as words, when they are not. My intention was to clean up anything that's not alpha, compress all white space into one space, then split into an array. So in a sense, I am doing that. Perhaps I don't understand your question fully, though. \$\endgroup\$\p{Alpha}
would still include stuffi.
andii.
, grabbing them without the period, correct? \$\endgroup\$'
character is not removed, regardless whether it is an apostrophe or not. \$\endgroup\$