A Ruby string analyser

Question 1

I wrote this script for a code challenge on TreeHouse. I already submitted my code, but I am looking for an after the fact feedback.

The challenge was to open a file of text, remove specific words (provided in another file) and do the following:

Output to the screen in a readable fashion
1. total word count after filtering
2. highest occurring word
3. longest word(s) and its / their length
4. sort the list from most occurring to least occurring then output that data to the screen as an unordered list
5. Bonus: Return the word list in JSON.

I decided to split the solution into two classes:

One class to sanitise the string into words only
Another class to do the analysis

This is the sanitiser class. Given text it should return a string of words only. It should remove anything that is not a letter or an apostrophe, and compress all white space into a single space character.

I accept that this has problems. For example, I found an instance of an apostrophe as a word. Perhaps it was a leading or a trailing one. But defining the boundaries of a word is a tricky proposition.

class WordsOnlySanitizer
 # Allow for diacritics, hence p{Alpha} and not \w
 # We should not split words on apostrophes either
 WORDS_ONLY_REGEX = /[^\p{Alpha}']/i
 # We want to reduce all white space into a single space
 SPACE_ONLY_REGEX = /\s+/
 def self.to_words(text)
 text.gsub(WORDS_ONLY_REGEX, ' ').gsub(SPACE_ONLY_REGEX, ' ')
 end
end

This is the analyser class. It's self explanatory. There is duplication in the longest_words and highest_occuring_words methods. But I'm not sure how to remove this duplication without making the code less readable.

The html_list method also looks a little suspect, but I can't tell why.

require 'json'
class Analyser
 def initialize(text, filter)
 @words = text.split
 @filter = filter.split
 end
 def word_count
 filtered_words.size
 end
 def word_occurrences
 @word_occurrences ||= filtered_words.inject(Hash.new(0)) do |result, word|
 result[word] += 1
 result
 end
 end
 def highest_occurring_words
 word_occurrences.group_by { |key, value| value }.max_by { |key, value| key }.last
 end
 def longest_words
 filtered_words.inject({}) do |result, word|
 result[word] = word.length
 result
 end.group_by { |key, value| value }.max_by { |key, value| key }.last
 end
 def html_list
 list = ""
 word_occurrences.sort_by { |key, value| value }.reverse.each do |key, value|
 list << " <li>#{key}: #{value}</li>\n"
 end
 "<ul>\n" + list + "</ul>"
 end
 def json_list
 JSON.parse(word_occurrences.to_json)
 end
private
 def filtered_words
 @filtered_words ||= @words.reject do |word|
 # Downcase so that Hello and hello count as two occurrences
 word.downcase!
 @filter.include?(word)
 end
 end
end

Usage

Here's how you would use this:

text = WordsOnlySanitizer.to_words(File.read('words.txt'))
filter = WordsOnlySanitizer.to_words(File.read('filter_words.txt'))
analyser = Analyser.new(text, filter)
puts "Word count after filtering is: #{analyser.word_count}"
puts "\n"
puts "The most frequent words are:"
analyser.highest_occurring_words.each do |key, value|
 puts " - #{key}: #{value} occurences"
end
puts "\n"
puts "The longest words are:"
analyser.longest_words.each do |word|
 puts " - #{word.first}: #{word.last} characters"
end
puts "\n"
puts "Word list:"
puts analyser.html_list
puts "JSON object:"
puts analyser.json_list

Here's a gist with all the files. Warning: There are large text files.

Question 2

Just curious, what is the problem with defining word boundaries by simple whitespace?

Question 3

@Jonah Are you referring to my WORDS_ONLY_REGEX? If so, it's because things like "!' and i. and ii. get included as words, when they are not. My intention was to clean up anything that's not alpha, compress all white space into one space, then split into an array. So in a sense, I am doing that. Perhaps I don't understand your question fully, though.

Question 4

Got it, that makes sense. However \p{Alpha} would still include stuff i. and ii., grabbing them without the period, correct?

Question 5

@Jonah yes, and I suppose you can consider that a flaw. This is why I mentioned that defining word boundaries is tricky. My regex also does not account for non-word-apostrophes. So any ' character is not removed, regardless whether it is an apostrophe or not.

Question 6

@Jonah here's a Rubular to illustrate this behaviour. rubular.com/r/BpTzjvo2GU

Question 7

The first thing I noticed is that you've defined methods such as word_occurrences, which do nothing but set instance variables (@word_occurrences, etc.). This is a bit redundant, and it would be better to do one of two things:

Define all of your instance variables under the initialize method. or
Take out the @word_occurrences ||= part of the method definition.

I think from a performance standpoint, it might be better to take the first approach, since you don't want to have to calculate word_occurrences every time you call it as a method. I also noticed that you only use @words and @filter once, and that's to define filtered_words, so you can make things simple by just defining @words to be what filtered_words is currently. See below:

def initialize(text, filter)
 @words = text.split.reject {|word| filter.split.include? word.downcase}
 @occurrences = @words.each_with_object(Hash.new(0)) {|word, result| result[word] += 1}
 @lengths = @words.each_with_object({}) {|word, result| result[word] ||= word.length}
end

I've made @words the new filtered_words, shortened word_occurrences to just @occurrences (and made it an instance variable so that we're only evaluating it once, upon initialization of the instance), and added another instance variable @lengths, which represents a part of your longest_words function, a hash of each word and how long it is.

Also, notice that I simplified your use of inject to create a hash of word occurrences and word lengths, by using the handy each_with_object function instead.

I also changed = to ||= for the part where you're building up a hash of word lengths. I believe this will be better for performance, as you're not re-calculating the word length for words you already have.

You mentioned duplication between longest_words and highest_occurring_words -- you could solve this by pulling out the part you reused into a private function, which I would call highest_ranking:

private
def highest_ranking(entries)
 # Takes a hashmap of the form {"foo" => 100, "bar" => 45} and returns an array
 # containing the entries (array-ified) with the highest number as a value.
 entries.group_by{|word, occs| occs}.sort.last.last
end

I simplified max_by{|key, value| key}.last to just sort.last.last -- it turns out that Enumerable#sort knows how to properly handle sorting by keys, so getting the highest one is just a matter of sorting the hash and grabbing the last item in the resulting array. (The extra last is there for the same reason the last is there in your function -- to drop the number of occurrences and just return the key/value entries in the result array)

Now that you have this useful helper function, finding the highest occurring words and longest words is simple:

public
def highest_occurring_words
 highest_ranking @occurrences
end
def longest_words
 highest_ranking @lengths
end

Finally, I would refactor your html_list function like this:

def html_list
 list = word_occurrences.sort_by {|word, occs| occs}.reverse.map do |word, occs|
 " <li>#{word}: #{occs}</li>"
 end.join("\n")
 "<ul>\n" + list + "\n</ul>"
end

Instead of starting with list as "" and iterating through each word/occurrences pair to add onto it a little at a time, you can take a functional approach and use map to turn each word/occurrences pair into a string like " <li>antibubbles: 4</li>", then use join to turn the resulting array into one long string with "\n" inserted in between each list item. I also inserted two spaces before each list item so that they appear indented in the HTML. :)

Question 8

Thank you for the insightful answer. At further glance, do you think that the initialize method is doing too much? How would you test something like that? I feels hard to test, to me at least.

Question 9

I've bundled this into a gem, taking into account some of your improvements. github.com/abitdodgy/words_counted Thanks for your input.

Question 10

Your initialize method should set up your instance variables -- in this case, @words, @lengths and @occurrences -- and very little else, if anything. A lot of the time you're just passing in arguments, like @name = name; @age = age, etc. I think this is a good example of a case where you can do something more complex within the initialize method, while still keeping your code concise and readable.

Question 11

Happy to help! :)

Dave Yarwood Dave Yarwood 1,1116 silver badges13 bronze badges · Accepted Answer · 2014-04-17 19:53:57Z

The first thing I noticed is that you've defined methods such as word_occurrences, which do nothing but set instance variables (@word_occurrences, etc.). This is a bit redundant, and it would be better to do one of two things:

Define all of your instance variables under the initialize method. or
Take out the @word_occurrences ||= part of the method definition.

I think from a performance standpoint, it might be better to take the first approach, since you don't want to have to calculate word_occurrences every time you call it as a method. I also noticed that you only use @words and @filter once, and that's to define filtered_words, so you can make things simple by just defining @words to be what filtered_words is currently. See below:

def initialize(text, filter)
 @words = text.split.reject {|word| filter.split.include? word.downcase}
 @occurrences = @words.each_with_object(Hash.new(0)) {|word, result| result[word] += 1}
 @lengths = @words.each_with_object({}) {|word, result| result[word] ||= word.length}
end

I've made @words the new filtered_words, shortened word_occurrences to just @occurrences (and made it an instance variable so that we're only evaluating it once, upon initialization of the instance), and added another instance variable @lengths, which represents a part of your longest_words function, a hash of each word and how long it is.

Also, notice that I simplified your use of inject to create a hash of word occurrences and word lengths, by using the handy each_with_object function instead.

I also changed = to ||= for the part where you're building up a hash of word lengths. I believe this will be better for performance, as you're not re-calculating the word length for words you already have.

You mentioned duplication between longest_words and highest_occurring_words -- you could solve this by pulling out the part you reused into a private function, which I would call highest_ranking:

private
def highest_ranking(entries)
 # Takes a hashmap of the form {"foo" => 100, "bar" => 45} and returns an array
 # containing the entries (array-ified) with the highest number as a value.
 entries.group_by{|word, occs| occs}.sort.last.last
end

I simplified max_by{|key, value| key}.last to just sort.last.last -- it turns out that Enumerable#sort knows how to properly handle sorting by keys, so getting the highest one is just a matter of sorting the hash and grabbing the last item in the resulting array. (The extra last is there for the same reason the last is there in your function -- to drop the number of occurrences and just return the key/value entries in the result array)

Now that you have this useful helper function, finding the highest occurring words and longest words is simple:

public
def highest_occurring_words
 highest_ranking @occurrences
end
def longest_words
 highest_ranking @lengths
end

Finally, I would refactor your html_list function like this:

def html_list
 list = word_occurrences.sort_by {|word, occs| occs}.reverse.map do |word, occs|
 " <li>#{word}: #{occs}</li>"
 end.join("\n")
 "<ul>\n" + list + "\n</ul>"
end

Instead of starting with list as "" and iterating through each word/occurrences pair to add onto it a little at a time, you can take a functional approach and use map to turn each word/occurrences pair into a string like " <li>antibubbles: 4</li>", then use join to turn the resulting array into one long string with "\n" inserted in between each list item. I also inserted two spaces before each list item so that they appear indented in the HTML. :)

Thank you for the insightful answer. At further glance, do you think that the initialize method is doing too much? How would you test something like that? I feels hard to test, to me at least.
I've bundled this into a gem, taking into account some of your improvements. github.com/abitdodgy/words_counted Thanks for your input.
Your initialize method should set up your instance variables -- in this case, @words, @lengths and @occurrences -- and very little else, if anything. A lot of the time you're just passing in arguments, like @name = name; @age = age, etc. I think this is a good example of a case where you can do something more complex within the initialize method, while still keeping your code concise and readable.

Stack Exchange Network

A Ruby string analyser

Usage

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

A Ruby string analyser

Usage

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions