
I have two files.

  1. File 1 contains a list of all the dictionary words.
  2. File 2 contains a list of all the prepositions.

I want to remove all the prepositions from the dictionary. I would also like to reduce the number of lines in my code and make it more elegant, idiomatic, and readable.

 #!/usr/bin/env ruby
 path = "/Users/../Desktop/"
 file_original_wordlist = File.open("#{path}" + "dictionary.txt", "r")
 file_remove_wordlist = File.open("#{path}" + "prepositions.txt", "r")
 # Need to initialize the variables, otherwise I get errors
 delete_word = false
 word_orig = ''
 word_rem = ''
 count = 0
 file_original_wordlist.each_line do |line1|
   file_remove_wordlist.each_line do |line2|
     word_orig = line1
     word_rem = line2
     if word_orig.eql?(word_rem)
       puts "Deleting the word " + word_rem
       delete_word = true
       count += 1
     end
   end
   if delete_word == false
     File.open(path + "scrubbed_list.txt", "a") { |f| f.write(word_orig) }
   end
   # Need to reopen the file, otherwise the next iteration won't start from the beginning
   file_remove_wordlist = File.open("#{path}" + "prepositions.txt", "r")
   delete_word = false
 end
 puts "Deleted " + count.to_s + " words in total"
asked Feb 22, 2014 at 20:50
  • I ran some benchmarks that I reported in my answer. Commented Feb 24, 2014 at 6:26

2 Answers


Some notes:

  • I guess you come from imperative languages. Try to write in a more functional style (more expressions, fewer statements).

  • Use libraries (File) to manipulate paths.

  • This double each_line is bad news for performance: O(n*m). Avoid it by building a data structure that has O(1) checks for inclusion. I'd create a set of the prepositions (it's the smaller set). The overall performance is now O(n).
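To make the complexity claim concrete, here is a back-of-the-envelope count with hypothetical sizes (these numbers are illustrative, not taken from the question's files):

```ruby
# Hypothetical sizes, just to make the complexity difference tangible
n = 100_000  # dictionary words
m = 50       # prepositions

nested_scan = n * m  # each_line inside each_line: one comparison per pair
set_lookup  = n + m  # build the set (m inserts) + one O(1) check per word

nested_scan  #=> 5000000
set_lookup   #=> 100050
```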

I'd write:

prepositions = open(File.join(path, "prepositions.txt")).lines.to_a 
words = open(File.join(path, "dictionary.txt")).lines.to_a
filtered_words = words - prepositions
File.write("dictionary_without_prepositions.txt", filtered_words.join)
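One caveat with the line-based subtraction (my addition, not a claim from the answer): lines only compare equal if their trailing newlines match, so a final word without a newline would survive the diff. Normalizing with chomp sidesteps that:

```ruby
words = ["Now\n", "for\n", "time"]   # note: the last entry has no trailing newline
prepositions = ["for\n", "to\n"]

# chomp strips the line terminator before comparing, so "time" and "time\n"
# would be treated as the same word
filtered = words.map(&:chomp) - prepositions.map(&:chomp)
filtered  #=> ["Now", "time"]
```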

If the input file dictionary.txt is very, very large, here is a lazier approach:

require 'set'

prepositions = open(File.join(path, "prepositions.txt")).lines.to_set
open("dictionary_without_prepositions.txt", "w") do |output|
  open(File.join(path, "dictionary.txt")).lines.each do |line|
    unless prepositions.include?(line)
      output.write(line)
    end
  end
end
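A minimal runnable sketch of that streaming approach, using File.foreach (which reads one line at a time rather than slurping the file) and temp files purely for illustration; the file contents here are made up:

```ruby
require 'set'
require 'tempfile'

# Made-up input files, standing in for dictionary.txt and prepositions.txt
dict = Tempfile.new('dictionary')
dict.write("now\nfor\ntime\nto\n")
dict.close
preps = Tempfile.new('prepositions')
preps.write("for\nto\n")
preps.close

# Build the O(1)-lookup set of prepositions (chomp drops the newlines)
prepositions = File.foreach(preps.path).map(&:chomp).to_set

out = Tempfile.new('scrubbed')
File.foreach(dict.path) do |line|
  out.write(line) unless prepositions.include?(line.chomp)
end
out.close

File.read(out.path)  #=> "now\ntime\n"
```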
answered Feb 23, 2014 at 19:44
  • +1 for the use of to_set for the prepositions - you should expand on why you chose it. Commented Feb 23, 2014 at 20:05
  • @UriAgassi. Done! Commented Feb 23, 2014 at 20:42
  • Great @tokland. Performance is actually O(n+m), though - you need to build the set... Commented Feb 24, 2014 at 4:45
  • @UriAgassi: it's my understanding that O(n+m), m<=n -> O(n) (at worst what would be "O(2n)", but constants are ignored, so O(n)). Commented Feb 24, 2014 at 9:27
  • If you know that m<=n, you are right. More generally, though, O(n+m)==O(max(n,m)). If you have a big preposition file and a small dictionary, the preposition file will be the dominant factor in your complexity. Commented Feb 24, 2014 at 10:01

You can do this in Ruby with very little code. Here's one way:

DICT_FNAME = "#{path}" + "dictionary.txt"
NEW_DICT_FNAME = "#{path}" + "new_dictionary.txt"
PREP_FNAME = "#{path}" + "prepositions.txt"
all_words = File.read(DICT_FNAME).split($/).map(&:strip)
prepositions = File.read(PREP_FNAME).split($/).map(&:strip)
File.write(NEW_DICT_FNAME, (all_words - prepositions).join($/))
puts "#{all_words.size} words in the dictionary"
puts "#{prepositions.size} prepositions to be removed"

Let's try it out. First, write some words to the dictionary file and to the file containing the prepositions:

all = <<_
Now
is
the
time
for
all
Rubyists
to
debug
_
preps = <<_
for
to
_
all #=> "Now\nis\nthe\ntime\nfor\nall\nRubyists\nto\ndebug\n"
preps #=> "for\nto\n"
path = ''
DICT_FNAME = "#{path}" + "dictionary.txt" #=> "dictionary.txt"
PREP_FNAME = "#{path}" + "prepositions.txt" #=> "prepositions.txt"
File.write(DICT_FNAME, all) #=> 42
File.write(PREP_FNAME, preps) #=> 7

Now we read the two input files [$/ is the end-of-line character(s)]:

NEW_DICT_FNAME = "#{path}" + "new_dictionary.txt"
all_words = File.read(DICT_FNAME).split($/).map(&:strip)
 #=> ["Now", "is", "the", "time", "for", "all", "Rubyists", "to", "debug"] 
prepositions = File.read(PREP_FNAME).split($/).map(&:strip)
 #=> ["for", "to"]
puts "#{all_words.size} words in the dictionary"
 #=> 9 words in the dictionary
puts "#{prepositions.size} prepositions to be removed"
 #=> 2 prepositions to be removed

...then remove the elements of the prepositions array from the all_words array:

diff = all_words - prepositions
 #=> ["Now", "is", "the", "time", "all", "Rubyists", "debug"]

...format it for writing:

joined = diff.join($/)
 #=> "Now\nis\nthe\ntime\nall\nRubyists\ndebug"

...write the output file:

File.write(NEW_DICT_FNAME, joined)

...and confirm it was written correctly:

File.read(NEW_DICT_FNAME).split($/).map(&:strip)
 #=> ["Now", "is", "the", "time", "all", "Rubyists", "debug"] 

Edit: In view of @tokland's suggestion of constructing a set of prepositions when processing the words one at a time, I thought it might be interesting to run some benchmarks. You'll see that I just used random arrays and words, rather than reading from and writing to files.

require 'set'

L = Array('a'..'z')

def make_samples(n,m,k,s)
  s.times.with_object([]) do |_,a|
    # Construct a sample of n unique words, each of length k
    all_words = make_sample(n,k)
    # Assume the first m words are prepositions
    preps = all_words[0,m]
    # Shuffle the words (no need to further randomize the prepositions)
    a << [all_words.shuffle, preps]
  end
end

def make_sample(n,k)
  set_words = Set.new
  while set_words.size < n do
    set_words << k.times.with_object('') { |_,w| w << L.sample }
  end
  set_words.to_a
end

Here is an example of test data with 8 4-character words, of which 3 are prepositions, and a sample size of 2.

make_samples(8,3,4,2)
 #=> [[["fexz", "gxrv", "acte", "namz", "cpqw", "txsm", "zonm", "tvjz"],
 # ["zonm", "gxrv", "fexz"]],
 # [["nfdf", "djnv", "inqk", "tbgc", "asfb", "nqbg", "dnyb", "ywtv"],
 # ["djnv", "tbgc", "inqk"]]]

These are the parameters I used for the results I report below:

require 'benchmark'

n = 200_000 # number of words, including prepositions
m = 40      # number of prepositions
k = 8       # length of each string
s = 20      # sample size

samples = make_samples(n,m,k,s)
Benchmark.bm('reject - arr'.size) do |bm|
  # words array - preps array
  bm.report('arr - arr') do
    samples.each { |(wa,pa)| wa - pa }
  end
  # reject words in preps array
  bm.report('reject - arr') do
    samples.each { |(wa,pa)| wa.reject { |w| pa.include? w } }
  end
  # reject words in preps set
  bm.report('reject - set') do
    samples.each { |(wa,pa)| ps = pa.to_set; wa.reject { |w| ps.include? w } }
  end
end
 user system total real
arr - arr 0.860000 0.030000 0.890000 ( 0.884945)
reject - arr 10.180000 0.020000 10.200000 ( 10.217233)
reject - set 1.830000 0.040000 1.870000 ( 1.913069)

I ran a few additional tests with different parameters, but these results are indicative.
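As a side note on why the set helps at all (a generic illustration of my own, not one of the benchmarks above): Array#include? scans the array linearly, while Set#include? is a single hash lookup, so membership tests against a large collection are where the nested-scan cost comes from.

```ruby
require 'set'
require 'benchmark'

words = (1..50_000).map { |i| "word#{i}" }
arr = words.dup
set = words.to_set

# Looking up a word near the end of the array forces a near-full linear scan;
# the set answers the same question with one hash probe.
Benchmark.bm(5) do |bm|
  bm.report('array') { 1_000.times { arr.include?('word49999') } }
  bm.report('set')   { 1_000.times { set.include?('word49999') } }
end
```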

answered Feb 23, 2014 at 1:32
  • This would be my first choice if both files fit into memory. Commented Mar 3, 2014 at 0:36
