
I have two files.

  1. File 1 contains a list of all the dictionary words.
  2. File 2 contains a list of all the prepositions.

I want to remove all the prepositions from the dictionary. I would also like to reduce the number of lines in my code and make it more elegant, idiomatic, and readable.

 #!/usr/bin/env ruby
 path = "/Users/../Desktop/"
 file_original_wordlist = File.open("#{path}" + "dictionary.txt", "r")
 file_remove_wordlist = File.open("#{path}" + "prepositions.txt", "r")
 # Need to initialize the variables, otherwise I get errors
 delete_word = false
 word_orig = ''
 word_rem = ''
 count = 0
 file_original_wordlist.each_line do |line1|
   file_remove_wordlist.each_line do |line2|
     word_orig = line1
     word_rem = line2
     if word_orig.eql?(word_rem)
       puts "Deleting the word " + word_rem
       delete_word = true
       count += 1
     end
   end
   if delete_word == false
     File.open(path + "scrubbed_list.txt", "a") { |f| f.write(word_orig) }
   end
   # Need to reopen the file, otherwise the next iteration won't start from the beginning
   file_remove_wordlist = File.open("#{path}" + "prepositions.txt", "r")
   delete_word = false
 end
 puts "Deleted " + count.to_s + " words in total"
asked Feb 22, 2014 at 20:50
  • I ran some benchmarks that I reported in my answer. Commented Feb 24, 2014 at 6:26

2 Answers


Some notes:

  • I guess you come from imperative languages. Try to write in a more functional style (more expressions, fewer statements).

  • Use libraries (File) to manipulate paths.

  • This double each_line is bad news for performance: O(n*m). Avoid it by building a data structure that has O(1) checks for inclusion. I'd create a set of the prepositions (it's the smaller set). The overall performance is now O(n).
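To make the complexity claim concrete, here is a back-of-the-envelope count with hypothetical sizes (these numbers are illustrative, not taken from the question's files):

```ruby
# Hypothetical sizes, just to make the complexity difference tangible
n = 100_000  # dictionary words
m = 50       # prepositions

nested_scan = n * m  # each_line inside each_line: one comparison per pair
set_lookup  = n + m  # build the set (m inserts) + one O(1) check per word

nested_scan  #=> 5000000
set_lookup   #=> 100050
```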

I'd write:

prepositions = open(File.join(path, "prepositions.txt")).lines.to_a 
words = open(File.join(path, "dictionary.txt")).lines.to_a
filtered_words = words - prepositions
File.write("dictionary_without_prepositions.txt", filtered_words.join)
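One caveat with the line-based subtraction (my addition, not a claim from the answer): lines only compare equal if their trailing newlines match, so a final word without a newline would survive the diff. Normalizing with chomp sidesteps that:

```ruby
words = ["Now\n", "for\n", "time"]   # note: the last entry has no trailing newline
prepositions = ["for\n", "to\n"]

# chomp strips the line terminator before comparing, so "time" and "time\n"
# would be treated as the same word
filtered = words.map(&:chomp) - prepositions.map(&:chomp)
filtered  #=> ["Now", "time"]
```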

If the input file dictionary.txt is very, very large, here is a lazier approach:

require 'set'

prepositions = open(File.join(path, "prepositions.txt")).lines.to_set
open("dictionary_without_prepositions.txt", "w") do |output|
  open(File.join(path, "dictionary.txt")).lines.each do |line|
    unless prepositions.include?(line)
      output.write(line)
    end
  end
end
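A minimal runnable sketch of that streaming approach, using File.foreach (which reads one line at a time rather than slurping the file) and temp files purely for illustration; the file contents here are made up:

```ruby
require 'set'
require 'tempfile'

# Made-up input files, standing in for dictionary.txt and prepositions.txt
dict = Tempfile.new('dictionary')
dict.write("now\nfor\ntime\nto\n")
dict.close
preps = Tempfile.new('prepositions')
preps.write("for\nto\n")
preps.close

# Build the O(1)-lookup set of prepositions (chomp drops the newlines)
prepositions = File.foreach(preps.path).map(&:chomp).to_set

out = Tempfile.new('scrubbed')
File.foreach(dict.path) do |line|
  out.write(line) unless prepositions.include?(line.chomp)
end
out.close

File.read(out.path)  #=> "now\ntime\n"
```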
answered Feb 23, 2014 at 19:44
  • +1 for the use of to_set for the prepositions - you should expand on why you chose it. Commented Feb 23, 2014 at 20:05
  • @UriAgassi. Done! Commented Feb 23, 2014 at 20:42
  • Great @tokland. Performance is actually O(n+m), though - you need to build the set... Commented Feb 24, 2014 at 4:45
  • @UriAgassi: it's my understanding that O(n+m), m<=n -> O(n) (at worst what would be "O(2n)", but constants are ignored, so O(n)). Commented Feb 24, 2014 at 9:27
  • If you know that m<=n, you are right. More generally, though, O(n+m)==O(max(n,m)). If you have a big preposition file and a small dictionary, the preposition file will be the dominant factor in your complexity. Commented Feb 24, 2014 at 10:01

You can do this in Ruby with very little code. Here's one way:

DICT_FNAME = "#{path}" + "dictionary.txt"
NEW_DICT_FNAME = "#{path}" + "new_dictionary.txt"
PREP_FNAME = "#{path}" + "prepositions.txt"
all_words = File.read(DICT_FNAME).split($/).map(&:strip)
prepositions = File.read(PREP_FNAME).split($/).map(&:strip)
File.write(NEW_DICT_FNAME, (all_words - prepositions).join($/))
puts "#{all_words.size} words in the dictionary"
puts "#{prepositions.size} prepositions to be removed"

Let's try it out. First, write some words to the dictionary file and to the file containing the prepositions:

all = <<_
Now
is
the
time
for
all
Rubyists
to
debug
_
preps = <<_
for
to
_
all #=> "Now\nis\nthe\ntime\nfor\nall\nRubyists\nto\ndebug\n"
preps #=> "for\nto\n"
path = ''
DICT_FNAME = "#{path}" + "dictionary.txt" #=> "dictionary.txt"
PREP_FNAME = "#{path}" + "prepositions.txt" #=> "prepositions.txt"
File.write(DICT_FNAME, all) #=> 42
File.write(PREP_FNAME, preps) #=> 7

Now we read the two input files [$/ is the end-of-line character(s)]:

NEW_DICT_FNAME = "#{path}" + "new_dictionary.txt"
all_words = File.read(DICT_FNAME).split($/).map(&:strip)
 #=> ["Now", "is", "the", "time", "for", "all", "Rubyists", "to", "debug"] 
prepositions = File.read(PREP_FNAME).split($/).map(&:strip)
 #=> ["for", "to"]
puts "#{all_words.size} words in the dictionary"
 #=> 9 words in the dictionary
puts "#{prepositions.size} prepositions to be removed"
 #=> 2 prepositions to be removed

...then remove the elements of the prepositions array from the all_words array:

diff = all_words - prepositions
 #=> ["Now", "is", "the", "time", "all", "Rubyists", "debug"]

...format it for writing:

joined = diff.join($/)
 #=> "Now\nis\nthe\ntime\nall\nRubyists\ndebug"

...write the output file:

File.write(NEW_DICT_FNAME, joined)

...and confirm it was written correctly:

File.read(NEW_DICT_FNAME).split($/).map(&:strip)
 #=> ["Now", "is", "the", "time", "all", "Rubyists", "debug"] 

Edit: In view of @tokland's suggestion of constructing a set of prepositions when processing the words one at a time, I thought it might be interesting to run some benchmarks. You'll see that I just used random arrays and words, rather than reading from and writing to files.

require 'set'

L = Array('a'..'z')

def make_samples(n,m,k,s)
  s.times.with_object([]) do |_,a|
    # Construct a sample of n unique words, each of length k
    all_words = make_sample(n,k)
    # Assume the first m words are prepositions
    preps = all_words[0,m]
    # Shuffle the words (no need to further randomize the prepositions)
    a << [all_words.shuffle, preps]
  end
end

def make_sample(n,k)
  set_words = Set.new
  while set_words.size < n do
    set_words << k.times.with_object('') { |_,w| w << L.sample }
  end
  set_words.to_a
end

Here is an example of test data with 8 4-character words, of which 3 are prepositions, and a sample size of 2.

make_samples(8,3,4,2)
 #=> [[["fexz", "gxrv", "acte", "namz", "cpqw", "txsm", "zonm", "tvjz"],
 # ["zonm", "gxrv", "fexz"]],
 # [["nfdf", "djnv", "inqk", "tbgc", "asfb", "nqbg", "dnyb", "ywtv"],
 # ["djnv", "tbgc", "inqk"]]]

These are the parameters I used for the results I report below:

require 'benchmark'

n = 200_000 # number of words, including prepositions
m = 40      # number of prepositions
k = 8       # length of each string
s = 20      # sample size

samples = make_samples(n,m,k,s)
Benchmark.bm('reject - arr'.size) do |bm|
  # words array - preps array
  bm.report('arr - arr') do
    samples.each { |(wa,pa)| wa - pa }
  end
  # reject words in preps array
  bm.report('reject - arr') do
    samples.each { |(wa,pa)| wa.reject { |w| pa.include? w } }
  end
  # reject words in preps set
  bm.report('reject - set') do
    samples.each { |(wa,pa)| ps = pa.to_set; wa.reject { |w| ps.include? w } }
  end
end
 user system total real
arr - arr 0.860000 0.030000 0.890000 ( 0.884945)
reject - arr 10.180000 0.020000 10.200000 ( 10.217233)
reject - set 1.830000 0.040000 1.870000 ( 1.913069)

I ran a few additional tests with different parameters, but these results are indicative.
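As a side note on why the set helps at all (a generic illustration of my own, not one of the benchmarks above): Array#include? scans the array linearly, while Set#include? is a single hash lookup, so membership tests against a large collection are where the nested-scan cost comes from.

```ruby
require 'set'
require 'benchmark'

words = (1..50_000).map { |i| "word#{i}" }
arr = words.dup
set = words.to_set

# Looking up a word near the end of the array forces a near-full linear scan;
# the set answers the same question with one hash probe.
Benchmark.bm(5) do |bm|
  bm.report('array') { 1_000.times { arr.include?('word49999') } }
  bm.report('set')   { 1_000.times { set.include?('word49999') } }
end
```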

answered Feb 23, 2014 at 1:32
  • This would be my first choice if both files fit into memory. Commented Mar 3, 2014 at 0:36
