2
\$\begingroup\$

I've written a top_n program in Ruby, which does pretty much exactly what the title says, as part of a coding exercise. I'm trying to learn about how to sort efficiently (i.e. deciding the best sorting algorithm and implementing them or their variations), to manage memory efficiently, and know whether I'm using the right language/constructs/ideas to tackle a sorting problem.

What I'm looking for is advice on the algorithm and the code in general. I feel it can be done MUCH, MUCH better (i.e. I'm doing splits and casting to integers... my intuition tells me something could be better).

I'd really appreciate if someone gave me some pointers on how could I improve this little program. Thanks in advance.

require 'set'
class File
 def self.tail(path, n = 10)
 result = File.open(path, 'r') do |file|
 buffer_size = 512
 line_count = 0
 file.seek(0, IO::SEEK_END)
 offset = file.pos
 while line_count <= n && offset > 0
 to_read = if (offset - buffer_size) < 0
 offset
 else
 buffer_size
 end
 file.seek(offset - to_read)
 data = file.read(to_read)
 data.reverse.each_char do |c|
 if line_count > n
 offset += 1
 break
 end
 offset -= 1
 if c == "\n"
 line_count += 1
 end
 end
 end
 file.seek(offset)
 file.read
 end
 result
 end
 def each_chunk(chunk_size)
 yield read(chunk_size) until eof?
 end
end
def top_n(filename, n = 100)
 pre_sorted_chunks = Dir[".#{filename}_sorted_chunk_*"]
 if pre_sorted_chunks.empty?
 build_pre_sorted_chunks_for(filename)
 end
 top = SortedSet.new
 # This reference takes ±0.141 seconds.
 # pre_sorted_chunks.each do |file|
 # top << `tail -n #{n} #{file}`.strip.split.map(&:to_i)
 # end
 # This takes ±0.130 seconds. A little better.
 tasks = pre_sorted_chunks.map do |chunk_file_path|
 Thread.new(chunk_file_path) do |file_path|
 top << File.tail(file_path, n).strip.split.map(&:to_i)
 end
 end
 tasks.each(&:join)
 top.max(n)
end
# Current impl. takes ±5m. A lot of time.
def build_pre_sorted_chunks_for(filename)
 File.open(filename) do |file|
 n = 0
 # Chunks of 500MB.
 file.each_chunk(1024 ** 2 * 500) do |chunk|
 numbers = chunk.split.map(&:to_i)
 sorted_set = SortedSet.new(numbers)
 sorted_set = sorted_set.to_a.join("\n")
 File.open(".#{filename}_sorted_chunk_#{n}", 'w') do |f|
 f.write(sorted_set)
 end
 n += 1
 end
 end
end
unless ARGV.empty?
 filename = ARGV[0]
 n = ARGV[1].to_i
 puts top_n(filename, n)
else
 puts "Please provide a filename and a N value to calculate the top N."
end

I've used this method to create a file with random numbers:

def generate_random_number_file(n = 15_000_000)
 require 'set'
 randoms = Set.new
 loop do
 randoms << rand(n)
 break if randoms.size >= n
 end
 File.open('randos.txt', 'w') do |file|
 random_list = randoms.to_a.join("\n")
 file.write(random_list)
 end
end
asked Aug 21, 2017 at 23:24
\$\endgroup\$
8
  • \$\begingroup\$ Is there a reason why you wrote your own File tail, instead of using the 'file-tail' gem? And wrote your own benchmarking code instead of using the conventional Benchmark gem? \$\endgroup\$ Commented Aug 21, 2017 at 23:51
  • \$\begingroup\$ @MarkThomas No particular reason. I just wanted to stick to the core/standard library as much as I could. As for the benchmarking code... I definitely should've used the benchmark module. I'll re-run it and update the benchmarks. Thanks. \$\endgroup\$ Commented Aug 22, 2017 at 0:06
  • \$\begingroup\$ Since we're on Code Review, I must tell you that in the Ruby community it is considered wasteful and even foolish to 'stick to standard libraries' if there exists a gem which does what you want. \$\endgroup\$ Commented Aug 22, 2017 at 0:32
  • \$\begingroup\$ @MarkThomas Yes, I agree entirely. I would never try to reinvent the wheel just to accomplish something already solved efficiently. Unless I'm playing around, trying to learn how things work, which is what I'm doing here right now. \$\endgroup\$ Commented Aug 22, 2017 at 0:45
  • \$\begingroup\$ Is the output the largest n numbers or unique numbers? In other words is it OK if the top n are all the same 15_000_000? \$\endgroup\$ Commented Aug 22, 2017 at 1:55

2 Answers 2

1
\$\begingroup\$

Consider remembering the n largest numbers:

def top_n(filename, n = 100)
 require 'set'
 topn = SortedSet.new()
 File.foreach(filename) do |line|
 num = line.to_i
 if topn.length < n
 topn.add(num)
 elsif num > topn.first
 topn.delete(topn.first)
 topn.add(num)
 end
 end
 return topn.to_a
end
answered Aug 22, 2017 at 23:53
\$\endgroup\$
1
  • \$\begingroup\$ Thanks @NetMage. I wonder how can I stress-test this to see if it's an apt solution for a multi-hundred GB file. \$\endgroup\$ Commented Sep 1, 2017 at 19:04
0
\$\begingroup\$

I wonder if a TopNSet class might be of use here. I'm not sure how it would perform, but it seems like separating the reading of the values from the file, and getting the Top-N are different activities, and a TopNSet might be of value more widely.

I implemented it as:

class TopNSet
 def initialize(n)
 @n = n
 @values = Set.new
 end
 def << number
 values << number
 values.delete(values.min) if values.size > n
 to_a
 end
 def to_a
 values.to_a
 end
 private
 attr_accessor :values, :n
end

... and hence ...

2.2.5 :021 > t = TopNSet.new(3)
 => #<TopNSet:0x000001145cca28 @n=3, @values=#<Set: {}>> 
2.2.5 :022 > t << 1
 => [1] 
2.2.5 :023 > t << 2
 => [1, 2] 
2.2.5 :024 > t << 3
 => [1, 2, 3] 
2.2.5 :025 > t << 4
 => [2, 3, 4] 
2.2.5 :026 > t << 4
 => [2, 3, 4] 
2.2.5 :027 > t << 3
 => [2, 3, 4] 
2.2.5 :028 > t << 7
 => [3, 4, 7] 
2.2.5 :029 > t << 6
 => [4, 7, 6] 
2.2.5 :030 > t << 1
 => [4, 7, 6] 
answered Aug 31, 2017 at 14:05
\$\endgroup\$
2
  • \$\begingroup\$ I used a SortedSet instead of a Set on the theory that inserting a number in the proper position and retrieving the minimum would be more efficient than scanning the entire set for the minimum when a deletion is necessary. \$\endgroup\$ Commented Sep 5, 2017 at 20:51
  • \$\begingroup\$ @NetMage Sounds sensible -- benchmarking Set vs SortedSet would be interesting. \$\endgroup\$ Commented Sep 5, 2017 at 21:08

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.