I've written a top_n program in Ruby, which does pretty much what the title says, as part of a coding exercise. I'm trying to learn how to sort efficiently (i.e. how to choose the best sorting algorithm and implement it or a variation of it), how to manage memory efficiently, and whether I'm using the right language/constructs/ideas to tackle a sorting problem.
What I'm looking for is advice on the algorithm and on the code in general. I feel it can be done much, much better (i.e. I'm doing splits and casting to integers; my intuition tells me something could be done better).
I'd really appreciate some pointers on how I could improve this little program. Thanks in advance.
require 'set'

class File
  # Return the last `n` lines of the file at `path`, found by scanning
  # backwards from the end in fixed-size buffers and counting newlines.
  def self.tail(path, n = 10)
    result = File.open(path, 'r') do |file|
      buffer_size = 512
      line_count = 0
      file.seek(0, IO::SEEK_END)
      offset = file.pos
      while line_count <= n && offset > 0
        to_read = if (offset - buffer_size) < 0
                    offset
                  else
                    buffer_size
                  end
        file.seek(offset - to_read)
        data = file.read(to_read)
        data.reverse.each_char do |c|
          if line_count > n
            offset += 1
            break
          end
          offset -= 1
          if c == "\n"
            line_count += 1
          end
        end
      end
      file.seek(offset)
      file.read
    end
    result
  end

  # Yield the file's contents in chunks of `chunk_size` bytes.
  def each_chunk(chunk_size)
    yield read(chunk_size) until eof?
  end
end
def top_n(filename, n = 100)
  pre_sorted_chunks = Dir[".#{filename}_sorted_chunk_*"]
  if pre_sorted_chunks.empty?
    build_pre_sorted_chunks_for(filename)
    pre_sorted_chunks = Dir[".#{filename}_sorted_chunk_*"]
  end
  top = SortedSet.new
  # This reference takes ±0.141 seconds.
  # pre_sorted_chunks.each do |file|
  #   top << `tail -n #{n} #{file}`.strip.split.map(&:to_i)
  # end
  # This takes ±0.130 seconds. A little better.
  tasks = pre_sorted_chunks.map do |chunk_file_path|
    Thread.new(chunk_file_path) do |file_path|
      top << File.tail(file_path, n).strip.split.map(&:to_i)
    end
  end
  tasks.each(&:join)
  top.max(n)
end
# Current impl. takes ±5m. A lot of time.
def build_pre_sorted_chunks_for(filename)
  File.open(filename) do |file|
    n = 0
    # Chunks of 500MB.
    file.each_chunk(1024 ** 2 * 500) do |chunk|
      numbers = chunk.split.map(&:to_i)
      sorted_set = SortedSet.new(numbers)
      sorted_set = sorted_set.to_a.join("\n")
      File.open(".#{filename}_sorted_chunk_#{n}", 'w') do |f|
        f.write(sorted_set)
      end
      n += 1
    end
  end
end
unless ARGV.empty?
  filename = ARGV[0]
  n = ARGV[1].to_i
  puts top_n(filename, n)
else
  puts "Please provide a filename and an N value to calculate the top N."
end
I've used this method to create a file with random numbers:
def generate_random_number_file(n = 15_000_000)
  require 'set'
  randoms = Set.new
  loop do
    randoms << rand(n)
    break if randoms.size >= n
  end
  File.open('randos.txt', 'w') do |file|
    random_list = randoms.to_a.join("\n")
    file.write(random_list)
  end
end
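For reference, the rough timings in the comments inside top_n (±0.141 s for shelling out to tail vs ±0.130 s for the threaded File.tail) can be measured with Ruby's stdlib Benchmark module. A minimal sketch, assuming the sorted chunk files for randos.txt already exist; it only times reading the tails, not the merge into the SortedSet:
# Minimal timing sketch (assumes the .randos.txt_sorted_chunk_* files exist).
require 'benchmark'

n = 100
chunks = Dir['.randos.txt_sorted_chunk_*']

shell = Benchmark.realtime do
  chunks.each { |file| `tail -n #{n} #{file}`.split.map(&:to_i) }
end

threaded = Benchmark.realtime do
  chunks.map { |file| Thread.new(file) { |f| File.tail(f, n).split.map(&:to_i) } }
        .each(&:join)
end

puts format('shell tail: %.3fs  threaded File.tail: %.3fs', shell, threaded)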
2 Answers
Consider remembering the n largest numbers:
def top_n(filename, n = 100)
  require 'set'
  topn = SortedSet.new
  File.foreach(filename) do |line|
    num = line.to_i
    if topn.length < n
      topn.add(num)
    elsif num > topn.first
      topn.delete(topn.first)
      topn.add(num)
    end
  end
  return topn.to_a
end
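For a quick sanity check, something along these lines can be used (a hypothetical smoke test, not part of the answer; it assumes the randos.txt file generated in the question, and note that on Ruby 3.0+ SortedSet is no longer bundled with require 'set' and needs the separate sorted_set gem):
# Hypothetical smoke test: time the streaming top_n against the file
# generated in the question. Benchmark is in Ruby's standard library.
require 'benchmark'

elapsed = Benchmark.realtime { p top_n('randos.txt', 10) }
puts format('%.2f seconds', elapsed)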
Horacio Bertorello (Sep 1, 2017): Thanks @NetMage. I wonder how I can stress-test this to see if it's an apt solution for a multi-hundred-GB file.
I wonder if a TopNSet class might be of use here. I'm not sure how it would perform, but reading the values from the file and keeping the top N seem to be separate activities, so a TopNSet might be of value more widely.
I implemented it as:
require 'set'

class TopNSet
  def initialize(n)
    @n = n
    @values = Set.new
  end

  def <<(number)
    values << number
    values.delete(values.min) if values.size > n
    to_a
  end

  def to_a
    values.to_a
  end

  private

  attr_accessor :values, :n
end
... and hence ...
2.2.5 :021 > t = TopNSet.new(3)
=> #<TopNSet:0x000001145cca28 @n=3, @values=#<Set: {}>>
2.2.5 :022 > t << 1
=> [1]
2.2.5 :023 > t << 2
=> [1, 2]
2.2.5 :024 > t << 3
=> [1, 2, 3]
2.2.5 :025 > t << 4
=> [2, 3, 4]
2.2.5 :026 > t << 4
=> [2, 3, 4]
2.2.5 :027 > t << 3
=> [2, 3, 4]
2.2.5 :028 > t << 7
=> [3, 4, 7]
2.2.5 :029 > t << 6
=> [4, 7, 6]
2.2.5 :030 > t << 1
=> [4, 7, 6]
NetMage (Sep 5, 2017): I used a SortedSet instead of a Set on the theory that inserting a number in the proper position and retrieving the minimum would be more efficient than scanning the entire set for the minimum when a deletion is necessary.
David Aldridge (Sep 5, 2017): @NetMage Sounds sensible -- benchmarking Set vs SortedSet would be interesting.
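That comparison could be run with something like the sketch below (not from the thread; it assumes SortedSet is available, which on Ruby 3.0+ means installing the sorted_set gem):
# Rough micro-benchmark of Set vs SortedSet for the "keep the n largest"
# insertion loop discussed above. The input numbers are synthetic.
require 'benchmark'
require 'set'

n = 100
data = Array.new(1_000_000) { rand(10_000_000) }

keep_top = lambda do |set|
  data.each do |num|
    if set.size < n
      set << num
    elsif num > set.min
      set.delete(set.min)
      set << num
    end
  end
  set
end

Benchmark.bm(10) do |x|
  x.report('Set')       { keep_top.call(Set.new) }
  x.report('SortedSet') { keep_top.call(SortedSet.new) }
end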