I've written a top_n program in Ruby, which does pretty much what the title says, as part of a coding exercise. I'm trying to learn how to sort efficiently (i.e. how to choose the best sorting algorithm and implement it or a variation of it), how to manage memory efficiently, and whether I'm using the right language/constructs/ideas to tackle a sorting problem.
What I'm looking for is advice on the algorithm and on the code in general. I feel it can be done much, much better (i.e. I'm doing splits and casting to integers; my intuition tells me something could be done better).
I'd really appreciate some pointers on how I could improve this little program. Thanks in advance.
require 'set'

class File
  # Return the last `n` lines of the file at `path`, found by scanning
  # backwards from the end in fixed-size buffers and counting newlines.
  def self.tail(path, n = 10)
    result = File.open(path, 'r') do |file|
      buffer_size = 512
      line_count = 0
      file.seek(0, IO::SEEK_END)
      offset = file.pos
      while line_count <= n && offset > 0
        to_read = if (offset - buffer_size) < 0
                    offset
                  else
                    buffer_size
                  end
        file.seek(offset - to_read)
        data = file.read(to_read)
        data.reverse.each_char do |c|
          if line_count > n
            offset += 1
            break
          end
          offset -= 1
          if c == "\n"
            line_count += 1
          end
        end
      end
      file.seek(offset)
      file.read
    end
    result
  end

  # Yield the file's contents in chunks of `chunk_size` bytes.
  def each_chunk(chunk_size)
    yield read(chunk_size) until eof?
  end
end
def top_n(filename, n = 100)
  pre_sorted_chunks = Dir[".#{filename}_sorted_chunk_*"]
  if pre_sorted_chunks.empty?
    build_pre_sorted_chunks_for(filename)
    pre_sorted_chunks = Dir[".#{filename}_sorted_chunk_*"]
  end
  top = SortedSet.new
  # This reference takes ±0.141 seconds.
  # pre_sorted_chunks.each do |file|
  #   top << `tail -n #{n} #{file}`.strip.split.map(&:to_i)
  # end
  # This takes ±0.130 seconds. A little better.
  tasks = pre_sorted_chunks.map do |chunk_file_path|
    Thread.new(chunk_file_path) do |file_path|
      top << File.tail(file_path, n).strip.split.map(&:to_i)
    end
  end
  tasks.each(&:join)
  top.max(n)
end
# Current impl. takes ±5m. A lot of time.
def build_pre_sorted_chunks_for(filename)
  File.open(filename) do |file|
    n = 0
    # Chunks of 500MB.
    file.each_chunk(1024 ** 2 * 500) do |chunk|
      numbers = chunk.split.map(&:to_i)
      sorted_set = SortedSet.new(numbers)
      sorted_set = sorted_set.to_a.join("\n")
      File.open(".#{filename}_sorted_chunk_#{n}", 'w') do |f|
        f.write(sorted_set)
      end
      n += 1
    end
  end
end
unless ARGV.empty?
  filename = ARGV[0]
  n = ARGV[1].to_i
  puts top_n(filename, n)
else
  puts "Please provide a filename and an N value to calculate the top N."
end
I've used this method to create a file with random numbers:
def generate_random_number_file(n = 15_000_000)
  require 'set'
  randoms = Set.new
  loop do
    randoms << rand(n)
    break if randoms.size >= n
  end
  File.open('randos.txt', 'w') do |file|
    random_list = randoms.to_a.join("\n")
    file.write(random_list)
  end
end
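For reference, the rough timings in the comments inside top_n (±0.141 s for shelling out to tail vs ±0.130 s for the threaded File.tail) can be measured with Ruby's stdlib Benchmark module. A minimal sketch, assuming the sorted chunk files for randos.txt already exist; it only times reading the tails, not the merge into the SortedSet:
# Minimal timing sketch (assumes the .randos.txt_sorted_chunk_* files exist).
require 'benchmark'

n = 100
chunks = Dir['.randos.txt_sorted_chunk_*']

shell = Benchmark.realtime do
  chunks.each { |file| `tail -n #{n} #{file}`.split.map(&:to_i) }
end

threaded = Benchmark.realtime do
  chunks.map { |file| Thread.new(file) { |f| File.tail(f, n).split.map(&:to_i) } }
        .each(&:join)
end

puts format('shell tail: %.3fs  threaded File.tail: %.3fs', shell, threaded)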
2 Answers
Consider remembering the n largest numbers:
def top_n(filename, n = 100)
  require 'set'
  topn = SortedSet.new
  File.foreach(filename) do |line|
    num = line.to_i
    if topn.length < n
      topn.add(num)
    elsif num > topn.first
      topn.delete(topn.first)
      topn.add(num)
    end
  end
  return topn.to_a
end
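For a quick sanity check, something along these lines can be used (a hypothetical smoke test, not part of the answer; it assumes the randos.txt file generated in the question, and note that on Ruby 3.0+ SortedSet is no longer bundled with require 'set' and needs the separate sorted_set gem):
# Hypothetical smoke test: time the streaming top_n against the file
# generated in the question. Benchmark is in Ruby's standard library.
require 'benchmark'

elapsed = Benchmark.realtime { p top_n('randos.txt', 10) }
puts format('%.2f seconds', elapsed)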
Horacio Bertorello (Sep 1, 2017): Thanks @NetMage. I wonder how I can stress-test this to see if it's an apt solution for a multi-hundred-GB file.
I wonder if a TopNSet class might be of use here. I'm not sure how it would perform, but reading the values from the file and keeping the top N seem to be separate activities, so a TopNSet might be of value more widely.
I implemented it as:
require 'set'

class TopNSet
  def initialize(n)
    @n = n
    @values = Set.new
  end

  def <<(number)
    values << number
    values.delete(values.min) if values.size > n
    to_a
  end

  def to_a
    values.to_a
  end

  private

  attr_accessor :values, :n
end
... and hence ...
2.2.5 :021 > t = TopNSet.new(3)
=> #<TopNSet:0x000001145cca28 @n=3, @values=#<Set: {}>>
2.2.5 :022 > t << 1
=> [1]
2.2.5 :023 > t << 2
=> [1, 2]
2.2.5 :024 > t << 3
=> [1, 2, 3]
2.2.5 :025 > t << 4
=> [2, 3, 4]
2.2.5 :026 > t << 4
=> [2, 3, 4]
2.2.5 :027 > t << 3
=> [2, 3, 4]
2.2.5 :028 > t << 7
=> [3, 4, 7]
2.2.5 :029 > t << 6
=> [4, 7, 6]
2.2.5 :030 > t << 1
=> [4, 7, 6]
NetMage (Sep 5, 2017): I used a SortedSet instead of a Set on the theory that inserting a number in the proper position and retrieving the minimum would be more efficient than scanning the entire set for the minimum when a deletion is necessary.
David Aldridge (Sep 5, 2017): @NetMage Sounds sensible -- benchmarking Set vs SortedSet would be interesting.
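That comparison could be run with something like the sketch below (not from the thread; it assumes SortedSet is available, which on Ruby 3.0+ means installing the sorted_set gem):
# Rough micro-benchmark of Set vs SortedSet for the "keep the n largest"
# insertion loop discussed above. The input numbers are synthetic.
require 'benchmark'
require 'set'

n = 100
data = Array.new(1_000_000) { rand(10_000_000) }

keep_top = lambda do |set|
  data.each do |num|
    if set.size < n
      set << num
    elsif num > set.min
      set.delete(set.min)
      set << num
    end
  end
  set
end

Benchmark.bm(10) do |x|
  x.report('Set')       { keep_top.call(Set.new) }
  x.report('SortedSet') { keep_top.call(SortedSet.new) }
end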