Prefix Sum in Ruby, Genomic Range Query from Codility

Question 1

I'm currently going through some lessons on Codility. I've just spent a couple of hours with GenomicRangeQuery, which is intended to demonstrate the use of prefix sums.

The task description is here. Broadly, the idea is that given a string and a set of arbitrary cut points one should be able to return the lowest value character in a given slice. O(N*M) complexity is what we want to avoid.

My solution scores 100%, but I would still value any feedback on style and readability.

def solution(str, slice_starts, slice_ends)
 # Initiate an array to hold rolling totals of character occurrences
 prefix_sums = [0,0,0,0]]
 # Map possible characters to their positions in the prefix sum array
 chars = { "A" => 0, "C" => 1, "G" => 2, "T" => 3 }
 str.split('').each_with_index do |char, i|
 prefix_sums[i + 1] = prefix_sums[i].clone
 prefix_sums[i + 1][chars[char]] += 1
 end
 slice_starts.each_with_index.map do |slice_start, i|
 s = slice_start
 e = slice_ends[i]
 chars.each do |char, ii|
 occurrences = prefix_sums[e + 1][ii] - prefix_sums[s][ii]
 break (ii + 1) if occurrences > 0
 end
 end
end

Updated

Here's my current preferred version using tips from some of the answers below. I should probably also use an array of hashes, rather than arrays, for the prefix sums, but I'm lazy.

def solution(str, slice_starts, slice_ends)
 prefix_sums = [[0] * 4]
 chars = { "A" => 0, "C" => 1, "G" => 2, "T" => 3 }
 str.chars.each_with_index do |char, i|
 prefix_sums[i + 1] = prefix_sums[i].clone
 prefix_sums[i + 1][chars[char]] += 1
 end
 slice_starts.zip(slice_ends).map do |s, e|
 chars.each do |char, ii|
 occurrences = prefix_sums[e + 1][ii] - prefix_sums[s][ii]
 break (ii + 1) if occurrences > 0
 end
 end
end

Question 2

Some notes:

Use 2-space indentation.
[[0,0,0,0]]. Always a space after a comma.
str.split('') ->str.chars
prefix_sums = [[0,0,0,0]]. This kind of implicit structures hinder readability, I'd simply use a hash. A small space penalty (not in big-Oh, though), better readibility.
each_with_index. Use zip instead.
As you say, the problem is about getting partial sums (frequencies). You can implement and use a very generic abstraction (Enumerable#scanl), which can be used in more codility problems.

I'd write this purely functional solution:

def solution(input_str, slice_start, slice_end)
 impact_factors = {"A" => 1, "C" => 2, "G" => 3, "T" => 4}
 frequencies = input_str.chars.scanl(Hash.new(0)) do |freq, n|
 freq.merge(n => freq[n] + 1)
 end
 slice_start.zip(slice_end).map do |from, to|
 difference_count_by_factor = frequencies[to+1].map do |n, count| 
 [impact_factors[n], count - frequencies[from][nucleotide]]
 end.to_h
 difference_count_by_factor.reject { |factor, count| count.zero? }.keys.min
 end
end

Question 3

This is a great answer, thanks. My only reservation is implementing Enumerable#scanl might confuse other Rubyists, especially those like me without a functional background. I like your solution, but from my experience it perhaps doesn't feel like the most idiomatic Ruby.

Question 4

Here's a take on the algorithm you used that takes advantage of more ruby sugar. Explanation and comments are inline:

def solution(s, p, q)
 impacts = {'A'=>1, 'C'=>2, 'G'=>3, 'T'=>4}
 initial = {'A'=>0, 'C'=>0, 'G'=>0, 'T'=>0}
 # Same as your prefix_sums
 # But using hashes to store the nucleotide counts at each index
 counts_at = s.chars.each_with_object({}).with_index do |(c,m), i|
 m[i] = (m[i-1] || initial).clone.tap {|x| x.update(c => x[c] + 1)}
 end
 # Find the necleotides guaranteed to exist in each subsequence
 # Transform them into impacts, and choose the min
 p.zip(q).map do |from, to|
 impacts.keys.reject {|dna| counts_at[from][dna] == counts_at[to][dna]} # dna keys whose count has changed must appear
 .concat([s[from]]) # the first point must appear too
 .map {|dna| impacts[dna]} # turn it into impacts
 .min # select the min
 end
end
s = 'CAGCCTA'
p = [2,5,0]
q = [4,5,6]
p solution(s, p, q) #=> [2, 4, 1]

Question 5

Is #tap necessary in the prefix sum calculation step? I think counts_at may be a better variable name. And I like using hashes for the counts. I think I prefer my method of breaking early out of the #zip though. Fractionally faster and no less readable.

Question 6

You can avoid tap if you do it in two lines, where the first line ends after clone, and the 2nd line is m[i][c] += 1, which many might consider more readable, though it's a more "procedural" style. re break, you're right it's slightly faster, but i think it is slightly less readable vs a purely functional, declarative solution, but that's partly a matter of taste.

Question 7

I do like xrthdsf's solution a lot too.

Question 8

def solution(str, slice_starts, slice_ends)
 scores = { 'A' => 1, 'C' => 2, 'G' => 3, 'T' => 4 }
 occurences = scores.keys.map { |k| [k, [0]*(str.size+1)] }.to_h
 str.chars.each_with_index do |c, i| 
 scores.keys.each { |n| occurences[n][i+1] = occurences[n][i] + ((n == c) ? 1 : 0) }
 end
 slice_starts.zip(slice_ends).map do |from, to|
 scores.keys.each { |n| break(scores[n]) if occurences[n][to+1] - occurences[n][from] > 0 }
 end
end

By the way, I'd use double quotes only in string interpolations.

tokland tokland 11.2k1 gold badge21 silver badges26 bronze badges · Answer 1 · 2015-08-19 09:45:31Z

Some notes:

Use 2-space indentation.
[[0,0,0,0]]. Always a space after a comma.
str.split('') ->str.chars
prefix_sums = [[0,0,0,0]]. This kind of implicit structures hinder readability, I'd simply use a hash. A small space penalty (not in big-Oh, though), better readibility.
each_with_index. Use zip instead.
As you say, the problem is about getting partial sums (frequencies). You can implement and use a very generic abstraction (Enumerable#scanl), which can be used in more codility problems.

I'd write this purely functional solution:

def solution(input_str, slice_start, slice_end)
 impact_factors = {"A" => 1, "C" => 2, "G" => 3, "T" => 4}
 frequencies = input_str.chars.scanl(Hash.new(0)) do |freq, n|
 freq.merge(n => freq[n] + 1)
 end
 slice_start.zip(slice_end).map do |from, to|
 difference_count_by_factor = frequencies[to+1].map do |n, count| 
 [impact_factors[n], count - frequencies[from][nucleotide]]
 end.to_h
 difference_count_by_factor.reject { |factor, count| count.zero? }.keys.min
 end
end

This is a great answer, thanks. My only reservation is implementing Enumerable#scanl might confuse other Rubyists, especially those like me without a functional background. I like your solution, but from my experience it perhaps doesn't feel like the most idiomatic Ruby.

Jonah Jonah 4,39015 silver badges23 bronze badges · Answer 2 · 2015-10-25 05:07:28Z

Here's a take on the algorithm you used that takes advantage of more ruby sugar. Explanation and comments are inline:

def solution(s, p, q)
 impacts = {'A'=>1, 'C'=>2, 'G'=>3, 'T'=>4}
 initial = {'A'=>0, 'C'=>0, 'G'=>0, 'T'=>0}
 # Same as your prefix_sums
 # But using hashes to store the nucleotide counts at each index
 counts_at = s.chars.each_with_object({}).with_index do |(c,m), i|
 m[i] = (m[i-1] || initial).clone.tap {|x| x.update(c => x[c] + 1)}
 end
 # Find the necleotides guaranteed to exist in each subsequence
 # Transform them into impacts, and choose the min
 p.zip(q).map do |from, to|
 impacts.keys.reject {|dna| counts_at[from][dna] == counts_at[to][dna]} # dna keys whose count has changed must appear
 .concat([s[from]]) # the first point must appear too
 .map {|dna| impacts[dna]} # turn it into impacts
 .min # select the min
 end
end
s = 'CAGCCTA'
p = [2,5,0]
q = [4,5,6]
p solution(s, p, q) #=> [2, 4, 1]

Is #tap necessary in the prefix sum calculation step? I think counts_at may be a better variable name. And I like using hashes for the counts. I think I prefer my method of breaking early out of the #zip though. Fractionally faster and no less readable.
You can avoid tap if you do it in two lines, where the first line ends after clone, and the 2nd line is m[i][c] += 1, which many might consider more readable, though it's a more "procedural" style. re break, you're right it's slightly faster, but i think it is slightly less readable vs a purely functional, declarative solution, but that's partly a matter of taste.

cafe_ cafe_ 4653 silver badges9 bronze badges · Answer 3 · 2015-10-25 12:30:59Z

def solution(str, slice_starts, slice_ends)
 scores = { 'A' => 1, 'C' => 2, 'G' => 3, 'T' => 4 }
 occurences = scores.keys.map { |k| [k, [0]*(str.size+1)] }.to_h
 str.chars.each_with_index do |c, i| 
 scores.keys.each { |n| occurences[n][i+1] = occurences[n][i] + ((n == c) ? 1 : 0) }
 end
 slice_starts.zip(slice_ends).map do |from, to|
 scores.keys.each { |n| break(scores[n]) if occurences[n][to+1] - occurences[n][from] > 0 }
 end
end

By the way, I'd use double quotes only in string interpolations.

Stack Exchange Network

Prefix Sum in Ruby, Genomic Range Query from Codility

Updated

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Prefix Sum in Ruby, Genomic Range Query from Codility

Updated

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions