7
\$\begingroup\$

I'm currently going through some lessons on Codility. I've just spent a couple of hours with GenomicRangeQuery, which is intended to demonstrate the use of prefix sums.

The task description is here. Broadly, the idea is that given a string and a set of arbitrary cut points one should be able to return the lowest value character in a given slice. O(N*M) complexity is what we want to avoid.

My solution scores 100%, but I would still value any feedback on style and readability.

def solution(str, slice_starts, slice_ends)
 # Initiate an array to hold rolling totals of character occurrences
 prefix_sums = [0,0,0,0]]
 # Map possible characters to their positions in the prefix sum array
 chars = { "A" => 0, "C" => 1, "G" => 2, "T" => 3 }
 str.split('').each_with_index do |char, i|
 prefix_sums[i + 1] = prefix_sums[i].clone
 prefix_sums[i + 1][chars[char]] += 1
 end
 slice_starts.each_with_index.map do |slice_start, i|
 s = slice_start
 e = slice_ends[i]
 chars.each do |char, ii|
 occurrences = prefix_sums[e + 1][ii] - prefix_sums[s][ii]
 break (ii + 1) if occurrences > 0
 end
 end
end

Updated

Here's my current preferred version using tips from some of the answers below. I should probably also use an array of hashes, rather than arrays, for the prefix sums, but I'm lazy.

def solution(str, slice_starts, slice_ends)
 prefix_sums = [[0] * 4]
 chars = { "A" => 0, "C" => 1, "G" => 2, "T" => 3 }
 str.chars.each_with_index do |char, i|
 prefix_sums[i + 1] = prefix_sums[i].clone
 prefix_sums[i + 1][chars[char]] += 1
 end
 slice_starts.zip(slice_ends).map do |s, e|
 chars.each do |char, ii|
 occurrences = prefix_sums[e + 1][ii] - prefix_sums[s][ii]
 break (ii + 1) if occurrences > 0
 end
 end
end
asked Aug 17, 2015 at 17:24
\$\endgroup\$
0

3 Answers 3

2
\$\begingroup\$

Some notes:

  • Use 2-space indentation.
  • [[0,0,0,0]]. Always a space after a comma.
  • str.split('') ->str.chars
  • prefix_sums = [[0,0,0,0]]. This kind of implicit structures hinder readability, I'd simply use a hash. A small space penalty (not in big-Oh, though), better readibility.
  • each_with_index. Use zip instead.
  • As you say, the problem is about getting partial sums (frequencies). You can implement and use a very generic abstraction (Enumerable#scanl), which can be used in more codility problems.

I'd write this purely functional solution:

def solution(input_str, slice_start, slice_end)
 impact_factors = {"A" => 1, "C" => 2, "G" => 3, "T" => 4}
 frequencies = input_str.chars.scanl(Hash.new(0)) do |freq, n|
 freq.merge(n => freq[n] + 1)
 end
 slice_start.zip(slice_end).map do |from, to|
 difference_count_by_factor = frequencies[to+1].map do |n, count| 
 [impact_factors[n], count - frequencies[from][nucleotide]]
 end.to_h
 difference_count_by_factor.reject { |factor, count| count.zero? }.keys.min
 end
end
answered Aug 19, 2015 at 9:45
\$\endgroup\$
1
  • 1
    \$\begingroup\$ This is a great answer, thanks. My only reservation is implementing Enumerable#scanl might confuse other Rubyists, especially those like me without a functional background. I like your solution, but from my experience it perhaps doesn't feel like the most idiomatic Ruby. \$\endgroup\$ Commented Aug 26, 2015 at 15:23
1
\$\begingroup\$

Here's a take on the algorithm you used that takes advantage of more ruby sugar. Explanation and comments are inline:

def solution(s, p, q)
 impacts = {'A'=>1, 'C'=>2, 'G'=>3, 'T'=>4}
 initial = {'A'=>0, 'C'=>0, 'G'=>0, 'T'=>0}
 # Same as your prefix_sums
 # But using hashes to store the nucleotide counts at each index
 counts_at = s.chars.each_with_object({}).with_index do |(c,m), i|
 m[i] = (m[i-1] || initial).clone.tap {|x| x.update(c => x[c] + 1)}
 end
 # Find the necleotides guaranteed to exist in each subsequence
 # Transform them into impacts, and choose the min
 p.zip(q).map do |from, to|
 impacts.keys.reject {|dna| counts_at[from][dna] == counts_at[to][dna]} # dna keys whose count has changed must appear
 .concat([s[from]]) # the first point must appear too
 .map {|dna| impacts[dna]} # turn it into impacts
 .min # select the min
 end
end
s = 'CAGCCTA'
p = [2,5,0]
q = [4,5,6]
p solution(s, p, q) #=> [2, 4, 1]
answered Oct 25, 2015 at 5:07
\$\endgroup\$
3
  • \$\begingroup\$ Is #tap necessary in the prefix sum calculation step? I think counts_at may be a better variable name. And I like using hashes for the counts. I think I prefer my method of breaking early out of the #zip though. Fractionally faster and no less readable. \$\endgroup\$ Commented Oct 25, 2015 at 8:56
  • 1
    \$\begingroup\$ You can avoid tap if you do it in two lines, where the first line ends after clone, and the 2nd line is m[i][c] += 1, which many might consider more readable, though it's a more "procedural" style. re break, you're right it's slightly faster, but i think it is slightly less readable vs a purely functional, declarative solution, but that's partly a matter of taste. \$\endgroup\$ Commented Oct 25, 2015 at 12:57
  • \$\begingroup\$ I do like xrthdsf's solution a lot too. \$\endgroup\$ Commented Oct 25, 2015 at 13:02
1
\$\begingroup\$
def solution(str, slice_starts, slice_ends)
 scores = { 'A' => 1, 'C' => 2, 'G' => 3, 'T' => 4 }
 occurences = scores.keys.map { |k| [k, [0]*(str.size+1)] }.to_h
 str.chars.each_with_index do |c, i| 
 scores.keys.each { |n| occurences[n][i+1] = occurences[n][i] + ((n == c) ? 1 : 0) }
 end
 slice_starts.zip(slice_ends).map do |from, to|
 scores.keys.each { |n| break(scores[n]) if occurences[n][to+1] - occurences[n][from] > 0 }
 end
end

By the way, I'd use double quotes only in string interpolations.

answered Oct 25, 2015 at 12:30
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.