I'm currently going through some lessons on Codility. I've just spent a couple of hours with GenomicRangeQuery, which is intended to demonstrate the use of prefix sums.
The task description is here. Broadly, the idea is that given a string and a set of arbitrary cut points one should be able to return the lowest value character in a given slice. O(N*M) complexity is what we want to avoid.
My solution scores 100%, but I would still value any feedback on style and readability.
def solution(str, slice_starts, slice_ends)
# Initiate an array to hold rolling totals of character occurrences
prefix_sums = [0,0,0,0]]
# Map possible characters to their positions in the prefix sum array
chars = { "A" => 0, "C" => 1, "G" => 2, "T" => 3 }
str.split('').each_with_index do |char, i|
prefix_sums[i + 1] = prefix_sums[i].clone
prefix_sums[i + 1][chars[char]] += 1
end
slice_starts.each_with_index.map do |slice_start, i|
s = slice_start
e = slice_ends[i]
chars.each do |char, ii|
occurrences = prefix_sums[e + 1][ii] - prefix_sums[s][ii]
break (ii + 1) if occurrences > 0
end
end
end
Updated
Here's my current preferred version using tips from some of the answers below. I should probably also use an array of hashes, rather than arrays, for the prefix sums, but I'm lazy.
def solution(str, slice_starts, slice_ends)
prefix_sums = [[0] * 4]
chars = { "A" => 0, "C" => 1, "G" => 2, "T" => 3 }
str.chars.each_with_index do |char, i|
prefix_sums[i + 1] = prefix_sums[i].clone
prefix_sums[i + 1][chars[char]] += 1
end
slice_starts.zip(slice_ends).map do |s, e|
chars.each do |char, ii|
occurrences = prefix_sums[e + 1][ii] - prefix_sums[s][ii]
break (ii + 1) if occurrences > 0
end
end
end
3 Answers 3
Some notes:
- Use 2-space indentation.
[[0,0,0,0]]
. Always a space after a comma.str.split('')
->str.chars
prefix_sums = [[0,0,0,0]]
. This kind of implicit structures hinder readability, I'd simply use a hash. A small space penalty (not in big-Oh, though), better readibility.each_with_index
. Usezip
instead.- As you say, the problem is about getting partial sums (frequencies). You can implement and use a very generic abstraction (Enumerable#scanl), which can be used in more codility problems.
I'd write this purely functional solution:
def solution(input_str, slice_start, slice_end)
impact_factors = {"A" => 1, "C" => 2, "G" => 3, "T" => 4}
frequencies = input_str.chars.scanl(Hash.new(0)) do |freq, n|
freq.merge(n => freq[n] + 1)
end
slice_start.zip(slice_end).map do |from, to|
difference_count_by_factor = frequencies[to+1].map do |n, count|
[impact_factors[n], count - frequencies[from][nucleotide]]
end.to_h
difference_count_by_factor.reject { |factor, count| count.zero? }.keys.min
end
end
-
1\$\begingroup\$ This is a great answer, thanks. My only reservation is implementing Enumerable#scanl might confuse other Rubyists, especially those like me without a functional background. I like your solution, but from my experience it perhaps doesn't feel like the most idiomatic Ruby. \$\endgroup\$Dae– Dae2015年08月26日 15:23:40 +00:00Commented Aug 26, 2015 at 15:23
Here's a take on the algorithm you used that takes advantage of more ruby sugar. Explanation and comments are inline:
def solution(s, p, q)
impacts = {'A'=>1, 'C'=>2, 'G'=>3, 'T'=>4}
initial = {'A'=>0, 'C'=>0, 'G'=>0, 'T'=>0}
# Same as your prefix_sums
# But using hashes to store the nucleotide counts at each index
counts_at = s.chars.each_with_object({}).with_index do |(c,m), i|
m[i] = (m[i-1] || initial).clone.tap {|x| x.update(c => x[c] + 1)}
end
# Find the necleotides guaranteed to exist in each subsequence
# Transform them into impacts, and choose the min
p.zip(q).map do |from, to|
impacts.keys.reject {|dna| counts_at[from][dna] == counts_at[to][dna]} # dna keys whose count has changed must appear
.concat([s[from]]) # the first point must appear too
.map {|dna| impacts[dna]} # turn it into impacts
.min # select the min
end
end
s = 'CAGCCTA'
p = [2,5,0]
q = [4,5,6]
p solution(s, p, q) #=> [2, 4, 1]
-
\$\begingroup\$ Is
#tap
necessary in the prefix sum calculation step? I thinkcounts_at
may be a better variable name. And I like using hashes for the counts. I think I prefer my method ofbreak
ing early out of the#zip
though. Fractionally faster and no less readable. \$\endgroup\$Dae– Dae2015年10月25日 08:56:26 +00:00Commented Oct 25, 2015 at 8:56 -
1\$\begingroup\$ You can avoid
tap
if you do it in two lines, where the first line ends afterclone
, and the 2nd line ism[i][c] += 1
, which many might consider more readable, though it's a more "procedural" style. rebreak
, you're right it's slightly faster, but i think it is slightly less readable vs a purely functional, declarative solution, but that's partly a matter of taste. \$\endgroup\$Jonah– Jonah2015年10月25日 12:57:13 +00:00Commented Oct 25, 2015 at 12:57 -
\$\begingroup\$ I do like xrthdsf's solution a lot too. \$\endgroup\$Jonah– Jonah2015年10月25日 13:02:46 +00:00Commented Oct 25, 2015 at 13:02
def solution(str, slice_starts, slice_ends)
scores = { 'A' => 1, 'C' => 2, 'G' => 3, 'T' => 4 }
occurences = scores.keys.map { |k| [k, [0]*(str.size+1)] }.to_h
str.chars.each_with_index do |c, i|
scores.keys.each { |n| occurences[n][i+1] = occurences[n][i] + ((n == c) ? 1 : 0) }
end
slice_starts.zip(slice_ends).map do |from, to|
scores.keys.each { |n| break(scores[n]) if occurences[n][to+1] - occurences[n][from] > 0 }
end
end
By the way, I'd use double quotes only in string interpolations.
Explore related questions
See similar questions with these tags.