Summing data from two arrays, grouped and sorted by date

Question 1

I have multiple multidimensional arrays that each have the same structure with different lengths.

arr1 = [["23 Jan",3],["29 Jan",10],["19 Jan",23],["25 Jan",23]]
arr2 = [["31 Jan",7],["29 Jan",12],["01 Feb",4]]

My result would be a merged array without duplicate dates ordered by the date. For the dates that get combined, like 29 Jan above, I would add the second values together.

Here is what the outcome to the above arrays should be:

arr = [["19 Jan",23],["23 Jan",3],["25 Jan",23],["29 Jan",22],["31 Jan",7],["01 Feb",4]]

As you can see the two with the date, 29 Jan, were combined and added.

Here is my current solution:

(arr1 | arr2).group_by(&:first).
 map { |k, v| [k, v.map(&:last).inject(:+)] }.
 sort_by { |x| x.first.to_date }

Which does indeed get my the correct result, I just want to refactor it. Any suggestions/insights would be appreciated.

Question 2

Looks great already, if you ask me. Other than perhaps being more descriptive than k, v, and x, I don't see any need to refactor anything here. I'd probably write the same thing.

Question 3

I don't understand the use of |, two pairs are not duplicated if they are on different arrays, aren't they? It seems you need uniq for each array.

Question 4

@tokland I think it's used instead of Array#+. Here it will have the unfortunate (or desired?) effect of ignoring items in arr2 that are equal to items in arr1 (e.g. if ["1 Jan", 1] is in both arr1 and arr2, the OPs code will return ["Jan 1", 1] instead of ["Jan 1", 2]). If this is intentional, both answers here are not valid, unfortunately. Could you clarify Justin?

Question 5

@tokland, you actually just found a flaw in my solution that I overlook. britishtea, it appears at first glance your solution addresses this.

Question 6

Could arr1 contain ["23 Jan",3] and ["23 Jan",5]. If "yes" (I'm guessing the answer is "no") and if arr2 contained ["23 Jan",7], would ["23 Jan",3+5+7] be part of your result? The efficiency as well as the correctness of the solution depends on whether arr1 or arr2 can contain pairs with the same date.

Question 7

It looks as if you have chosen the wrong data structure. The multidimensional Array looks very much like a Hash. Even your code mentions the terms key (k) and value (v).

If you were to use a Hash, you could use Hash#merge:

one = Hash[[["23 Jan",3],["29 Jan",10],["19 Jan",23],["25 Jan",23]]]
two = Hash[[["31 Jan",7],["29 Jan",12],["01 Feb",4]]]
one.merge(two) { |_value, old, new| old + new }
# => {"23 Jan"=>3, "29 Jan"=>22, "19 Jan"=>23, "25 Jan"=>23, "31 Jan"=>7, "01 Feb"=>4}

^{Note: If you're using a recent version of Ruby, you can replace Hash::[] with Array#to_hash}

If you really do need multidimensional Arrays, you could still use Hashes internally and convert to the required format (Hash#to_a) when needed.

Question 8

Yea I need the result to be a multidimensional array, but of course converting the hash to an array does that very well. this is also much more descriptive than my original solution. The only thing left is to sort the hashes by date.

Question 9

Your solution is algorithmically fine. My main critique is that it isn't obvious at a glance what the code does. In particular, k, v, :first, and :last can only be understood if you mentally step through the code to follow what the data structure looks like at each point.

I'd like to present an alternative solution. It's much less clever — four statements instead of one expression. However, it's easier to see that there is a date-to-value mapping, that I'm grouping by date, summing the values, and sorting by date.

aggregator = Hash.new(0)
arr1.each { |date, val| aggregator[date.to_date] += val }
arr2.each { |date, val| aggregator[date.to_date] += val }
aggregator.to_a.sort_by { |date, val| date }

In addition, I've chosen to call #to_date earlier. Presumably, your #to_date method would treat "01 Feb" and "1 Feb" equally, in case your input is sloppy.

Question 10

Yea, I couldn't agree more that my code presented could have be written in a clearer way. This definitely does a good job of that. The returned array's dates are formatted incorrectly though, [2015年1月19日, 23], but this can be fixed easily.

Question 11

I added your method to the benchmark after removing to_date.

Question 12

Suppose the arrays were large and you were looking for an efficient solution.

Let arr1 and arr2 denote the two arrays, with arr2 being the smaller of the two.

First, convert arr2 to a hash:

h2 = Hash[arr2]

Now map each element of arr1 and append to the result the elements of arr2 that do not have matching dates among the elements of arr1:

arr1.map { |k1,v1| [k1, h2.key?(k1) ? v1 + h2.delete(k1) : v] }.concat(h2.to_a)

Note that h2.delete(k) does double-duty: it deletes the key-value pair k,v2 from h2 and returns v2 to be added to v1.

You can of course tack on .sort to sort the array by date.

I suspect this may be reasonably fast considering that only one hash is constructed and that is for the smaller of the two arrays.

Let's see how this compares with the other methods that have been suggested.

Test case

def test(n1, n2, f1)
 m = [(n1*f1).to_i, n2].min # elements of `arr1` in `arr2`
 [Array.new(n1) { |i| [i,i] }.shuffle, 
 Array.new(n2) { |i| [i+m-n2,i+i] }.shuffle]
end

Suppose, for example, arr1 and arr2 are to have 10 and 8 tuples, respectively, and in 10*0.5 => 5 cases, the first element of a tuple in arr1 matches the first element of a tuple in arr2. Thus 0-5 => 5 tuples in arr1 have no match in arr2 and 8-5 => 3 tuples in arr2 have no match in arr1.

Then

arr1, arr2 = test(10,8,0.5)
 #=> [[[7, 7], [0, 0], [2, 2], [4, 4], [3, 3], [5, 5], [1, 1], [8, 8], [9, 9], [6, 6]],
 # [[-2, 2], [-3, 0], [3, 12], [-1, 4], [4, 14], [0, 6], [2, 10], [1, 8]]]
arr1.sort
 #=> [[0, 0], [1, 1], [2, 2], [3, 3], [4, 4], [5, 5], [6, 6], [7, 7], [8, 8], [9, 9]] 
arr2.sort
 #=> [[-3, 0], [-2, 2], [-1, 4], [0, 6], [1, 8], [2, 10], [3, 12], [4, 14]]

Methods to be benchmarked

def justin(arr1, arr2)
 (arr1 + arr2).group_by(&:first)
 .map { |k, v| [k, v.map(&:last).inject(:+)] }
 .sort_by { |x| x.first }
end

I've corrected justin by changing arr1 | arr2 to arr1 + arr2, and removed to_date.

def btea(arr1, arr2)
 one = Hash[arr1]
 two = Hash[arr2]
 one.merge(two) { |_value, old, new| old + new }.to_a
end

In btea I converted the merged hash back to an array.

def success(arr1, arr2)
 aggregator = Hash.new(0)
 arr1.each { |date, val| aggregator[date] += val }
 arr2.each { |date, val| aggregator[date] += val }
 aggregator.to_a.sort_by { |date, val| date }
end

I removed .to_date in success.

def cary(arr1, arr2)
 h2 = Hash[arr2]
 arr1.map { |d1,n1| [d1, h2.key?(d1) ? n1 + h2.delete(d1) : n1] }.concat(h2.to_a)
end

Confirm methods give the same results

arr1, arr2 = test(1000,800,0.8)
bt = btea(arr1, arr2).sort
su = success(arr1, arr2).sort
justin(arr1, arr2).sort == bt #=> true
bt == su #=> true
su == cary(arr1, arr2).sort #=> true

Benchmark code

require 'benchmark'
def bench(arr1, arr2)
 Benchmark.bm(10) do |bm|
 bm.report "justin" do
 justin(arr1, arr2)
 end
 bm.report "btea" do
 btea(arr1, arr2)
 end
 bm.report "success" do
 success(arr1, arr2)
 end
 bm.report "cary" do
 cary(arr1,arr2)
 end
 end
end

Results

arr1, arr2 = test(1_000_000, 800_000, 0.8)
bench(arr1, arr2)
 user system total real
justin 6.650000 0.160000 6.810000 ( 6.855140)
btea 5.290000 0.100000 5.390000 ( 5.439332)
success 2.970000 0.020000 2.990000 ( 2.999598)
cary 3.970000 0.110000 4.080000 ( 4.090756)
arr1, arr2 = test(2_000_000, 200_000, 0.5)
bench(arr1, arr2)
 user system total real
justin 10.690000 0.190000 10.880000 ( 10.958374)
btea 8.650000 0.180000 8.830000 ( 8.868550)
success 5.240000 0.010000 5.250000 ( 5.268781)
cary 1.330000 0.020000 1.350000 ( 1.351937)
arr1, arr2 = test(1_000_000, 1_000_000, 0.5)
bench(arr1, arr2)
 user system total real
justin 10.660000 0.160000 10.820000 ( 10.845127)
btea 5.810000 0.110000 5.920000 ( 5.922358)
success 4.060000 0.020000 4.080000 ( 4.078427)
cary 2.400000 0.040000 2.440000 ( 2.447106)

britishtea britishtea 4272 silver badges7 bronze badges · Accepted Answer · 2015-02-05 13:55:44Z

It looks as if you have chosen the wrong data structure. The multidimensional Array looks very much like a Hash. Even your code mentions the terms key (k) and value (v).

If you were to use a Hash, you could use Hash#merge:

one = Hash[[["23 Jan",3],["29 Jan",10],["19 Jan",23],["25 Jan",23]]]
two = Hash[[["31 Jan",7],["29 Jan",12],["01 Feb",4]]]
one.merge(two) { |_value, old, new| old + new }
# => {"23 Jan"=>3, "29 Jan"=>22, "19 Jan"=>23, "25 Jan"=>23, "31 Jan"=>7, "01 Feb"=>4}

^{Note: If you're using a recent version of Ruby, you can replace Hash::[] with Array#to_hash}

If you really do need multidimensional Arrays, you could still use Hashes internally and convert to the required format (Hash#to_a) when needed.

Yea I need the result to be a multidimensional array, but of course converting the hash to an array does that very well. this is also much more descriptive than my original solution. The only thing left is to sort the hashes by date.

Stack Exchange Network

Summing data from two arrays, grouped and sorted by date

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Summing data from two arrays, grouped and sorted by date

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions