I'm trying to devise a way to calculate the mean differences (the average absolute difference between any two values in a set) of sub-arrays (starting from arbitrary indices) in an int array. I'll be placing the bounds of each sub-array into "buckets" of differing mean-difference magnitudes.
The problem is, I can't find an efficient (better than O(n^3)) way of doing this. Would anyone mind helping me make the code more efficient (at least better than O(n^2))?
The calculations are performed server-side upon user request from my web-app. If making it more efficient isn't possible, would it be advisable to adopt a solution that doesn't involve modifying the computation, such as:
- Keep this inefficient implementation and include a disclaimer stating it may take some time to complete
- Transfer the calculations to several daemon threads that run the process over the entire user base and store the results in the database, to be returned to users on request (the returned results may not be up-to-date)
I'm currently at a crossroads and am not sure which route I should take.
Pseudocode of main function:
/* The purpose of the main function is to store the mean differences of
   sub-arrays in a given array, making sure the stored sub-arrays are as
   long as possible. */
for i in dataArray
    loop through all j (j starting from i + 1) in dataArray
        get mean difference of values between indexes i and j
        get int value of mean difference (bucket)
        if bucket != (bucket of mean difference of values between indexes i and j - 1)
            store i and j in bucket
Main function:
public static HashMap<Integer, Stack<HashMap<String, Integer>>> calculateMeanDifferences(ArrayList<Integer> dataArrayList)
{
    //Each key maps to a stack holding HashMaps with the bounding indices of sub-arrays whose mean differences round to the key
    HashMap<Integer, Stack<HashMap<String, Integer>>> meanDifferenceBucketHashMap = new HashMap<Integer, Stack<HashMap<String, Integer>>>();
    int size = dataArrayList.size();
    Integer previousMeanDifferenceBucket = null;
    Integer currentMeanDifferenceBucket = null;
    double currentMeanDifference = 0;
    //Major loop which starts the mean difference calculations from every index
    for(int i = 0; i < size; i++)
    {
        //Minor loop which calculates the mean differences for sub-arrays of increasing length starting from i
        for(int j = i + 2; j < size + 1; j++)
        {
            currentMeanDifference = calculateMeanDifference(new ArrayList<Integer>(dataArrayList.subList(i, j)));
            currentMeanDifferenceBucket = (int)Math.round(currentMeanDifference);
            //Ensure the longest possible sub-array is recorded (so, for all i and j, if both subList(i, j - 1) and subList(i, j) are recorded, they are in different buckets)
            //Note: boxed Integers must be compared with equals(), not !=, which compares references
            boolean bucketChanged = previousMeanDifferenceBucket != null && !previousMeanDifferenceBucket.equals(currentMeanDifferenceBucket);
            if(bucketChanged || j == size)
            {
                //When the loop ends without a bucket change, record the current bucket (this also guards against a null key)
                Integer bucketToRecord = bucketChanged ? previousMeanDifferenceBucket : currentMeanDifferenceBucket;
                HashMap<String, Integer> subArrayBoundsHashMap = new HashMap<String, Integer>();
                subArrayBoundsHashMap.put("start", i);
                subArrayBoundsHashMap.put("onePastEnd", j);
                if(!meanDifferenceBucketHashMap.containsKey(bucketToRecord))
                    meanDifferenceBucketHashMap.put(bucketToRecord, new Stack<HashMap<String, Integer>>());
                //Push unconditionally; the original only pushed when the bucket already existed, silently dropping the first entry of each bucket
                meanDifferenceBucketHashMap.get(bucketToRecord).push(subArrayBoundsHashMap);
            }
            previousMeanDifferenceBucket = currentMeanDifferenceBucket;
        }
        previousMeanDifferenceBucket = currentMeanDifferenceBucket = null;
    }
    return meanDifferenceBucketHashMap;
}
Sub-function (not necessary to look over; just know it is O(n) for my use case, since all values are guaranteed to be in the set [0,5], so there are at most six distinct values):
public static double calculateMeanDifference(ArrayList<Integer> valuesArrayList)
{
    //Count occurrences of each value; with values limited to [0,5] there are at most six entries
    HashMap<Integer, Double> valueCountsHashMap = new HashMap<Integer, Double>();
    double size = valuesArrayList.size();
    for(int i = 0; i < size; i++)
    {
        int currentValue = valuesArrayList.get(i);
        if(!valueCountsHashMap.containsKey(currentValue))
            valueCountsHashMap.put(currentValue, 1.0);
        else
            valueCountsHashMap.put(currentValue, valueCountsHashMap.get(currentValue) + 1);
    }
    //Sum |a - b| over all ordered pairs of values, weighted by the counts of each value
    double sum = 0;
    for(Map.Entry<Integer, Double> valueCountKeyValuePair : valueCountsHashMap.entrySet())
    {
        int currentValue = valueCountKeyValuePair.getKey();
        double currentCount = valueCountKeyValuePair.getValue();
        for(Map.Entry<Integer, Double> valueCountKeyValuePair1 : valueCountsHashMap.entrySet())
        {
            int loopValue = valueCountKeyValuePair1.getKey();
            double loopCount = valueCountKeyValuePair1.getValue();
            sum += currentValue != loopValue ? Math.abs(currentValue - loopValue) * loopCount * currentCount : 0;
        }
    }
    return sum / (size * (size - 1));
}
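As an aside, the [0,5] guarantee also means the mean difference can be maintained incrementally: when a sub-array grows by one element, the new element only has to be compared against six possible counts. A sketch of that idea (the class and method names here are hypothetical, not part of my code above):

```java
public class PrefixMeanDifference {
    // Because every value lies in [0,5], a six-entry count array lets the
    // pairwise-difference sum be updated in O(1) per appended element.
    // result[k] is the mean difference of values[0..k], using the same
    // ordered-pair denominator n * (n - 1) as calculateMeanDifference.
    public static double[] prefixMeanDifferences(int[] values) {
        int[] counts = new int[6];
        long pairSum = 0; // sum of |a - b| over all unordered pairs seen so far
        double[] result = new double[values.length];
        for (int k = 0; k < values.length; k++) {
            int v = values[k];
            // add the new element's differences against everything seen so far
            for (int u = 0; u <= 5; u++)
                pairSum += (long) counts[u] * Math.abs(v - u);
            counts[v]++;
            int n = k + 1;
            result[k] = n > 1 ? 2.0 * pairSum / ((long) n * (n - 1)) : 0;
        }
        return result;
    }
}
```

Calling something like this once per starting index i, instead of re-running calculateMeanDifference for every (i, j), would bring the whole procedure down to O(n^2).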
- Could you, instead of just posting your code, provide some (links to) hints on the actual math you are trying to implement, as well as pseudocode of your algorithm? This would make it much easier to see if there are any shortcuts you are missing. – Tomas Aschan, Jan 20, 2012
- @TomasLycken: The methodology behind calculating the mean difference can be found in the first image in my SO post here. That part isn't of particular interest, though; it will always be O(n). I'll include pseudocode of what I'm trying to do; it'll be up in a couple of minutes. – Kevin, Jan 20, 2012
- Mean difference calculation is O(n^2), surely? – Tim Gee, Jan 20, 2012
- @TimGee: Depending on the implementation (see the linked post in my first comment). Using this particular function with random values in the dataArray, it would be O(n^2). For my use case, however, it is O(n), since all the values in the data array are guaranteed to be in the set [0,5]. – Kevin, Jan 20, 2012
- It's still not clear what you are generally trying to achieve and what the buckets are, exactly. – cyborg, Jan 21, 2012
3 Answers
The first optimisation is to avoid using the boxed types Double and Integer when you really want the primitives double and int. You can use Trove4j, which provides collections of these primitive types. Additionally, I wouldn't use double when you mean int, e.g. for counts, which can only be whole numbers.
I would also avoid counting the combinations twice: e.g. if you have an array of two elements, you compare both a1 - a2 and a2 - a1, which have the same absolute difference.
Can you make any assumptions on the range of int values?
This example
public static void main(String... args) throws IOException {
    Random rand = new Random(1);
    int size = 100000;
    TIntArrayList valuesArrayList = new TIntArrayList(size);
    for (int i = 0; i < size; i++) {
        valuesArrayList.add(rand.nextInt(size) - rand.nextInt(size));
    }
    long start = System.nanoTime();
    double d = calculateMeanDifference(valuesArrayList);
    long time = System.nanoTime() - start;
    System.out.printf("Took %.3f seconds to calculate d=%.3f for size=%,d values%n", time / 1e9, d, size);
}

public static double calculateMeanDifference(TIntArrayList valuesArrayList) {
    final TIntIntHashMap counts = new TIntIntHashMap();
    valuesArrayList.forEach(new TIntProcedure() {
        @Override
        public boolean execute(int value) {
            counts.adjustOrPutValue(value, 1, 1);
            return true;
        }
    });
    // copy the map into parallel arrays so the pair loop can index i < j directly
    final int[] uniqueValues = new int[counts.size()], uniqueCounts = new int[counts.size()];
    counts.forEachEntry(new TIntIntProcedure() {
        int i = 0;
        @Override
        public boolean execute(int a, int b) {
            uniqueValues[i] = a;
            uniqueCounts[i++] = b;
            return true;
        }
    });
    long sum = 0;
    for (int i = 1; i < uniqueValues.length; i++) {
        int vi = uniqueValues[i];
        int ci = uniqueCounts[i];
        for (int j = 0; j < i; j++) {
            long counts2 = (long) ci * uniqueCounts[j]; // widen to long to avoid int overflow
            sum += Math.abs(vi - uniqueValues[j]) * counts2;
        }
    }
    // each unordered pair was counted once, so double the sum for the ordered-pair mean
    return 2.0 * sum / valuesArrayList.size() / (valuesArrayList.size() - 1);
}
prints
Took 12.173 seconds to calculate d=46786.434 for size=100,000 values
- Thanks for the tips. In actuality, I'll be processing an array of objects from which the values will be retrieved, so I won't be able to use TIntArrayList. And yes, an important assumption can be made about the values processed: they will all be in the range [0,5]. I'm a bit confused as to why count is being increased by the product of every two unique counts processed. In the return statement, what is count supposed to represent? In the official formula, the denominator is N * (N - 1), where N, in your code, is the summation of all values in uniqueCounts[]. – Kevin, Jan 21, 2012
- Additionally, in your code you loop through the map just to insert the unique values and counts into separate arrays, then iterate through those. I know in theory iterating through arrays is faster than going over each Map.Entry, but won't having to iterate through 100,000 elements once (map, nested loop in my code) beat doing it twice (creation of the value and count arrays, then the nested loop over the arrays) every time? Am I missing something? – Kevin, Jan 21, 2012
- Ah. I assume your stance is that avoiding calculating |a[i] - a[j]| when |a[j] - a[i]| is already known may offset the extra processing done by your code. I suppose I could use a map of pairs to store and re-use the calculated mini-sums to avoid doing that in my code. – Kevin, Jan 21, 2012
- You are right that count is not strictly needed. Since you are multiplying by the values, I don't see how they can be objects. You are performing O(n^2) lookups, so iterating to build an array isn't significant. It also simplifies accessing i < j. – Peter Lawrey, Jan 21, 2012
- What I meant to say is that the argument array will actually be an array of Maps, each of which contains the value of interest. So if mapArray is the input array, mapArray[i].get("value") for each i would give the values we are computing the mean difference for. So the input parameter can't be of type TIntArrayList. In any case, your method is SUPER fast. I'll definitely be using it. Thanks. – Kevin, Jan 21, 2012
Let us define two functions:
aggr_sum(a, i, j) = |a[i] - a[j]| + ... + |a[j-1] - a[j]|
sum_difference(a, i, j) = sum of absolute differences of all pairs of elements from a[i]...a[j]
You can calculate aggr_sum for all values of i and j in O(n^2) time via the recurrence:
aggr_sum(a, i, j) = aggr_sum(a, i+1, j) + |a[i] - a[j]|
Now,
sum_difference(a, i, j) = sum_difference(a, i, j-1) + aggr_sum(a, i, j)
which is also O(n^2) overall.
Then you can divide sum_difference by the number of pairs to get the mean difference.
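The two recurrences can be combined into a single O(n^2) pass. Below is a sketch (the class and method names are mine, not from the answer) that fills a table meanDiff[i][j] for every sub-array, using the same ordered-pair denominator as the question's calculateMeanDifference:

```java
public class AllMeanDifferences {
    // meanDiff[i][j] = mean absolute difference of a[i..j] (inclusive), for i < j.
    public static double[][] allMeanDifferences(int[] a) {
        int n = a.length;
        double[][] meanDiff = new double[n][n];
        long[] aggrSum = new long[n]; // aggr_sum(a, i, j) for the current j
        long[] sumDiff = new long[n]; // sum_difference(a, i, j) for the current j
        for (int j = 1; j < n; j++) {
            aggrSum[j] = 0; // base case: empty sum when the range starts at j
            for (int i = j - 1; i >= 0; i--) {
                // aggr_sum(a, i, j) = aggr_sum(a, i+1, j) + |a[i] - a[j]|
                aggrSum[i] = aggrSum[i + 1] + Math.abs(a[i] - a[j]);
                // sum_difference(a, i, j) = sum_difference(a, i, j-1) + aggr_sum(a, i, j)
                sumDiff[i] += aggrSum[i];
                long len = j - i + 1;
                // ordered-pair mean: double the unordered sum, divide by len * (len - 1)
                meanDiff[i][j] = 2.0 * sumDiff[i] / (len * (len - 1));
            }
        }
        return meanDiff;
    }
}
```

Feeding these precomputed values into the bucketing loop removes the repeated calculateMeanDifference calls, at the cost of O(n^2) space for the table (a single row suffices if only one starting index is processed at a time).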
- I see. This is an attempt at the incremental approach that Tim Gee alluded to in a comment on the original post. However, I don't think that's possible for the mean difference calculation. Once a new number is added into the mix, you have to add |(new number) - a[i]| for all i in a to the sum of the previous mean difference calculation; you can't just perform one more calculation and increment the number of elements to divide by. – Kevin, Jan 21, 2012
- @Kevin All I am saying is, whenever you add a new value you can update the calculations in O(n) steps (courtesy of the aggr_sum function). So even with incremental addition, the complexity is still O(n^2). – ElKamina, Jan 21, 2012
I ended up changing what I wanted to analyze in order to decrease the amount of computation required. The solution is scenario-specific: it involves analyzing groups of the array elements (represented as numbers in the original question, but actually Objects containing these values) created on the same day, rather than analyzing all the contiguous subsets of the array.
A solution to the actual problem still hasn't been proposed, so I'll leave the question unanswered in the hope that someone will come up with one!