I'm trying to devise a way to calculate the mean differences (the average absolute difference between any two values in a set) of sub-arrays (starting from arbitrary indices) in an int array. I'll be placing the bounds of each sub-array into "buckets" of differing mean-difference magnitudes.
The problem is, I can't find an efficient (better than O(n^3)) way of doing this. Would anyone mind helping me make the code more efficient (at least better than O(n^2))?
The calculations are performed server-side upon user request from my web-app. If making it more efficient isn't possible, would it be advisable to adopt a solution that doesn't involve modifying the computation, such as:
- Keep this inefficient implementation and include a disclaimer stating it may take some time to complete
- Transfer the calculations to several daemon threads that run the process over the entire user base and store the results in the database, to be returned to users on request (the returned results may not be up-to-date)
I'm currently at a crossroads and am not sure which route I should take.
Pseudocode of main function:
/* The purpose of the main function is to store the mean differences of
   sub-arrays in a given array, making sure the stored sub-arrays are as
   long as possible. */
for i in dataArray
    loop through all j (j starting from i + 1) in dataArray
        get mean difference of values between indexes i and j
        get int value of mean difference (bucket)
        if bucket != (bucket of mean difference of values between indexes i and j - 1)
            store i and j in bucket
Main function:
public static HashMap<Integer, Stack<HashMap<String, Integer>>> calculateMeanDifferences(ArrayList<Integer> dataArrayList)
{
    //Each key maps to a stack holding HashMaps with the bounding indices of sub-arrays whose mean differences round to the key
    HashMap<Integer, Stack<HashMap<String, Integer>>> meanDifferenceBucketHashMap = new HashMap<Integer, Stack<HashMap<String, Integer>>>();
    int size = dataArrayList.size();
    Integer previousMeanDifferenceBucket = null;
    Integer currentMeanDifferenceBucket = null;
    double currentMeanDifference = 0;
    //Major loop which starts the mean difference calculations from every index
    for(int i = 0; i < size; i++)
    {
        //Minor loop which calculates the mean differences for sub-arrays of increasing length starting from i
        for(int j = i + 2; j < size + 1; j++)
        {
            currentMeanDifference = calculateMeanDifference(new ArrayList<Integer>(dataArrayList.subList(i, j)));
            currentMeanDifferenceBucket = (int)Math.round(currentMeanDifference);
            //Ensure the longest possible sub-array is recorded (so, for all i and j, if both subList(i, j - 1) and subList(i, j) are recorded, they are in different buckets)
            //Note: boxed Integers must be compared with equals(), not !=, which compares references
            boolean bucketChanged = previousMeanDifferenceBucket != null && !previousMeanDifferenceBucket.equals(currentMeanDifferenceBucket);
            if(bucketChanged || j == size)
            {
                //When the loop ends without a bucket change, record the current bucket (this also guards against a null key)
                Integer bucketToRecord = bucketChanged ? previousMeanDifferenceBucket : currentMeanDifferenceBucket;
                HashMap<String, Integer> subArrayBoundsHashMap = new HashMap<String, Integer>();
                subArrayBoundsHashMap.put("start", i);
                subArrayBoundsHashMap.put("onePastEnd", j);
                if(!meanDifferenceBucketHashMap.containsKey(bucketToRecord))
                    meanDifferenceBucketHashMap.put(bucketToRecord, new Stack<HashMap<String, Integer>>());
                //Push unconditionally; the original only pushed when the bucket already existed, silently dropping the first entry of each bucket
                meanDifferenceBucketHashMap.get(bucketToRecord).push(subArrayBoundsHashMap);
            }
            previousMeanDifferenceBucket = currentMeanDifferenceBucket;
        }
        previousMeanDifferenceBucket = currentMeanDifferenceBucket = null;
    }
    return meanDifferenceBucketHashMap;
}
Sub-function (not necessary to look over; just know it is O(n) for my use case, since all values are guaranteed to be in the set [0,5], so there are at most six distinct values):
public static double calculateMeanDifference(ArrayList<Integer> valuesArrayList)
{
    //Count occurrences of each value; with values limited to [0,5] there are at most six entries
    HashMap<Integer, Double> valueCountsHashMap = new HashMap<Integer, Double>();
    double size = valuesArrayList.size();
    for(int i = 0; i < size; i++)
    {
        int currentValue = valuesArrayList.get(i);
        if(!valueCountsHashMap.containsKey(currentValue))
            valueCountsHashMap.put(currentValue, 1.0);
        else
            valueCountsHashMap.put(currentValue, valueCountsHashMap.get(currentValue) + 1);
    }
    //Sum |a - b| over all ordered pairs of values, weighted by the counts of each value
    double sum = 0;
    for(Map.Entry<Integer, Double> valueCountKeyValuePair : valueCountsHashMap.entrySet())
    {
        int currentValue = valueCountKeyValuePair.getKey();
        double currentCount = valueCountKeyValuePair.getValue();
        for(Map.Entry<Integer, Double> valueCountKeyValuePair1 : valueCountsHashMap.entrySet())
        {
            int loopValue = valueCountKeyValuePair1.getKey();
            double loopCount = valueCountKeyValuePair1.getValue();
            sum += currentValue != loopValue ? Math.abs(currentValue - loopValue) * loopCount * currentCount : 0;
        }
    }
    return sum / (size * (size - 1));
}
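As an aside, the [0,5] guarantee also means the mean difference can be maintained incrementally: when a sub-array grows by one element, the new element only has to be compared against six possible counts. A sketch of that idea (the class and method names here are hypothetical, not part of my code above):

```java
public class PrefixMeanDifference {
    // Because every value lies in [0,5], a six-entry count array lets the
    // pairwise-difference sum be updated in O(1) per appended element.
    // result[k] is the mean difference of values[0..k], using the same
    // ordered-pair denominator n * (n - 1) as calculateMeanDifference.
    public static double[] prefixMeanDifferences(int[] values) {
        int[] counts = new int[6];
        long pairSum = 0; // sum of |a - b| over all unordered pairs seen so far
        double[] result = new double[values.length];
        for (int k = 0; k < values.length; k++) {
            int v = values[k];
            // add the new element's differences against everything seen so far
            for (int u = 0; u <= 5; u++)
                pairSum += (long) counts[u] * Math.abs(v - u);
            counts[v]++;
            int n = k + 1;
            result[k] = n > 1 ? 2.0 * pairSum / ((long) n * (n - 1)) : 0;
        }
        return result;
    }
}
```

Calling something like this once per starting index i, instead of re-running calculateMeanDifference for every (i, j), would bring the whole procedure down to O(n^2).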
- Could you, instead of just posting your code, provide some (links to) hints on the actual math you are trying to implement, as well as pseudocode of your algorithm? This would make it much easier to see if there are any shortcuts you are missing. – Tomas Aschan, Jan 20, 2012
- @TomasLycken: The methodology behind calculating the mean difference can be found in the first image in my SO post here. That part isn't of particular interest, though; it will always be O(n). I'll include pseudocode of what I'm trying to do; it'll be up in a couple of minutes. – Kevin, Jan 20, 2012
- Mean difference calculation is O(n^2), surely? – Tim Gee, Jan 20, 2012
- @TimGee: Depending on the implementation (see the linked post in my first comment). Using this particular function with random values in the dataArray, it would be O(n^2). For my use case, however, it is O(n), since all the values in the data array are guaranteed to be in the set [0,5]. – Kevin, Jan 20, 2012
- It's still not clear what you are generally trying to achieve and what the buckets are, exactly. – cyborg, Jan 21, 2012
3 Answers
The first optimisation is to avoid using the boxed types Double and Integer when you really want the primitives double and int. You can use Trove4j, which provides collections of these primitive types. Additionally, I wouldn't use double when you mean int, e.g. for counts, which can only be whole numbers.
I would also avoid counting the combinations twice: e.g. if you have an array of two elements, you compare both a1 - a2 and a2 - a1, which have the same absolute difference.
Can you make any assumptions on the range of int values?
This example
public static void main(String... args) throws IOException {
    Random rand = new Random(1);
    int size = 100000;
    TIntArrayList valuesArrayList = new TIntArrayList(size);
    for (int i = 0; i < size; i++) {
        valuesArrayList.add(rand.nextInt(size) - rand.nextInt(size));
    }
    long start = System.nanoTime();
    double d = calculateMeanDifference(valuesArrayList);
    long time = System.nanoTime() - start;
    System.out.printf("Took %.3f seconds to calculate d=%.3f for size=%,d values%n", time / 1e9, d, size);
}

public static double calculateMeanDifference(TIntArrayList valuesArrayList) {
    final TIntIntHashMap counts = new TIntIntHashMap();
    valuesArrayList.forEach(new TIntProcedure() {
        @Override
        public boolean execute(int value) {
            counts.adjustOrPutValue(value, 1, 1);
            return true;
        }
    });
    // copy the map into parallel arrays so the pair loop can index i < j directly
    final int[] uniqueValues = new int[counts.size()], uniqueCounts = new int[counts.size()];
    counts.forEachEntry(new TIntIntProcedure() {
        int i = 0;
        @Override
        public boolean execute(int a, int b) {
            uniqueValues[i] = a;
            uniqueCounts[i++] = b;
            return true;
        }
    });
    long sum = 0;
    for (int i = 1; i < uniqueValues.length; i++) {
        int vi = uniqueValues[i];
        int ci = uniqueCounts[i];
        for (int j = 0; j < i; j++) {
            long counts2 = (long) ci * uniqueCounts[j]; // widen to long to avoid int overflow
            sum += Math.abs(vi - uniqueValues[j]) * counts2;
        }
    }
    // each unordered pair was counted once, so double the sum for the ordered-pair mean
    return 2.0 * sum / valuesArrayList.size() / (valuesArrayList.size() - 1);
}
prints
Took 12.173 seconds to calculate d=46786.434 for size=100,000 values
- Thanks for the tips. In actuality, I'll be processing an array of objects from which the values will be retrieved, so I won't be able to use TIntArrayList. And yes, an important assumption can be made about the values processed: they will all be in the range [0,5]. I'm a bit confused as to why count is being increased by the product of every two unique counts processed. In the return statement, what is count supposed to represent? In the official formula, the denominator is N * (N - 1), where N, in your code, is the summation of all values in uniqueCounts[]. – Kevin, Jan 21, 2012
- Additionally, in your code you loop through the map just to insert the unique values and counts into separate arrays, then iterate through those. I know in theory iterating through arrays is faster than going over each Map.Entry, but won't having to iterate through 100,000 elements once (map, nested loop in my code) beat doing it twice (creation of the value and count arrays, then the nested loop over the arrays) every time? Am I missing something? – Kevin, Jan 21, 2012
- Ah. I assume your stance is that avoiding calculating |a[i] - a[j]| when |a[j] - a[i]| is already known may offset the extra processing done by your code. I suppose I could use a map of pairs to store and re-use the calculated mini-sums to avoid doing that in my code. – Kevin, Jan 21, 2012
- You are right that count is not strictly needed. Since you are multiplying by the values, I don't see how they can be objects. You are performing O(n^2) lookups, so iterating to build an array isn't significant. It also simplifies accessing i < j. – Peter Lawrey, Jan 21, 2012
- What I meant to say is that the argument array will actually be an array of Maps, each of which contains the value of interest. So if mapArray is the input array, mapArray[i].get("value") for each i would give the values we are computing the mean difference for. So the input parameter can't be of type TIntArrayList. In any case, your method is SUPER fast. I'll definitely be using it. Thanks. – Kevin, Jan 21, 2012
Let us define two functions:
aggr_sum(a, i, j) = |a[i] - a[j]| + ... + |a[j-1] - a[j]|
sum_difference(a, i, j) = sum of absolute differences of all pairs of elements from a[i]...a[j]
You can calculate aggr_sum for all values of i and j in O(n^2) time via the recurrence:
aggr_sum(a, i, j) = aggr_sum(a, i+1, j) + |a[i] - a[j]|
Now,
sum_difference(a, i, j) = sum_difference(a, i, j-1) + aggr_sum(a, i, j)
which is also O(n^2) overall.
Then you can divide sum_difference by the number of pairs to get the mean difference.
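The two recurrences can be combined into a single O(n^2) pass. Below is a sketch (the class and method names are mine, not from the answer) that fills a table meanDiff[i][j] for every sub-array, using the same ordered-pair denominator as the question's calculateMeanDifference:

```java
public class AllMeanDifferences {
    // meanDiff[i][j] = mean absolute difference of a[i..j] (inclusive), for i < j.
    public static double[][] allMeanDifferences(int[] a) {
        int n = a.length;
        double[][] meanDiff = new double[n][n];
        long[] aggrSum = new long[n]; // aggr_sum(a, i, j) for the current j
        long[] sumDiff = new long[n]; // sum_difference(a, i, j) for the current j
        for (int j = 1; j < n; j++) {
            aggrSum[j] = 0; // base case: empty sum when the range starts at j
            for (int i = j - 1; i >= 0; i--) {
                // aggr_sum(a, i, j) = aggr_sum(a, i+1, j) + |a[i] - a[j]|
                aggrSum[i] = aggrSum[i + 1] + Math.abs(a[i] - a[j]);
                // sum_difference(a, i, j) = sum_difference(a, i, j-1) + aggr_sum(a, i, j)
                sumDiff[i] += aggrSum[i];
                long len = j - i + 1;
                // ordered-pair mean: double the unordered sum, divide by len * (len - 1)
                meanDiff[i][j] = 2.0 * sumDiff[i] / (len * (len - 1));
            }
        }
        return meanDiff;
    }
}
```

Feeding these precomputed values into the bucketing loop removes the repeated calculateMeanDifference calls, at the cost of O(n^2) space for the table (a single row suffices if only one starting index is processed at a time).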
- I see. This is an attempt at the incremental approach that Tim Gee alluded to in a comment on the original post. However, I don't think that's possible for the mean difference calculation. Once a new number is added into the mix, you have to add |(new number) - a[i]| for all i in a to the sum of the previous mean difference calculation; you can't just perform one more calculation and increment the number of elements to divide by. – Kevin, Jan 21, 2012
- @Kevin All I am saying is, whenever you add a new value you can update the calculations in O(n) steps (courtesy of the aggr_sum function). So even with incremental addition, the complexity is still O(n^2). – ElKamina, Jan 21, 2012
I ended up changing what I wanted to analyze in order to decrease the amount of computation required. The solution is scenario-specific: it involves analyzing groups of the array elements (represented as numbers in the original question, but actually Objects containing these values) created on the same day, rather than analyzing all the contiguous subsets of the array.
A solution to the actual problem still hasn't been proposed, so I'll leave the question unanswered in the hope that someone will come up with one!