I'm supposed to optimize this method for my team (we're working on a video decoder in Java), but I don't see any good approach to doing so. The function below doesn't seem like it could be sped up by any significant amount, since it consists mostly of simple addition, subtraction, and the like.
void inverseTransform(int macroBlockIndex, int dataBlockIndex) {
    int[] workSpace = new int[64];
    short[] data = new short[64];
    int z1, z2, z3, z4, z5;
    int tmp0, tmp1, tmp2, tmp3;
    int tmp10, tmp11, tmp12, tmp13;
    int pointer = 0;
    for (int index = 8; index > 0; index--) {
        if (dataBlockBuffer[pointer + 8] == 0 && dataBlockBuffer[pointer + 16] == 0
                && dataBlockBuffer[pointer + 24] == 0 && dataBlockBuffer[pointer + 32] == 0
                && dataBlockBuffer[pointer + 40] == 0 && dataBlockBuffer[pointer + 48] == 0
                && dataBlockBuffer[pointer + 56] == 0) {
            int dcValue = dataBlockBuffer[pointer] << PASS1_BITS;
            workSpace[pointer + 0] = dcValue;
            workSpace[pointer + 8] = dcValue;
            workSpace[pointer + 16] = dcValue;
            workSpace[pointer + 24] = dcValue;
            workSpace[pointer + 32] = dcValue;
            workSpace[pointer + 40] = dcValue;
            workSpace[pointer + 48] = dcValue;
            workSpace[pointer + 56] = dcValue;
            pointer++;
            continue;
        }
        z2 = dataBlockBuffer[pointer + 16];
        z3 = dataBlockBuffer[pointer + 48];
        z1 = (z2 + z3) * FIX_0_541196100;
        tmp2 = z1 + z3 * -FIX_1_847759065;
        tmp3 = z1 + z2 * FIX_0_765366865;
        z2 = dataBlockBuffer[pointer];
        z3 = dataBlockBuffer[pointer + 32];
        tmp0 = (z2 + z3) << BITS;
        tmp1 = (z2 - z3) << BITS;
        tmp10 = tmp0 + tmp3;
        tmp13 = tmp0 - tmp3;
        tmp11 = tmp1 + tmp2;
        tmp12 = tmp1 - tmp2;
        tmp0 = dataBlockBuffer[pointer + 56];
        tmp1 = dataBlockBuffer[pointer + 40];
        tmp2 = dataBlockBuffer[pointer + 24];
        tmp3 = dataBlockBuffer[pointer + 8];
        z1 = tmp0 + tmp3;
        z2 = tmp1 + tmp2;
        z3 = tmp0 + tmp2;
        z4 = tmp1 + tmp3;
        z5 = (z3 + z4) * FIX_1_175875602;
        tmp0 = tmp0 * FIX_0_298631336;
        tmp1 = tmp1 * FIX_2_053119869;
        tmp2 = tmp2 * FIX_3_072711026;
        tmp3 = tmp3 * FIX_1_501321110;
        z1 = z1 * -FIX_0_899976223;
        z2 = z2 * -FIX_2_562915447;
        z3 = z3 * -FIX_1_961570560;
        z4 = z4 * -FIX_0_390180644;
        z3 += z5;
        z4 += z5;
        tmp0 += z1 + z3;
        tmp1 += z2 + z4;
        tmp2 += z2 + z3;
        tmp3 += z1 + z4;
        workSpace[pointer + 0] = (tmp10 + tmp3 + (1 << F1)) >> F2;
        workSpace[pointer + 56] = (tmp10 - tmp3 + (1 << F1)) >> F2;
        workSpace[pointer + 8] = (tmp11 + tmp2 + (1 << F1)) >> F2;
        workSpace[pointer + 48] = (tmp11 - tmp2 + (1 << F1)) >> F2;
        workSpace[pointer + 16] = (tmp12 + tmp1 + (1 << F1)) >> F2;
        workSpace[pointer + 40] = (tmp12 - tmp1 + (1 << F1)) >> F2;
        workSpace[pointer + 24] = (tmp13 + tmp0 + (1 << F1)) >> F2;
        workSpace[pointer + 32] = (tmp13 - tmp0 + (1 << F1)) >> F2;
        pointer++;
    }
    pointer = 0;
    for (int index = 0; index < 8; index++) {
        z2 = workSpace[pointer + 2];
        z3 = workSpace[pointer + 6];
        z1 = (z2 + z3) * FIX_0_541196100;
        tmp2 = z1 + z3 * -FIX_1_847759065;
        tmp3 = z1 + z2 * FIX_0_765366865;
        tmp0 = (workSpace[pointer + 0] + workSpace[pointer + 4]) << BITS;
        tmp1 = (workSpace[pointer + 0] - workSpace[pointer + 4]) << BITS;
        tmp10 = tmp0 + tmp3;
        tmp13 = tmp0 - tmp3;
        tmp11 = tmp1 + tmp2;
        tmp12 = tmp1 - tmp2;
        tmp0 = workSpace[pointer + 7];
        tmp1 = workSpace[pointer + 5];
        tmp2 = workSpace[pointer + 3];
        tmp3 = workSpace[pointer + 1];
        z1 = tmp0 + tmp3;
        z2 = tmp1 + tmp2;
        z3 = tmp0 + tmp2;
        z4 = tmp1 + tmp3;
        z5 = (z3 + z4) * FIX_1_175875602;
        tmp0 = tmp0 * FIX_0_298631336;
        tmp1 = tmp1 * FIX_2_053119869;
        tmp2 = tmp2 * FIX_3_072711026;
        tmp3 = tmp3 * FIX_1_501321110;
        z1 = z1 * -FIX_0_899976223;
        z2 = z2 * -FIX_2_562915447;
        z3 = z3 * -FIX_1_961570560;
        z4 = z4 * -FIX_0_390180644;
        z3 += z5;
        z4 += z5;
        tmp0 += z1 + z3;
        tmp1 += z2 + z4;
        tmp2 += z2 + z3;
        tmp3 += z1 + z4;
        data[pointer + 0] = (short) ((tmp10 + tmp3) >> F3);
        data[pointer + 7] = (short) ((tmp10 - tmp3) >> F3);
        data[pointer + 1] = (short) ((tmp11 + tmp2) >> F3);
        data[pointer + 6] = (short) ((tmp11 - tmp2) >> F3);
        data[pointer + 2] = (short) ((tmp12 + tmp1) >> F3);
        data[pointer + 5] = (short) ((tmp12 - tmp1) >> F3);
        data[pointer + 3] = (short) ((tmp13 + tmp0) >> F3);
        data[pointer + 4] = (short) ((tmp13 - tmp0) >> F3);
        pointer += 8;
    }
    short[] temp = imageSlice.MacroBlocks[macroBlockIndex].DataBlocks[dataBlockIndex];
    for (int i = 0; i < data.length; i++)
        temp[i] = data[i]; //imageSlice.MacroBlocks[macroBlockIndex].DataBlocks[dataBlockIndex][i] = data[i];
}
Should I combine the basic math if I can, or what would you suggest?
- Most will suggest: find where the performance decrease is, create a smaller problem to display, and re-ask the question with that information. – John Vint, Jul 11, 2011 at 16:55
- Well, the two for loops take nearly the same amount of time, so performance decreases in the method as a whole. I could have split the display into each individual for loop, but then it would be in two separate posts. – Hogan Whitaker, Jul 11, 2011 at 17:11
- Write it in assembler. On a serious note, though, I didn't see any shortcuts... but your completely non-descript variable names mean nothing to me, either, so I didn't exactly spend a ton of time trying to figure out how it was doing what it was doing. – MirroredFate, Jul 11, 2011 at 17:38
- @MirroredFate I understand. I, too, struggle with the variables, but many of them are only for temporary storage of values and are reused. Therefore, there are really no descriptive names for the variables. – Hogan Whitaker, Jul 11, 2011 at 18:09
- Yeah, it's usually a good idea to check for higher-level optimizations, but with names like "pointer", tmp0-3, and so on, nobody will waste his or her time trying to understand the code. I'd implement the same code using SSE in C++ as a performance baseline. – Voo, Jul 11, 2011 at 18:13
3 Answers
I can't see anything obvious. In addition to what Alex said, there are two small suggestions which might possibly help:
1) The long if statement in the first loop has a number of failure conditions. Have you ordered it so the one most likely to fail comes first? With short-circuit evaluation, the earlier you can find a false, the less work there is to do to evaluate the whole expression.
2) You are declaring a lot of variables outside the two for loops, and I can see why you have done that. It is possible that the JVM will be better able to optimise things if you move the declarations inside the two loops so the variables are declared as locally as possible.
For both of these, you need to do some timing runs to see if they make a real difference. You might also want to use a profiler to see where the code spends most of its time.
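As a concrete sketch of suggestion 1), here the zero test is pulled into its own method so the order of the && terms can be experimented with. The names (isAcColumnZero, buf) are hypothetical; dataBlockBuffer in the question would play the role of buf:

```java
public class ShortCircuitDemo {
    // Zero test for the seven AC coefficients of one column (stride 8).
    // With &&, evaluation stops at the first non-zero term, so placing the
    // coefficient most likely to be non-zero first does the least work.
    // Which index that is depends on the data; low-frequency terms are a
    // common guess, which is why buf[p + 8] is tested first here.
    static boolean isAcColumnZero(int[] buf, int p) {
        return buf[p + 8] == 0 && buf[p + 16] == 0 && buf[p + 24] == 0
            && buf[p + 32] == 0 && buf[p + 40] == 0 && buf[p + 48] == 0
            && buf[p + 56] == 0;
    }

    public static void main(String[] args) {
        int[] buf = new int[64];
        System.out.println(isAcColumnZero(buf, 0)); // true: all-zero column
        buf[8] = 1;                                 // non-zero first term: one comparison
        System.out.println(isAcColumnZero(buf, 0)); // false
    }
}
```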
I have one other comment. In lines like:
data[pointer + 7] = (short) ((tmp10 - tmp3) >> F3);
you are using >> rather than >>> to bit-shift a possibly negative number. Are you sure that is what you want to do if tmp3 > tmp10?
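For what it's worth, the two operators differ only on negative inputs, and a quick check shows how. Arithmetic shift (>>) copies the sign bit, which is normally what a signed fixed-point descale wants; logical shift (>>>) fills with zeros:

```java
public class ShiftDemo {
    // Arithmetic right shift: sign-extends, so negative stays negative
    // (and rounds toward negative infinity).
    static int arithmetic(int v, int bits) { return v >> bits; }

    // Logical right shift: zero-fills, so a negative int becomes a large
    // positive one.
    static int logical(int v, int bits) { return v >>> bits; }

    public static void main(String[] args) {
        System.out.println(arithmetic(-64, 3)); // -8
        System.out.println(logical(-64, 3));    // 536870904
    }
}
```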
Without specific evidence of performance problems, I can see a couple of possible improvements (although I'm not sure how much improvement you will gain from them):
- Replace repeated calculations with variables. For example, pointer + 7 appears a few times; you can calculate it once.
- Use System.arraycopy() to copy the array at the end.
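A sketch of the System.arraycopy() suggestion applied to the final copy loop of the question (copyBlock and the local array names are made up here; in the real method, the destination would be the DataBlocks array):

```java
public class CopyDemo {
    // Replaces:  for (int i = 0; i < src.length; i++) dst[i] = src[i];
    // System.arraycopy is an intrinsic and typically copies in bulk.
    static void copyBlock(short[] src, short[] dst) {
        System.arraycopy(src, 0, dst, 0, src.length);
    }

    public static void main(String[] args) {
        short[] data = new short[64];
        for (int i = 0; i < data.length; i++) data[i] = (short) i;
        short[] temp = new short[64];
        copyBlock(data, temp);
        System.out.println(temp[63]); // 63
    }
}
```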
Comment:
- Not just pointer + 7 but all the duplicates; there are quite a few. However, I agree with @Neil that it will probably be taken care of by the JIT and the improvement will not be that big. But I would still try it out.

Like other posters, I think there's very little you can do to optimise this as such.
Just out of desperation, I would possibly try reading all of the values (dataBlockBuffer[x]) into local variables BEFORE the condition, and then changing your condition to:
if ((data0 | data1 | data2 | ...) == 0) ...
That way, you have potentially fewer branches, and where the condition fails (presumably most of the time?) the data is already in local variables, ready to use.
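A minimal sketch of that OR-combined test (method and variable names are invented; buf stands in for dataBlockBuffer). ORing the values together reduces seven comparisons and branches to one:

```java
public class ZeroRowCheck {
    // Reads the seven AC coefficients of a column into locals, then tests
    // them all with a single comparison: the OR of the values is zero
    // if and only if every value is zero.
    static boolean isAcZero(int[] buf, int p) {
        int d1 = buf[p + 8],  d2 = buf[p + 16], d3 = buf[p + 24];
        int d4 = buf[p + 32], d5 = buf[p + 40], d6 = buf[p + 48];
        int d7 = buf[p + 56];
        return (d1 | d2 | d3 | d4 | d5 | d6 | d7) == 0;
    }

    public static void main(String[] args) {
        int[] buf = new int[64];
        System.out.println(isAcZero(buf, 0)); // true: all-zero column
        buf[40] = 5;
        System.out.println(isAcZero(buf, 0)); // false
    }
}
```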
But that said, I don't think you're going to shave very much off.
Another very marginal thing to consider is that wrapping a ByteBuffer around an array allows you to potentially read/write several values at once if you can see a way to do this. (At least when writing 'dcValue' multiple times you could write a long instead in case the JIT compiler doesn't make this optimisation).
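A sketch of the wide-write idea (writePair is an invented name, and this only applies where the target elements are contiguous; the dcValue writes in the question are stride-8, so they would not combine this way without reorganising the layout):

```java
import java.nio.ByteBuffer;

public class WideWriteDemo {
    // Packs the same 32-bit value twice into one 64-bit word and stores it
    // with a single putLong. ByteBuffer's default order is big-endian, so
    // the high half of the long lands at byte offset 0.
    static ByteBuffer writePair(int value) {
        ByteBuffer buf = ByteBuffer.allocate(8);
        long pair = ((long) value << 32) | (value & 0xFFFFFFFFL);
        buf.putLong(0, pair); // writes the ints at byte offsets 0 and 4 at once
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer buf = writePair(42);
        System.out.println(buf.getInt(0)); // 42
        System.out.println(buf.getInt(4)); // 42
    }
}
```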
On a desktop, I might have considered:
- multithreading to process multiple blocks at once (most people have at least a dual core these days, though I guess maybe not on Android);
- looking at the JIT-compiled output and making sure the JIT compiler isn't doing anything overtly silly, and then re-working code accordingly (you can do this with the debug JDK, but don't know if there's an equivalent for Android);
- the possibility of nativising, and then using the GPU/whatever judicious pragmas your C compiler might have to compile to particular SSE/equivalent instructions (I'm not an expert in this and don't know how possible it would be in your case).
I would consider the last two as kind of "extreme" options anyway, with a risk of investing significant time for little gain.
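A sketch of the multithreading option: the 8x8 blocks are independent, so they can be transformed in parallel. Everything here is illustrative (transformBlock is a trivial stand-in for the real inverseTransform, and parallel streams postdate the original question, where an ExecutorService would serve the same purpose):

```java
import java.util.stream.IntStream;

public class ParallelBlocksDemo {
    // Stand-in per-block transform: each block is touched exactly once,
    // and no two threads ever write the same block.
    static void transformBlock(int[][] blocks, int i) {
        for (int j = 0; j < blocks[i].length; j++) blocks[i][j] += 1;
    }

    public static void main(String[] args) {
        int[][] blocks = new int[16][64];
        // Fan the independent blocks out across the common thread pool.
        IntStream.range(0, blocks.length).parallel()
                 .forEach(i -> transformBlock(blocks, i));
        System.out.println(blocks[15][63]); // 1
    }
}
```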