1

I'm supposed to try and optimize this method for my team, working on a video decoder in Java, though I don't see any good approach to doing so. The function, below, doesn't seem like it could be speed up in any significant amount, since it contains mostly simple addition/subtraction/etc.

void inverseTransform(int macroBlockIndex, int dataBlockIndex) {
 int[] workSpace = new int[64];
 short[] data = new short[64];
 int z1, z2, z3, z4, z5;
 int tmp0, tmp1, tmp2, tmp3;
 int tmp10, tmp11, tmp12, tmp13;
 int pointer = 0;
 for (int index = 8; index > 0; index--) {
 if (dataBlockBuffer[pointer + 8] == 0 && dataBlockBuffer[pointer + 16] == 0 && dataBlockBuffer[pointer + 24] == 0 && dataBlockBuffer[pointer + 32] == 0 && dataBlockBuffer[pointer + 40] == 0 && dataBlockBuffer[pointer + 48] == 0 && dataBlockBuffer[pointer + 56] == 0) {
 int dcValue = dataBlockBuffer[pointer] << PASS1_BITS;
 workSpace[pointer + 0] = dcValue;
 workSpace[pointer + 8] = dcValue;
 workSpace[pointer + 16] = dcValue;
 workSpace[pointer + 24] = dcValue;
 workSpace[pointer + 32] = dcValue;
 workSpace[pointer + 40] = dcValue;
 workSpace[pointer + 48] = dcValue;
 workSpace[pointer + 56] = dcValue;
 pointer++;
 continue;
 }
 z2 = dataBlockBuffer[pointer + 16];
 z3 = dataBlockBuffer[pointer + 48];
 z1 = (z2 + z3) * FIX_0_541196100;
 tmp2 = z1 + z3 * -FIX_1_847759065;
 tmp3 = z1 + z2 * FIX_0_765366865;
 z2 = dataBlockBuffer[pointer];
 z3 = dataBlockBuffer[pointer + 32];
 tmp0 = (z2 + z3) << BITS;
 tmp1 = (z2 - z3) << BITS;
 tmp10 = tmp0 + tmp3;
 tmp13 = tmp0 - tmp3;
 tmp11 = tmp1 + tmp2;
 tmp12 = tmp1 - tmp2;
 tmp0 = dataBlockBuffer[pointer + 56];
 tmp1 = dataBlockBuffer[pointer + 40];
 tmp2 = dataBlockBuffer[pointer + 24];
 tmp3 = dataBlockBuffer[pointer + 8];
 z1 = tmp0 + tmp3;
 z2 = tmp1 + tmp2;
 z3 = tmp0 + tmp2;
 z4 = tmp1 + tmp3;
 z5 = (z3 + z4) * FIX_1_175875602;
 tmp0 = tmp0 * FIX_0_298631336;
 tmp1 = tmp1 * FIX_2_053119869;
 tmp2 = tmp2 * FIX_3_072711026;
 tmp3 = tmp3 * FIX_1_501321110;
 z1 = z1 * -FIX_0_899976223;
 z2 = z2 * -FIX_2_562915447;
 z3 = z3 * -FIX_1_961570560;
 z4 = z4 * -FIX_0_390180644;
 z3 += z5;
 z4 += z5;
 tmp0 += z1 + z3;
 tmp1 += z2 + z4;
 tmp2 += z2 + z3;
 tmp3 += z1 + z4;
 workSpace[pointer + 0] = ((tmp10 + tmp3 + (1 << F1)) >> F2);
 workSpace[pointer + 56] = ((tmp10 - tmp3 + (1 << F1)) >> F2);
 workSpace[pointer + 8] = ((tmp11 + tmp2 + (1 << F1)) >> F2);
 workSpace[pointer + 48] = ((tmp11 - tmp2 + (1 << F1)) >> F2);
 workSpace[pointer + 16] = ((tmp12 + tmp1 + (1 << F1)) >> F2);
 workSpace[pointer + 40] = ((tmp12 - tmp1 + (1 << F1)) >> F2);
 workSpace[pointer + 24] = ((tmp13 + tmp0 + (1 << F1)) >> F2);
 workSpace[pointer + 32] = ((tmp13 - tmp0 + (1 << F1)) >> F2);
 pointer++;
 }
 pointer = 0;
 for (int index = 0; index < 8; index++) {
 z2 = workSpace[pointer + 2];
 z3 = workSpace[pointer + 6];
 z1 = (z2 + z3) * FIX_0_541196100;
 tmp2 = z1 + z3 * -FIX_1_847759065;
 tmp3 = z1 + z2 * FIX_0_765366865;
 tmp0 = (workSpace[pointer + 0] + workSpace[pointer + 4]) << BITS;
 tmp1 = (workSpace[pointer + 0] - workSpace[pointer + 4]) << BITS;
 tmp10 = tmp0 + tmp3;
 tmp13 = tmp0 - tmp3;
 tmp11 = tmp1 + tmp2;
 tmp12 = tmp1 - tmp2;
 tmp0 = workSpace[pointer + 7];
 tmp1 = workSpace[pointer + 5];
 tmp2 = workSpace[pointer + 3];
 tmp3 = workSpace[pointer + 1];
 z1 = tmp0 + tmp3;
 z2 = tmp1 + tmp2;
 z3 = tmp0 + tmp2;
 z4 = tmp1 + tmp3;
 z5 = (z3 + z4) * FIX_1_175875602;
 tmp0 = tmp0 * FIX_0_298631336;
 tmp1 = tmp1 * FIX_2_053119869;
 tmp2 = tmp2 * FIX_3_072711026;
 tmp3 = tmp3 * FIX_1_501321110;
 z1 = z1 * -FIX_0_899976223;
 z2 = z2 * -FIX_2_562915447;
 z3 = z3 * -FIX_1_961570560;
 z4 = z4 * -FIX_0_390180644;
 z3 += z5;
 z4 += z5;
 tmp0 += z1 + z3;
 tmp1 += z2 + z4;
 tmp2 += z2 + z3;
 tmp3 += z1 + z4;
 data[pointer + 0] = (short) ((tmp10 + tmp3) >> F3);
 data[pointer + 7] = (short) ((tmp10 - tmp3) >> F3);
 data[pointer + 1] = (short) ((tmp11 + tmp2) >> F3);
 data[pointer + 6] = (short) ((tmp11 - tmp2) >> F3);
 data[pointer + 2] = (short) ((tmp12 + tmp1) >> F3);
 data[pointer + 5] = (short) ((tmp12 - tmp1) >> F3);
 data[pointer + 3] = (short) ((tmp13 + tmp0) >> F3);
 data[pointer + 4] = (short) ((tmp13 - tmp0) >> F3);
 pointer += 8;
 }
 short[] temp = imageSlice.MacroBlocks[macroBlockIndex].DataBlocks[dataBlockIndex];
 for (int i = 0; i < data.length; i++)
 temp[i] = data[i]; //imageSlice.MacroBlocks[macroBlockIndex].DataBlocks[dataBlockIndex][i] = data[i];
}

Should I combine the basic math if I can, or what would you suggest?

asked Jul 11, 2011 at 16:52
5
  • 1
    Most will suggest: Find where performance decrease is, create a smaller problem to display and re ask the question with that information Commented Jul 11, 2011 at 16:55
  • Well, the two for loops take nearly the same amount of time so performance decreases in the method as a whole. I could have split the display into each individual for loop, but then it would be in two separate posts. Commented Jul 11, 2011 at 17:11
  • 1
    Write it in assembler. On a serious note, though, I didn't see any shortcuts... but your completely non-descript variable names mean nothing to me, either, so I didn't exactly spend a ton of time trying to figure out how it was doing what it was doing. Commented Jul 11, 2011 at 17:38
  • @MirroredFate I understand. I, too, struggle with the variables, but many of them are only for temporary storage of values and are reused. Therefore, there are really no descriptive names for the variables. Commented Jul 11, 2011 at 18:09
  • Yeah it's usually a good idea to check for higher level optimizations, but with names like "pointer", tmp0-3 and so on nobody will waste his or her time to try and understand the code. I'd implement the same code using SSE in C++ as a baseline performance Commented Jul 11, 2011 at 18:13

3 Answers 3

1

I can't see anything obvious. In addition to what Alex said there are two small suggestions which might possibly help:

1) The long if statement in the first loop has a number of failure conditions. Have you ordered it so the one most likely to fail comes first? With short-circuit evaluation, the earlier you can find a false the less work there is to do to evaluate the whole expression.

2) You are declaring a lot of variables outside the two for-loops, and I can see why you have done that. It is possible that the JVM will be more able to optimise things if you move the declarations inside the two loops so the variables are declared as locally as possible.

For both of these you need to do some timing runs to see if they make a real difference. You might also want to use a profiler to see where the code spends most of its time.

I have one other comment. In lines like:

data[pointer + 7] = (short) ((tmp10 - tmp3) >> F3);

you are using >> rather than >>> to bitshift a possibly negative number. Are you sure that is what you want to do if tmp3> tmp10?

answered Jul 11, 2011 at 17:54

1 Comment

Thank you for your advice. I will look more into it.
1

Without specific evidence of performance problems I can see couple of possible improvements (although not sure how much improvement you will gain from them).

  1. Replace repeating calculations with variables. For example pointer+7 repeats few times. You can calculate it once.
  2. Use System.arraycopy() to copy array at the end.
answered Jul 11, 2011 at 17:04

5 Comments

By calculating pointer+7 only once will not speed the code enough to make a big difference, but thank you for the System.arraycopy().
What kind of evidence do you need? The first for loop takes approximately 730 million nanoseconds, while the second for loop takes about 790 million, with dataBlockBuffer set as an array of size 64 with each spot set to the value of "1".
The JIT compiler should do that anyway!
@Hogan I didn't mean to only replace pointer+7 but all duplicates. There are quire few. However, I agree with @Neil that it will probably be taken care of by JIT and improvement will not be that big. But I would still try it out.
@Alex Actually, replacing the duplicates will create more variables and take a little more time because the variables will then have to be updated for each loop within the two for loops.
1

Like other posters, I think there's very little you can do to optimise this as such.

Just out of desparation, I would possibly try reading all of the values (dataBlockBuffer[x]) into local variables BEFORE the condition, and then changing your condition to:

if ((data0 | data1 | data2 | ...) == 0) ...

That way, you have potentially fewer branches and where the condition fails (presumably most of the time?) the data is already in local variables ready.

But that said, I don't think you're going to shave very much off.

Another very marginal thing to consider is that wrapping a ByteBuffer around an array allows you to potentially read/write several values at once if you can see a way to do this. (At least when writing 'dcValue' multiple times you could write a long instead in case the JIT compiler doesn't make this optimisation).

On a desktop, I might have considered:

  • multithreading to process multiple blocks at once (most people have at least dual core these days, but guess maybe not on Android);
  • looking at the JIT-compiled output and making sure the JIT compiler isn't doing anything overtly silly, and then re-working code accordingly (you can do this with the debug JDK, but don't know if there's an equivalent for Android);
  • the possibility of nativising, and then using the GPU/whatever judicious pragmas your C compiler might have to compile to particular SSE/equivalent instructions (I'm not an expert in this and don't know how possible it would be in your case).

I would consider the last two as kind of "extreme" options anyway-- with a risk of investing significant time for little gain.

answered Jul 11, 2011 at 18:26

Comments

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.