I'm supposed to optimize this method for my team (we're working on a video decoder in Java), but I don't see any good approach to doing so. The function below doesn't seem like it could be sped up by any significant amount, since it consists mostly of simple addition, subtraction, and the like.
void inverseTransform(int macroBlockIndex, int dataBlockIndex) {
    int[] workSpace = new int[64];
    short[] data = new short[64];
    int z1, z2, z3, z4, z5;
    int tmp0, tmp1, tmp2, tmp3;
    int tmp10, tmp11, tmp12, tmp13;
    int pointer = 0;
    for (int index = 8; index > 0; index--) {
        if (dataBlockBuffer[pointer + 8] == 0 && dataBlockBuffer[pointer + 16] == 0
                && dataBlockBuffer[pointer + 24] == 0 && dataBlockBuffer[pointer + 32] == 0
                && dataBlockBuffer[pointer + 40] == 0 && dataBlockBuffer[pointer + 48] == 0
                && dataBlockBuffer[pointer + 56] == 0) {
            int dcValue = dataBlockBuffer[pointer] << PASS1_BITS;
            workSpace[pointer + 0] = dcValue;
            workSpace[pointer + 8] = dcValue;
            workSpace[pointer + 16] = dcValue;
            workSpace[pointer + 24] = dcValue;
            workSpace[pointer + 32] = dcValue;
            workSpace[pointer + 40] = dcValue;
            workSpace[pointer + 48] = dcValue;
            workSpace[pointer + 56] = dcValue;
            pointer++;
            continue;
        }
        z2 = dataBlockBuffer[pointer + 16];
        z3 = dataBlockBuffer[pointer + 48];
        z1 = (z2 + z3) * FIX_0_541196100;
        tmp2 = z1 + z3 * -FIX_1_847759065;
        tmp3 = z1 + z2 * FIX_0_765366865;
        z2 = dataBlockBuffer[pointer];
        z3 = dataBlockBuffer[pointer + 32];
        tmp0 = (z2 + z3) << BITS;
        tmp1 = (z2 - z3) << BITS;
        tmp10 = tmp0 + tmp3;
        tmp13 = tmp0 - tmp3;
        tmp11 = tmp1 + tmp2;
        tmp12 = tmp1 - tmp2;
        tmp0 = dataBlockBuffer[pointer + 56];
        tmp1 = dataBlockBuffer[pointer + 40];
        tmp2 = dataBlockBuffer[pointer + 24];
        tmp3 = dataBlockBuffer[pointer + 8];
        z1 = tmp0 + tmp3;
        z2 = tmp1 + tmp2;
        z3 = tmp0 + tmp2;
        z4 = tmp1 + tmp3;
        z5 = (z3 + z4) * FIX_1_175875602;
        tmp0 = tmp0 * FIX_0_298631336;
        tmp1 = tmp1 * FIX_2_053119869;
        tmp2 = tmp2 * FIX_3_072711026;
        tmp3 = tmp3 * FIX_1_501321110;
        z1 = z1 * -FIX_0_899976223;
        z2 = z2 * -FIX_2_562915447;
        z3 = z3 * -FIX_1_961570560;
        z4 = z4 * -FIX_0_390180644;
        z3 += z5;
        z4 += z5;
        tmp0 += z1 + z3;
        tmp1 += z2 + z4;
        tmp2 += z2 + z3;
        tmp3 += z1 + z4;
        workSpace[pointer + 0] = (tmp10 + tmp3 + (1 << F1)) >> F2;
        workSpace[pointer + 56] = (tmp10 - tmp3 + (1 << F1)) >> F2;
        workSpace[pointer + 8] = (tmp11 + tmp2 + (1 << F1)) >> F2;
        workSpace[pointer + 48] = (tmp11 - tmp2 + (1 << F1)) >> F2;
        workSpace[pointer + 16] = (tmp12 + tmp1 + (1 << F1)) >> F2;
        workSpace[pointer + 40] = (tmp12 - tmp1 + (1 << F1)) >> F2;
        workSpace[pointer + 24] = (tmp13 + tmp0 + (1 << F1)) >> F2;
        workSpace[pointer + 32] = (tmp13 - tmp0 + (1 << F1)) >> F2;
        pointer++;
    }
    pointer = 0;
    for (int index = 0; index < 8; index++) {
        z2 = workSpace[pointer + 2];
        z3 = workSpace[pointer + 6];
        z1 = (z2 + z3) * FIX_0_541196100;
        tmp2 = z1 + z3 * -FIX_1_847759065;
        tmp3 = z1 + z2 * FIX_0_765366865;
        tmp0 = (workSpace[pointer + 0] + workSpace[pointer + 4]) << BITS;
        tmp1 = (workSpace[pointer + 0] - workSpace[pointer + 4]) << BITS;
        tmp10 = tmp0 + tmp3;
        tmp13 = tmp0 - tmp3;
        tmp11 = tmp1 + tmp2;
        tmp12 = tmp1 - tmp2;
        tmp0 = workSpace[pointer + 7];
        tmp1 = workSpace[pointer + 5];
        tmp2 = workSpace[pointer + 3];
        tmp3 = workSpace[pointer + 1];
        z1 = tmp0 + tmp3;
        z2 = tmp1 + tmp2;
        z3 = tmp0 + tmp2;
        z4 = tmp1 + tmp3;
        z5 = (z3 + z4) * FIX_1_175875602;
        tmp0 = tmp0 * FIX_0_298631336;
        tmp1 = tmp1 * FIX_2_053119869;
        tmp2 = tmp2 * FIX_3_072711026;
        tmp3 = tmp3 * FIX_1_501321110;
        z1 = z1 * -FIX_0_899976223;
        z2 = z2 * -FIX_2_562915447;
        z3 = z3 * -FIX_1_961570560;
        z4 = z4 * -FIX_0_390180644;
        z3 += z5;
        z4 += z5;
        tmp0 += z1 + z3;
        tmp1 += z2 + z4;
        tmp2 += z2 + z3;
        tmp3 += z1 + z4;
        data[pointer + 0] = (short) ((tmp10 + tmp3) >> F3);
        data[pointer + 7] = (short) ((tmp10 - tmp3) >> F3);
        data[pointer + 1] = (short) ((tmp11 + tmp2) >> F3);
        data[pointer + 6] = (short) ((tmp11 - tmp2) >> F3);
        data[pointer + 2] = (short) ((tmp12 + tmp1) >> F3);
        data[pointer + 5] = (short) ((tmp12 - tmp1) >> F3);
        data[pointer + 3] = (short) ((tmp13 + tmp0) >> F3);
        data[pointer + 4] = (short) ((tmp13 - tmp0) >> F3);
        pointer += 8;
    }
    short[] temp = imageSlice.MacroBlocks[macroBlockIndex].DataBlocks[dataBlockIndex];
    for (int i = 0; i < data.length; i++)
        temp[i] = data[i]; //imageSlice.MacroBlocks[macroBlockIndex].DataBlocks[dataBlockIndex][i] = data[i];
}
Should I combine the basic math if I can, or what would you suggest?
- Most will suggest: find where the performance decrease is, create a smaller problem to display, and re-ask the question with that information. – John Vint, Jul 11, 2011 at 16:55
- Well, the two for loops take nearly the same amount of time, so performance decreases in the method as a whole. I could have split the display into each individual for loop, but then it would be in two separate posts. – Hogan Whitaker, Jul 11, 2011 at 17:11
- Write it in assembler. On a serious note, though, I didn't see any shortcuts... but your completely non-descript variable names mean nothing to me, either, so I didn't exactly spend a ton of time trying to figure out how it was doing what it was doing. – MirroredFate, Jul 11, 2011 at 17:38
- @MirroredFate I understand. I, too, struggle with the variables, but many of them are only for temporary storage of values and are reused. Therefore, there are really no descriptive names for the variables. – Hogan Whitaker, Jul 11, 2011 at 18:09
- Yeah, it's usually a good idea to check for higher-level optimizations, but with names like "pointer", tmp0-3, and so on, nobody will waste his or her time trying to understand the code. I'd implement the same code using SSE in C++ as a performance baseline. – Voo, Jul 11, 2011 at 18:13
3 Answers
I can't see anything obvious. In addition to what Alex said, there are two small suggestions which might possibly help:
1) The long if statement in the first loop has a number of failure conditions. Have you ordered it so the one most likely to fail comes first? With short-circuit evaluation, the earlier you can find a false, the less work there is to do to evaluate the whole expression.
2) You are declaring a lot of variables outside the two for loops, and I can see why you have done that. It is possible that the JVM will be better able to optimise things if you move the declarations inside the two loops so the variables are declared as locally as possible.
For both of these, you need to do some timing runs to see if they make a real difference. You might also want to use a profiler to see where the code spends most of its time.
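As a concrete sketch of suggestion 1), here the zero test is pulled into its own method so the order of the && terms can be experimented with. The names (isAcColumnZero, buf) are hypothetical; dataBlockBuffer in the question would play the role of buf:

```java
public class ShortCircuitDemo {
    // Zero test for the seven AC coefficients of one column (stride 8).
    // With &&, evaluation stops at the first non-zero term, so placing the
    // coefficient most likely to be non-zero first does the least work.
    // Which index that is depends on the data; low-frequency terms are a
    // common guess, which is why buf[p + 8] is tested first here.
    static boolean isAcColumnZero(int[] buf, int p) {
        return buf[p + 8] == 0 && buf[p + 16] == 0 && buf[p + 24] == 0
            && buf[p + 32] == 0 && buf[p + 40] == 0 && buf[p + 48] == 0
            && buf[p + 56] == 0;
    }

    public static void main(String[] args) {
        int[] buf = new int[64];
        System.out.println(isAcColumnZero(buf, 0)); // true: all-zero column
        buf[8] = 1;                                 // non-zero first term: one comparison
        System.out.println(isAcColumnZero(buf, 0)); // false
    }
}
```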
I have one other comment. In lines like:
data[pointer + 7] = (short) ((tmp10 - tmp3) >> F3);
you are using >> rather than >>> to bit-shift a possibly negative number. Are you sure that is what you want to do if tmp3 > tmp10?
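For what it's worth, the two operators differ only on negative inputs, and a quick check shows how. Arithmetic shift (>>) copies the sign bit, which is normally what a signed fixed-point descale wants; logical shift (>>>) fills with zeros:

```java
public class ShiftDemo {
    // Arithmetic right shift: sign-extends, so negative stays negative
    // (and rounds toward negative infinity).
    static int arithmetic(int v, int bits) { return v >> bits; }

    // Logical right shift: zero-fills, so a negative int becomes a large
    // positive one.
    static int logical(int v, int bits) { return v >>> bits; }

    public static void main(String[] args) {
        System.out.println(arithmetic(-64, 3)); // -8
        System.out.println(logical(-64, 3));    // 536870904
    }
}
```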
Without specific evidence of performance problems, I can see a couple of possible improvements (although I'm not sure how much improvement you will gain from them):
- Replace repeated calculations with variables. For example, pointer + 7 appears a few times; you can calculate it once.
- Use System.arraycopy() to copy the array at the end.
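A sketch of the System.arraycopy() suggestion applied to the final copy loop of the question (copyBlock and the local array names are made up here; in the real method, the destination would be the DataBlocks array):

```java
public class CopyDemo {
    // Replaces:  for (int i = 0; i < src.length; i++) dst[i] = src[i];
    // System.arraycopy is an intrinsic and typically copies in bulk.
    static void copyBlock(short[] src, short[] dst) {
        System.arraycopy(src, 0, dst, 0, src.length);
    }

    public static void main(String[] args) {
        short[] data = new short[64];
        for (int i = 0; i < data.length; i++) data[i] = (short) i;
        short[] temp = new short[64];
        copyBlock(data, temp);
        System.out.println(temp[63]); // 63
    }
}
```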
Comment:
- Not just pointer + 7 but all the duplicates; there are quite a few. However, I agree with @Neil that it will probably be taken care of by the JIT and the improvement will not be that big. But I would still try it out.

Like other posters, I think there's very little you can do to optimise this as such.
Just out of desperation, I would possibly try reading all of the values (dataBlockBuffer[x]) into local variables BEFORE the condition, and then changing your condition to:
if ((data0 | data1 | data2 | ...) == 0) ...
That way, you have potentially fewer branches, and where the condition fails (presumably most of the time?) the data is already in local variables, ready to use.
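A minimal sketch of that OR-combined test (method and variable names are invented; buf stands in for dataBlockBuffer). ORing the values together reduces seven comparisons and branches to one:

```java
public class ZeroRowCheck {
    // Reads the seven AC coefficients of a column into locals, then tests
    // them all with a single comparison: the OR of the values is zero
    // if and only if every value is zero.
    static boolean isAcZero(int[] buf, int p) {
        int d1 = buf[p + 8],  d2 = buf[p + 16], d3 = buf[p + 24];
        int d4 = buf[p + 32], d5 = buf[p + 40], d6 = buf[p + 48];
        int d7 = buf[p + 56];
        return (d1 | d2 | d3 | d4 | d5 | d6 | d7) == 0;
    }

    public static void main(String[] args) {
        int[] buf = new int[64];
        System.out.println(isAcZero(buf, 0)); // true: all-zero column
        buf[40] = 5;
        System.out.println(isAcZero(buf, 0)); // false
    }
}
```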
But that said, I don't think you're going to shave very much off.
Another very marginal thing to consider is that wrapping a ByteBuffer around an array allows you to potentially read/write several values at once if you can see a way to do this. (At least when writing 'dcValue' multiple times you could write a long instead in case the JIT compiler doesn't make this optimisation).
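A sketch of the wide-write idea (writePair is an invented name, and this only applies where the target elements are contiguous; the dcValue writes in the question are stride-8, so they would not combine this way without reorganising the layout):

```java
import java.nio.ByteBuffer;

public class WideWriteDemo {
    // Packs the same 32-bit value twice into one 64-bit word and stores it
    // with a single putLong. ByteBuffer's default order is big-endian, so
    // the high half of the long lands at byte offset 0.
    static ByteBuffer writePair(int value) {
        ByteBuffer buf = ByteBuffer.allocate(8);
        long pair = ((long) value << 32) | (value & 0xFFFFFFFFL);
        buf.putLong(0, pair); // writes the ints at byte offsets 0 and 4 at once
        return buf;
    }

    public static void main(String[] args) {
        ByteBuffer buf = writePair(42);
        System.out.println(buf.getInt(0)); // 42
        System.out.println(buf.getInt(4)); // 42
    }
}
```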
On a desktop, I might have considered:
- multithreading to process multiple blocks at once (most people have at least a dual core these days, though I guess maybe not on Android);
- looking at the JIT-compiled output and making sure the JIT compiler isn't doing anything overtly silly, and then re-working code accordingly (you can do this with the debug JDK, but don't know if there's an equivalent for Android);
- the possibility of nativising, and then using the GPU/whatever judicious pragmas your C compiler might have to compile to particular SSE/equivalent instructions (I'm not an expert in this and don't know how possible it would be in your case).
I would consider the last two as kind of "extreme" options anyway, with a risk of investing significant time for little gain.
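A sketch of the multithreading option: the 8x8 blocks are independent, so they can be transformed in parallel. Everything here is illustrative (transformBlock is a trivial stand-in for the real inverseTransform, and parallel streams postdate the original question, where an ExecutorService would serve the same purpose):

```java
import java.util.stream.IntStream;

public class ParallelBlocksDemo {
    // Stand-in per-block transform: each block is touched exactly once,
    // and no two threads ever write the same block.
    static void transformBlock(int[][] blocks, int i) {
        for (int j = 0; j < blocks[i].length; j++) blocks[i][j] += 1;
    }

    public static void main(String[] args) {
        int[][] blocks = new int[16][64];
        // Fan the independent blocks out across the common thread pool.
        IntStream.range(0, blocks.length).parallel()
                 .forEach(i -> transformBlock(blocks, i));
        System.out.println(blocks[15][63]); // 1
    }
}
```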