Return to Answer

replaced http://stackoverflow.com/ with https://stackoverflow.com/

Source Link

edited May 23, 2017 at 12:40

Community Bot

edited May 23, 2017 at 12:40

Community Bot

You can simplify your x/y loop to run in one dimension instead of two-- for ( i = 0; i < y * x * c; i += 4 ). Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.
If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.
You could try resizing resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.
Keep a copy of the last bitmap and its totals. Use REPE CMPSD (yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.
Do the entire scan in assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:
```
 MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
 MOV EDX, Scan0 ;EDX is our data pointer
 SUB BX, BX ;Set BX to 0 for later
 Loop:
 LODSL ;Load EAX from array, increment pointer
 SHRL 8, EAX ;Dump the least eight bits
 ADDB GreenTotal, AL ;Add least 8 bits to green total
 ADCW GreenTotal+1,BX ;Carry the 1
 SHRL 8, EAX ;Shift right 8 more bits
 ADDB BlueTotal, AL ;Add least 8 bits to blue total
 ADCW BlueTotal+1, BX ;Carry the 1
 SHRL 8, EAX ;Shift right 8 more bits
 ADDB RedTotal, AL ;Add least 8 bits to red total
 ADCW RedTotal+1, BX ;Carry the 1
 LOOPNZ Loop ;Decrement CX and keep going until it is zero
```
If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of if conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.

You can simplify your x/y loop to run in one dimension instead of two-- for ( i = 0; i < y * x * c; i += 4 ). Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.
If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.
You could try resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.
Keep a copy of the last bitmap and its totals. Use REPE CMPSD (yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.
Do the entire scan in assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:
```
 MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
 MOV EDX, Scan0 ;EDX is our data pointer
 SUB BX, BX ;Set BX to 0 for later
 Loop:
 LODSL ;Load EAX from array, increment pointer
 SHRL 8, EAX ;Dump the least eight bits
 ADDB GreenTotal, AL ;Add least 8 bits to green total
 ADCW GreenTotal+1,BX ;Carry the 1
 SHRL 8, EAX ;Shift right 8 more bits
 ADDB BlueTotal, AL ;Add least 8 bits to blue total
 ADCW BlueTotal+1, BX ;Carry the 1
 SHRL 8, EAX ;Shift right 8 more bits
 ADDB RedTotal, AL ;Add least 8 bits to red total
 ADCW RedTotal+1, BX ;Carry the 1
 LOOPNZ Loop ;Decrement CX and keep going until it is zero
```
If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of if conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.

You can simplify your x/y loop to run in one dimension instead of two-- for ( i = 0; i < y * x * c; i += 4 ). Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.
If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.
You could try resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.
Keep a copy of the last bitmap and its totals. Use REPE CMPSD (yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.
Do the entire scan in assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:
```
 MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
 MOV EDX, Scan0 ;EDX is our data pointer
 SUB BX, BX ;Set BX to 0 for later
 Loop:
 LODSL ;Load EAX from array, increment pointer
 SHRL 8, EAX ;Dump the least eight bits
 ADDB GreenTotal, AL ;Add least 8 bits to green total
 ADCW GreenTotal+1,BX ;Carry the 1
 SHRL 8, EAX ;Shift right 8 more bits
 ADDB BlueTotal, AL ;Add least 8 bits to blue total
 ADCW BlueTotal+1, BX ;Carry the 1
 SHRL 8, EAX ;Shift right 8 more bits
 ADDB RedTotal, AL ;Add least 8 bits to red total
 ADCW RedTotal+1, BX ;Carry the 1
 LOOPNZ Loop ;Decrement CX and keep going until it is zero
```
If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of if conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.

added 72 characters in body

Source Link

edited Mar 14, 2017 at 17:50

John Wu

edited Mar 14, 2017 at 17:50

John Wu

You can simplify your x/y loop to run in one dimension instead of two-- for ( i = 0; i < y * x * c; i += 4 ). Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.
If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.
You could try resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.
Keep a copy of the last bitmap and its totals. Use REPE CMPSD (yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.
Do the entire scan in assemblyin assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:
```
 MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
 MOV EDX, Scan0 ;EDX is our data pointer
 SUB BX, BX ;Set BX to 0 for later
 Loop:
 LODSL ;Load EAX from array, increment pointer
 SHRL 8, EAX ;Dump the least eight bits
 ADDB GreenTotal, AL ;Add least 8 bits to green total
 ADCW GreenTotal+1,BX ;Carry the 1
 SHRL 8, EAX ;Shift right 8 more bits
 ADDB BlueTotal, AL ;Add least 8 bits to blue total
 ADCW BlueTotal+1, BX ;Carry the 1
 SHRL 8, EAX ;Shift right 8 more bits
 ADDB RedTotal, AL ;Add least 8 bits to red total
 ADCW RedTotal+1, BX ;Carry the 1
 LOOPNZ Loop ;Decrement CX and keep going until it is zero
```
If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of if conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.

You can simplify your x/y loop to run in one dimension instead of two-- for ( i = 0; i < y * x * c; i += 4 ). Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.
If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.
You could try resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.
Keep a copy of the last bitmap and its totals. Use REPE CMPSD (yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.
Do the entire scan in assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:
```
 MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
 MOV EDX, Scan0 ;EDX is our data pointer
 SUB BX, BX ;Set BX to 0 for later
 Loop:
 LODSL ;Load EAX from array, increment pointer
 SHRL 8, EAX ;Dump the least eight bits
 ADDB GreenTotal, AL ;Add least 8 bits to green total
 ADCW GreenTotal+1,BX ;Carry the 1
 SHRL 8, EAX ;Shift right 8 more bits
 ADDB BlueTotal, AL ;Add least 8 bits to blue total
 ADCW BlueTotal+1, BX ;Carry the 1
 SHRL 8, EAX ;Shift right 8 more bits
 ADDB RedTotal, AL ;Add least 8 bits to red total
 ADCW RedTotal+1, BX ;Carry the 1
 LOOPNZ Loop ;Decrement CX and keep going until it is zero
```
If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of if conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.

You can simplify your x/y loop to run in one dimension instead of two-- for ( i = 0; i < y * x * c; i += 4 ). Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.
If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.
You could try resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.
Keep a copy of the last bitmap and its totals. Use REPE CMPSD (yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.
Do the entire scan in assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:
```
 MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
 MOV EDX, Scan0 ;EDX is our data pointer
 SUB BX, BX ;Set BX to 0 for later
 Loop:
 LODSL ;Load EAX from array, increment pointer
 SHRL 8, EAX ;Dump the least eight bits
 ADDB GreenTotal, AL ;Add least 8 bits to green total
 ADCW GreenTotal+1,BX ;Carry the 1
 SHRL 8, EAX ;Shift right 8 more bits
 ADDB BlueTotal, AL ;Add least 8 bits to blue total
 ADCW BlueTotal+1, BX ;Carry the 1
 SHRL 8, EAX ;Shift right 8 more bits
 ADDB RedTotal, AL ;Add least 8 bits to red total
 ADCW RedTotal+1, BX ;Carry the 1
 LOOPNZ Loop ;Decrement CX and keep going until it is zero
```
If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of if conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.

deleted 18 characters in body

Source Link

edited Mar 14, 2017 at 10:22

John Wu

edited Mar 14, 2017 at 10:22

John Wu

You can simplify your x/y loop to run in one dimension instead of two-- for ( i = 0; i < y * x * c; i += 4 ). Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.
If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.
You could try resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.
Keep a copy of the last bitmap and its totals. Use REPE CMPSD (yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.
Do the entire scan in assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:
```
 MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
 MOV EDX, Scan0 ;EDX is our data pointer
 SUB BX, BX ;Set BX to 0 for later
 Loop:
 LODSL ;Load EAX from array, increment pointer
 SHRL 8, EAX ;Dump the least eight bits
 ADDB GreenTotal, AL ;Add least 8 bits to green total
 ADCW GreenTotal+1,BX ;Carry the 1
 SHRL 8, EAX ;Shift right 8 more bits
 ADDB BlueTotal, AL ;Add least 8 bits to blue total
 ADCW BlueTotal+1, BX ;Carry the 1
 SHRL 8, EAX ;Shift right 8 more bits
 ADDB RedTotal, AL ;Add least 8 bits to red total
 ADCW RedTotal+1, BX ;Carry the 1
 LOOPNZ Loop ;Decrement CX and keep going until it is zero
```
If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of if conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.

You can simplify your x/y loop to run in one dimension instead of two-- for ( i = 0; i < y * x * c; i += 4 ). Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.
If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.
You could try resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.
Keep a copy of the last bitmap and its totals. Use REPE CMPSD (yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.
Do the entire scan in assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:
```
 MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
 MOV EDX, Scan0 ;EDX is our data pointer
 SUB BX, BX ;Set BX to 0 for later
 Loop:
 LODSL ;Load EAX from array, increment pointer
 SHRL 8, EAX ;Dump the least eight bits
 ADDB GreenTotal, AL ;Add least 8 bits to green total
 ADCW GreenTotal+1,BX ;Carry the 1
 SHRL 8, EAX ;Shift right 8 more bits
 ADDB BlueTotal, AL ;Add least 8 bits to blue total
 ADCW BlueTotal+1, BX ;Carry the 1
 SHRL 8, EAX ;Shift right 8 more bits
 ADDB RedTotal, AL ;Add least 8 bits to red total
 ADCW RedTotal+1, BX ;Carry the 1
 LOOPNZ Loop
```
If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of if conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.

You can simplify your x/y loop to run in one dimension instead of two-- for ( i = 0; i < y * x * c; i += 4 ). Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.
If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.
You could try resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.
Keep a copy of the last bitmap and its totals. Use REPE CMPSD (yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.
Do the entire scan in assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:
```
 MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
 MOV EDX, Scan0 ;EDX is our data pointer
 SUB BX, BX ;Set BX to 0 for later
 Loop:
 LODSL ;Load EAX from array, increment pointer
 SHRL 8, EAX ;Dump the least eight bits
 ADDB GreenTotal, AL ;Add least 8 bits to green total
 ADCW GreenTotal+1,BX ;Carry the 1
 SHRL 8, EAX ;Shift right 8 more bits
 ADDB BlueTotal, AL ;Add least 8 bits to blue total
 ADCW BlueTotal+1, BX ;Carry the 1
 SHRL 8, EAX ;Shift right 8 more bits
 ADDB RedTotal, AL ;Add least 8 bits to red total
 ADCW RedTotal+1, BX ;Carry the 1
 LOOPNZ Loop ;Decrement CX and keep going until it is zero
```
If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of if conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.