Skip to main content
Code Review

Return to Answer

replaced http://stackoverflow.com/ with https://stackoverflow.com/
Source Link
  • You can simplify your x/y loop to run in one dimension instead of two-- for ( i = 0; i < y * x * c; i += 4 ). Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.

  • If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.

  • You could try resizing resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.

  • Keep a copy of the last bitmap and its totals. Use REPE CMPSD (yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.

  • Do the entire scan in assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:

     MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
     MOV EDX, Scan0 ;EDX is our data pointer
     SUB BX, BX ;Set BX to 0 for later
     Loop:
     LODSL ;Load EAX from array, increment pointer
     SHRL 8, EAX ;Dump the least eight bits
     ADDB GreenTotal, AL ;Add least 8 bits to green total
     ADCW GreenTotal+1,BX ;Carry the 1
     SHRL 8, EAX ;Shift right 8 more bits
     ADDB BlueTotal, AL ;Add least 8 bits to blue total
     ADCW BlueTotal+1, BX ;Carry the 1
     SHRL 8, EAX ;Shift right 8 more bits
     ADDB RedTotal, AL ;Add least 8 bits to red total
     ADCW RedTotal+1, BX ;Carry the 1
     LOOPNZ Loop ;Decrement CX and keep going until it is zero
    

    If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of if conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.

  • You can simplify your x/y loop to run in one dimension instead of two-- for ( i = 0; i < y * x * c; i += 4 ). Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.

  • If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.

  • You could try resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.

  • Keep a copy of the last bitmap and its totals. Use REPE CMPSD (yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.

  • Do the entire scan in assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:

     MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
     MOV EDX, Scan0 ;EDX is our data pointer
     SUB BX, BX ;Set BX to 0 for later
     Loop:
     LODSL ;Load EAX from array, increment pointer
     SHRL 8, EAX ;Dump the least eight bits
     ADDB GreenTotal, AL ;Add least 8 bits to green total
     ADCW GreenTotal+1,BX ;Carry the 1
     SHRL 8, EAX ;Shift right 8 more bits
     ADDB BlueTotal, AL ;Add least 8 bits to blue total
     ADCW BlueTotal+1, BX ;Carry the 1
     SHRL 8, EAX ;Shift right 8 more bits
     ADDB RedTotal, AL ;Add least 8 bits to red total
     ADCW RedTotal+1, BX ;Carry the 1
     LOOPNZ Loop ;Decrement CX and keep going until it is zero
    

    If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of if conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.

  • You can simplify your x/y loop to run in one dimension instead of two-- for ( i = 0; i < y * x * c; i += 4 ). Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.

  • If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.

  • You could try resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.

  • Keep a copy of the last bitmap and its totals. Use REPE CMPSD (yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.

  • Do the entire scan in assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:

     MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
     MOV EDX, Scan0 ;EDX is our data pointer
     SUB BX, BX ;Set BX to 0 for later
     Loop:
     LODSL ;Load EAX from array, increment pointer
     SHRL 8, EAX ;Dump the least eight bits
     ADDB GreenTotal, AL ;Add least 8 bits to green total
     ADCW GreenTotal+1,BX ;Carry the 1
     SHRL 8, EAX ;Shift right 8 more bits
     ADDB BlueTotal, AL ;Add least 8 bits to blue total
     ADCW BlueTotal+1, BX ;Carry the 1
     SHRL 8, EAX ;Shift right 8 more bits
     ADDB RedTotal, AL ;Add least 8 bits to red total
     ADCW RedTotal+1, BX ;Carry the 1
     LOOPNZ Loop ;Decrement CX and keep going until it is zero
    

    If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of if conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.

added 72 characters in body
Source Link
John Wu
  • 641
  • 1
  • 4
  • 8
  • You can simplify your x/y loop to run in one dimension instead of two-- for ( i = 0; i < y * x * c; i += 4 ). Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.

  • If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.

  • You could try resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.

  • Keep a copy of the last bitmap and its totals. Use REPE CMPSD (yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.

  • Do the entire scan in assemblyin assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:

     MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
     MOV EDX, Scan0 ;EDX is our data pointer
     SUB BX, BX ;Set BX to 0 for later
     Loop:
     LODSL ;Load EAX from array, increment pointer
     SHRL 8, EAX ;Dump the least eight bits
     ADDB GreenTotal, AL ;Add least 8 bits to green total
     ADCW GreenTotal+1,BX ;Carry the 1
     SHRL 8, EAX ;Shift right 8 more bits
     ADDB BlueTotal, AL ;Add least 8 bits to blue total
     ADCW BlueTotal+1, BX ;Carry the 1
     SHRL 8, EAX ;Shift right 8 more bits
     ADDB RedTotal, AL ;Add least 8 bits to red total
     ADCW RedTotal+1, BX ;Carry the 1
     LOOPNZ Loop ;Decrement CX and keep going until it is zero
    

    If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of if conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.

  • You can simplify your x/y loop to run in one dimension instead of two-- for ( i = 0; i < y * x * c; i += 4 ). Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.

  • If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.

  • You could try resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.

  • Keep a copy of the last bitmap and its totals. Use REPE CMPSD (yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.

  • Do the entire scan in assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:

     MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
     MOV EDX, Scan0 ;EDX is our data pointer
     SUB BX, BX ;Set BX to 0 for later
     Loop:
     LODSL ;Load EAX from array, increment pointer
     SHRL 8, EAX ;Dump the least eight bits
     ADDB GreenTotal, AL ;Add least 8 bits to green total
     ADCW GreenTotal+1,BX ;Carry the 1
     SHRL 8, EAX ;Shift right 8 more bits
     ADDB BlueTotal, AL ;Add least 8 bits to blue total
     ADCW BlueTotal+1, BX ;Carry the 1
     SHRL 8, EAX ;Shift right 8 more bits
     ADDB RedTotal, AL ;Add least 8 bits to red total
     ADCW RedTotal+1, BX ;Carry the 1
     LOOPNZ Loop ;Decrement CX and keep going until it is zero
    

    If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of if conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.

  • You can simplify your x/y loop to run in one dimension instead of two-- for ( i = 0; i < y * x * c; i += 4 ). Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.

  • If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.

  • You could try resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.

  • Keep a copy of the last bitmap and its totals. Use REPE CMPSD (yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.

  • Do the entire scan in assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:

     MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
     MOV EDX, Scan0 ;EDX is our data pointer
     SUB BX, BX ;Set BX to 0 for later
     Loop:
     LODSL ;Load EAX from array, increment pointer
     SHRL 8, EAX ;Dump the least eight bits
     ADDB GreenTotal, AL ;Add least 8 bits to green total
     ADCW GreenTotal+1,BX ;Carry the 1
     SHRL 8, EAX ;Shift right 8 more bits
     ADDB BlueTotal, AL ;Add least 8 bits to blue total
     ADCW BlueTotal+1, BX ;Carry the 1
     SHRL 8, EAX ;Shift right 8 more bits
     ADDB RedTotal, AL ;Add least 8 bits to red total
     ADCW RedTotal+1, BX ;Carry the 1
     LOOPNZ Loop ;Decrement CX and keep going until it is zero
    

    If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of if conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.

deleted 18 characters in body
Source Link
John Wu
  • 641
  • 1
  • 4
  • 8
  • You can simplify your x/y loop to run in one dimension instead of two-- for ( i = 0; i < y * x * c; i += 4 ). Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.

  • If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.

  • You could try resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.

  • Keep a copy of the last bitmap and its totals. Use REPE CMPSD (yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.

  • Do the entire scan in assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:

     MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
     MOV EDX, Scan0 ;EDX is our data pointer
     SUB BX, BX ;Set BX to 0 for later
     Loop:
     LODSL ;Load EAX from array, increment pointer
     SHRL 8, EAX ;Dump the least eight bits
     ADDB GreenTotal, AL ;Add least 8 bits to green total
     ADCW GreenTotal+1,BX ;Carry the 1
     SHRL 8, EAX ;Shift right 8 more bits
     ADDB BlueTotal, AL ;Add least 8 bits to blue total
     ADCW BlueTotal+1, BX ;Carry the 1
     SHRL 8, EAX ;Shift right 8 more bits
     ADDB RedTotal, AL ;Add least 8 bits to red total
     ADCW RedTotal+1, BX ;Carry the 1
     LOOPNZ Loop ;Decrement CX and keep going until it is zero
    

    If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of if conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.

  • You can simplify your x/y loop to run in one dimension instead of two-- for ( i = 0; i < y * x * c; i += 4 ). Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.

  • If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.

  • You could try resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.

  • Keep a copy of the last bitmap and its totals. Use REPE CMPSD (yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.

  • Do the entire scan in assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:

     MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
     MOV EDX, Scan0 ;EDX is our data pointer
     SUB BX, BX ;Set BX to 0 for later
     Loop:
     LODSL ;Load EAX from array, increment pointer
     SHRL 8, EAX ;Dump the least eight bits
     ADDB GreenTotal, AL ;Add least 8 bits to green total
     ADCW GreenTotal+1,BX ;Carry the 1
     SHRL 8, EAX ;Shift right 8 more bits
     ADDB BlueTotal, AL ;Add least 8 bits to blue total
     ADCW BlueTotal+1, BX ;Carry the 1
     SHRL 8, EAX ;Shift right 8 more bits
     ADDB RedTotal, AL ;Add least 8 bits to red total
     ADCW RedTotal+1, BX ;Carry the 1
     LOOPNZ Loop
    

    If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of if conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.

  • You can simplify your x/y loop to run in one dimension instead of two-- for ( i = 0; i < y * x * c; i += 4 ). Since you're looking at the whole image, you don't need to worry about the stride. Not only will this reduce the number of operations required, but you may get better performance from your CPU due to better pipelining and branch prediction.

  • If you can, use a lower color depth (I don't think you need 24 bits of color depth if you're just computing an average). The smaller storage size will yield a smaller memory area to scan. You will have to shift bits around to do the math, but that sort of thing is faster than memory access.

  • You could try resizing or scaling the bitmap to something lower rez. The resize operation will interpolate color. In theory you could scale it to a 1x1 image and just read that one pixel. If you use GDI+ to perform the scale it could use hardware acceleration and be very fast.

  • Keep a copy of the last bitmap and its totals. Use REPE CMPSD (yes, this is assembly) to compare your new bitmap to the old one and find non-matching cells. Adjust totals and recompute average. This is probably a little harder than it sounds but the scan would be incredibly fast. This option would work better if most pixels are expected to stay the same from frame to frame.

  • Do the entire scan in assembly, four bytes at a time. DWord operations, believe it or not, are faster than byte operations, for a modern CPU. You can get the byte you need through bit shifting, which take very few clock cycles. Been a while for me, but would look something like this:

     MOV ECX, ArrayLength ;ECX is our counter (= bytecount ÷ 4)
     MOV EDX, Scan0 ;EDX is our data pointer
     SUB BX, BX ;Set BX to 0 for later
     Loop:
     LODSL ;Load EAX from array, increment pointer
     SHRL 8, EAX ;Dump the least eight bits
     ADDB GreenTotal, AL ;Add least 8 bits to green total
     ADCW GreenTotal+1,BX ;Carry the 1
     SHRL 8, EAX ;Shift right 8 more bits
     ADDB BlueTotal, AL ;Add least 8 bits to blue total
     ADCW BlueTotal+1, BX ;Carry the 1
     SHRL 8, EAX ;Shift right 8 more bits
     ADDB RedTotal, AL ;Add least 8 bits to red total
     ADCW RedTotal+1, BX ;Carry the 1
     LOOPNZ Loop ;Decrement CX and keep going until it is zero
    

    If the assembly is too much to take on, you can try to do the same in C++ and maybe the compiler will do a pretty good job of it. At the very least, we have gotten rid of all of your multiplication operations (which can take up 5-20x the number of clock cycles compared to a shift), two of your loops, and a whole bunch of if conditionals (which would mess up your CPU's branch prediction). Also we will get nice big cache bursts regardless of the dword alignment of the byte buffer, because it is a single-dimensional contiguous blob.

deleted 18 characters in body
Source Link
John Wu
  • 641
  • 1
  • 4
  • 8
Loading
added 14 characters in body
Source Link
John Wu
  • 641
  • 1
  • 4
  • 8
Loading
added 233 characters in body
Source Link
John Wu
  • 641
  • 1
  • 4
  • 8
Loading
added 925 characters in body
Source Link
John Wu
  • 641
  • 1
  • 4
  • 8
Loading
added 925 characters in body
Source Link
John Wu
  • 641
  • 1
  • 4
  • 8
Loading
added 12 characters in body
Source Link
John Wu
  • 641
  • 1
  • 4
  • 8
Loading
added 285 characters in body
Source Link
John Wu
  • 641
  • 1
  • 4
  • 8
Loading
added 1 character in body
Source Link
John Wu
  • 641
  • 1
  • 4
  • 8
Loading
added 1 character in body
Source Link
John Wu
  • 641
  • 1
  • 4
  • 8
Loading
added 1 character in body
Source Link
John Wu
  • 641
  • 1
  • 4
  • 8
Loading
added 429 characters in body
Source Link
John Wu
  • 641
  • 1
  • 4
  • 8
Loading
Source Link
John Wu
  • 641
  • 1
  • 4
  • 8
Loading
lang-cs

AltStyle によって変換されたページ (->オリジナル) /