Issue 1148042: ARM Neon optimization for S32A_Opaque_BlitRow32

agl

The .S file appears to have a BSD like license at the top of it. ...

15 years, 8 months ago (2010年05月12日 21:25:29 UTC) #2

The .S file appears to have a BSD like license at the top of it. However, Skia
is Apache licensed. The boilerplate should be copied from an existing Skia file,
such as src/core/SkPaint.cpp.
Skia probably has closer ties to Android than Chromium, so you should confirm
that you've filled out the agreement first:
http://source.android.com/license/individual-contributor-license---android-op...
After that, I'll try and find someone to review the ARM code.
Cheers

reed

Skia has compliance and performance tests that can be run using the proposed patch, in ...

15 years, 8 months ago (2010年05月14日 15:06:23 UTC) #3

Skia has compliance and performance tests that can be run using the
proposed patch, in addition to reviewing the source code.
mike
On Wed, May 12, 2010 at 5:25 PM, <agl@chromium.org> wrote:
> The .S file appears to have a BSD like license at the top of it.
> However, Skia is Apache licensed. The boilerplate should be copied from
> an existing Skia file, such as src/core/SkPaint.cpp.
>
> Skia probably has closer ties to Android than Chromium, so you should
> confirm that you've filled out the agreement first:
>
http://source.android.com/license/individual-contributor-license---android-op...
>
> After that, I'll try and find someone to review the ARM code.
>
>
> Cheers
>
> http://codereview.appspot.com/1148042/show
>

XinQi

Hi Mike, We've used the following methods to verify the patch: 1. I printed out ...

15 years, 8 months ago (2010年05月17日 19:32:44 UTC) #4

Hi Mike,
We've used the following methods to verify the patch:
1. I printed out testing vector in Chromium browser code (when browsing), and
verifying in simulation environment that new neon code is generating the same
output as old one. Also the cycle number gain.
2. We've merged the patch to arm build of Chromium browser, verifying that it
works. (I once had a bug that read more through to the next line, you will see
that when rendering nytimes.com, the "New York Times" picture is not rendered
correctly, also other small images. So by displaying them correctly, it is sort
of verifying that it works as expected ) 
3. I've also tried the Skia bench with the source tree, which doesn't seems to
make much difference. My guess was that this specific call is not exercised much
in "bench" application.
Please let me know more details about the compliance and performance tests if
you want me to run them.
Thanks,
Xin
On 2010年05月14日 15:06:23, reed wrote:
> Skia has compliance and performance tests that can be run using the
> proposed patch, in addition to reviewing the source code.
> 
> mike

XinQi

Update license in header

15 years, 7 months ago (2010年06月01日 16:58:49 UTC) #5

Update license in header

agl

http://codereview.appspot.com/1148042/diff/6001/7001 File opts/S32A_Opaque_BlitRow32_neon2.S (right): http://codereview.appspot.com/1148042/diff/6001/7001#newcode18 opts/S32A_Opaque_BlitRow32_neon2.S:18: .fpu neon Assembly code always needs a stonking lot ...

15 years, 7 months ago (2010年06月03日 21:47:41 UTC) #6

http://codereview.appspot.com/1148042/diff/6001/7001
File opts/S32A_Opaque_BlitRow32_neon2.S (right):
http://codereview.appspot.com/1148042/diff/6001/7001#newcode18
opts/S32A_Opaque_BlitRow32_neon2.S:18: .fpu neon
Assembly code always needs a stonking lot of comments. Feel free to include the
C prototype of the function, inputs, outputs, what its doing, etc etc.
http://codereview.appspot.com/1148042/diff/6001/7002
File opts/SkBlitRow_opts_arm.cpp (right):
http://codereview.appspot.com/1148042/diff/6001/7002#newcode434
opts/SkBlitRow_opts_arm.cpp:434: extern "C" void
S32A_Opaque_BlitRow32_neon2(SkPMColor* SK_RESTRICT dst,
fix double space after "C"
http://codereview.appspot.com/1148042/diff/6001/7002#newcode560
opts/SkBlitRow_opts_arm.cpp:560:
#define	S32A_Opaque_BlitRow32_PROC	S32A_Opaque_BlitRow32_neon2
there are a couple of tabs in this line. Only spaces should be used.

XinQi

Update patch upon comments

15 years, 7 months ago (2010年06月15日 23:34:06 UTC) #7

Update patch upon comments

agl

r578

15 years, 7 months ago (2010年06月16日 19:53:46 UTC) #8

r578

ray.essick

I'd have written the version in opts/SkBlitRow_opts_arm.cpp run with even wider operations if gcc had ...

15 years, 7 months ago (2010年06月16日 22:28:52 UTC) #9

I'd have written the version in opts/SkBlitRow_opts_arm.cpp run with even wider
operations if gcc had generated nicer code. The gcc 4.3.2 compiler I'd been
using did a lot of extra register spills when I last tried to get really wide --
perhaps the gcc 4.4.x stuff fixes that; it might be worth checking.

XinQi

Could you expand what you mean? Use inline assembly to implement it or to use ...

15 years, 7 months ago (2010年06月17日 00:36:31 UTC) #10

Could you expand what you mean? Use inline assembly to implement it or to use
intrinsic?
On 2010年06月16日 22:28:52, ray.essick wrote:
> I'd have written the version in opts/SkBlitRow_opts_arm.cpp run with even
wider
> operations if gcc had generated nicer code. The gcc 4.3.2 compiler I'd been
> using did a lot of extra register spills when I last tried to get really wide
--
> perhaps the gcc 4.4.x stuff fixes that; it might be worth checking.

ray.essick

sure, i can expand/explain where i'm coming from. I wrote a lot of that neon ...

15 years, 7 months ago (2010年06月17日 03:40:15 UTC) #11

sure, i can expand/explain where i'm coming from. I wrote a lot of that 
neon code -- basically all the pieces using intrinsics. i'm a big fan of 
the intrinsics -- let the compiler manage all the register assignments, 
scheduling instructions and such. and I think that they are easier to 
maintain than hand assembly.
your comments indicate that you're getting 70% improvement (i take this 
as "almost 2x the speed"). that's pretty good considering that the 
first neon version was alread 3x faster than the scalar version. That 
extra bit would have me thinking about whether to bend my natural 
inclination to stay with the intrinsics.
as to what I was thinking to try ...
i was thinking perhaps to expand the int8x8's that i'd been using into 
int8x16's [with appropriate fattening of the other
intermediate types i'd used]. but as I re-read my code, i see that i 
widen my int8x8's to int16x8 -- so i'd need to widen an int8x16 to 
int16x16 and we don't have that data type.
It appears that the place I was having the ugly register spills was 
where I'd wanted to use intrinsics to generate some vld4.8 instructions 
in the S32A_D565_Opaque_Dither_neon routine. so my memory is doubly bad 
this evening. You can see there how I worked around the limitations in 
gcc to get clean vld4.8 generated code; the register allocation worked 
nicely and I wasn't seeing any strange register motion.
I might play the same trick that I used for the vld4.8's to get a 
"vld1.32 {d0,d1,d2,d3}" so that we get the big "wider loads are better" 
advantage; i'd have to code it to see whether it would hold together 
through the actual operations once it's in registers.
And your assembly code does do better with loading for the next round 
while finishing this round; i never mastered that with the intrinsics. I 
wouldn't be surprised if it reduces to that.
make sure to build & run skia_test -- the last go round of changes that 
I did for neon with Mike involved changes to track some updated 
semantics of SkAlpha255To256(). this took a little bit to get right -- 
and I think that the semantics of that routine might have changed back. 
skia_test was good for ferreting out that this wasn't right. you'll want 
to check external/skia/tests/Android.mk to make sure skia_test is being 
built for you.
and to be clear -- I haven't seen anything wrong in the code you posted 
(haven't worked through it either); it was more of a "have we exhausted 
what we could do with intrinsics" comment.
-- Ray Essick
According to XinQi@codeaurora.org on 06/16/10 19:36:
> Could you expand what you mean? Use inline assembly to implement it or
> to use intrinsic?
>
> On 2010年06月16日 22:28:52, ray.essick wrote:
>> I'd have written the version in opts/SkBlitRow_opts_arm.cpp run with
> even wider
>> operations if gcc had generated nicer code. The gcc 4.3.2 compiler I'd
> been
>> using did a lot of extra register spills when I last tried to get
> really wide --
>> perhaps the gcc 4.4.x stuff fixes that; it might be worth checking.
>
>
>
> http://codereview.appspot.com/1148042/show

XinQi

Ray, Thanks for the detailed comments. * Using testing vector dumped in chromium browser loading ...

15 years, 7 months ago (2010年06月17日 18:54:29 UTC) #12

Ray,
Thanks for the detailed comments.
* Using testing vector dumped in chromium browser loading nytimes.com, this
version uses 76% less cycles in simulator. So it is about 4x speed :-). Of
course, it depends on the specific hardware pipelining, and model of memory
latency. And we also didn't see dramatic improvements in real usage case as
page loading. 
* You are very much right that I have struggled with SkAlpha255To256 too.
dst = src + SkAlphaMulQ(dst, SkAlpha255To256(255 - SkGetPackedA32(src)))
and
dst = src + (dst * ((255-a) + (255-a)>>7)) >> 8
May have one bit difference. I am using the second algo in 8 points chunk, and
only use the first one in the last len%8 points. So it will not be bit exact. 
I had assumed that 1/255 difference in color should be OK, but I checked your
latest code, it seems that you changed all to use (256-a) in 16 bits channel. Is
it prefered way? I can change the code too.
* I was using chromium browser on chromiumOS to test it. I will run skia_test
on Android env
* One big part of the gain is to process 8 points at a chunk instead of 4
points. I didn't check but I believe there should be same intrinsic to do the
work.
* My former experience using intrinsic/inline assembly is that it is easy to
start with, but we may have to spend more time understanding why gcc is not
generating exactly what we are expecting. I am also not sure whether you can
tune for the hardware pipeline using intrinsics because gcc may have the freedom
to shift the instruction based on its standards of data dependency and register
allocation. Pure assembly is more clean, but you do have to take care of more
things as ABI and register allocation. Just personal preference and depends on
the complexity of the work.
Happy coding!
Xin

agl

On Thu, Jun 17, 2010 at 2:54 PM, <XinQi@codeaurora.org> wrote: > May have one bit ...

15 years, 7 months ago (2010年06月17日 18:58:09 UTC) #13

On Thu, Jun 17, 2010 at 2:54 PM, <XinQi@codeaurora.org> wrote:
> May have one bit difference. I am using the second algo in 8 points
> chunk, and only use the first one in the last len%8 points. So it will
> not be bit exact. I had assumed that 1/255 difference in color should
> be OK, but I checked your latest code, it seems that you changed all to
> use (256-a) in 16 bits channel. Is it prefered way? I can change the
> code too.
I haven't rolled this change into Chromium yet. However, if it causes
pixel differences it will fail the layout tests and be reverted.
AGL

agl

On Thu, Jun 17, 2010 at 2:58 PM, Adam Langley <agl@chromium.org> wrote: > I haven't ...

15 years, 7 months ago (2010年06月18日 15:34:38 UTC) #14

On Thu, Jun 17, 2010 at 2:58 PM, Adam Langley <agl@chromium.org> wrote:
> I haven't rolled this change into Chromium yet. However, if it causes
> pixel differences it will fail the layout tests and be reverted.
Reverted. Broke the build. (Possibly just the new files were missing
from the build.)
AGL

XinQi

Make S32A_Opaque_BlitRow32_neon2 bit exact as S32A_Opaque_BlitRow32_neon

15 years, 7 months ago (2010年06月18日 18:28:49 UTC) #15

Make S32A_Opaque_BlitRow32_neon2 bit exact as S32A_Opaque_BlitRow32_neon

XinQi

Make S32A_Opaque_BlitRow32_neon2 bit exact as S32A_Opaque_BlitRow32_neon

15 years, 7 months ago (2010年06月18日 18:45:03 UTC) #16

Make S32A_Opaque_BlitRow32_neon2 bit exact as S32A_Opaque_BlitRow32_neon

XinQi

* Update algorithm to use dst = src + SkAlphaMulQ(dst, SkAlpha255To256(255 - SkGetPackedA32(src))) it is ...

15 years, 7 months ago (2010年06月18日 18:50:27 UTC) #17

* Update algorithm to use 
dst = src + SkAlphaMulQ(dst, SkAlpha255To256(255 - SkGetPackedA32(src)))
it is not a big change because we have to expand to 16 bits lane to do
multiplication anyway.
 
* Have verified that this function has bit exact output as original
S32A_Opaque_BlitRow32_neon on Chromium testing vector.
* Have also verified that Android skia_test passes with this change.
* This patch is based on the reverted r582. So it is the full patch. Removed
space at the end of the lines too.
On 2010年06月18日 18:45:03, XinQi wrote:
> Make S32A_Opaque_BlitRow32_neon2 bit exact as S32A_Opaque_BlitRow32_neon

XinQi

AGL, We need the following updates on chromium/src/skia/skia.gyp for the new added .S files. Thanks, ...

15 years, 7 months ago (2010年06月18日 18:56:42 UTC) #18

AGL,
We need the following updates on chromium/src/skia/skia.gyp for the new added .S
files. 
Thanks,
Xin
diff --git a/skia.gyp b/skia.gyp
index 2b17094..4305fa4 100644
--- a/skia.gyp
+++ b/skia.gyp
@@ -726,7 +726,13 @@
 'sources': [
 '../third_party/skia/src/opts/SkBitmapProcState_opts_arm.cpp',
 '../third_party/skia/src/opts/SkBlitRow_opts_arm.cpp',
- '../third_party/skia/src/opts/SkUtils_opts_none.cpp',
+ '../third_party/skia/src/opts/S32A_Opaque_BlitRow32_neon2.S', 
+ '../third_party/skia/src/opts/S32_Opaque_D32_nofilter_DX_gather.S',
+ '../third_party/skia/src/opts/xfer.S', 
+ '../third_party/skia/src/opts/memset16_neon.S', 
+ '../third_party/skia/src/opts/memset32_neon.S', 
+ '../third_party/skia/src/opts/opts_check_arm_neon.cpp',
+ '../third_party/skia/src/opts/t32cb16blend.S', 
 ],
 }],
 ],

RSS Feeds Recent Issues | This issue

This is Rietveld f62528b

Issue 1148042: ARM Neon optimization for S32A_Opaque_BlitRow32

Patch Set 1 #

Patch Set 2 : Update license in header #

Patch Set 3 : Update patch upon comments #

Patch Set 4 : Make S32A_Opaque_BlitRow32_neon2 bit exact as S32A_Opaque_BlitRow32_neon #