Alternative XMLoadFloat3A implementation #231

Open
@MikeMarcin

Description

For SSE targets, XMLoadFloat3A does:

__m128 V = _mm_load_ps(&pSource->x);
return _mm_and_ps(V, g_XMMask3);

When compiled for AVX, this generates code that looks like one of the following:

vmovups xmm2,xmmword ptr [DirectX::g_XMMask3 (07FF78A593DD0h)] 
vandps xmm3,xmm2,xmmword ptr [rcx] 
; or 
vmovups xmm0, XMMWORD PTR [rcx]
vandps xmm0, xmm0, XMMWORD PTR XMVECTORU32 const g_XMMask3

Consider doing this instead:

__m128 V = _mm_load_ps(&pSource->x);
return _mm_blend_ps(_mm_setzero_ps(), V, 0b0111);

This avoids the memory load of g_XMMask3 and generates slightly more efficient code:

vxorps xmm0, xmm0, xmm0
vblendps xmm0, xmm0, XMMWORD PTR [rcx], 7
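
As a rough way to check that the two forms return the same vector, here is a minimal standalone sketch. It does not use DirectXMath: `Float3A`, `SKETCH_SSE41`, `LoadFloat3A_Mask`, `LoadFloat3A_Blend`, and `SameBits` are hypothetical names for illustration, the mask is built inline rather than loaded from g_XMMask3, and the struct carries an explicit padding float so the 16-byte load is well defined. Note that `_mm_blend_ps` requires SSE4.1.

```cpp
#include <immintrin.h>
#include <cstring>

// GCC and Clang need SSE4.1 enabled (per function here, or with -msse4.1)
// to use _mm_blend_ps; MSVC accepts the intrinsic without extra flags.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_SSE41 __attribute__((target("sse4.1")))
#else
#define SKETCH_SSE41
#endif

// Hypothetical stand-in for a 16-byte-aligned float3 (not DirectXMath's type).
// The explicit pad member makes the full 16-byte load well defined.
struct alignas(16) Float3A { float x, y, z, pad; };

// Mask form: load 16 bytes, then AND away the w lane.
inline __m128 LoadFloat3A_Mask(const Float3A* p) {
    // Lanes x, y, z are all ones; lane w is zero.
    const __m128 mask = _mm_castsi128_ps(_mm_set_epi32(0, -1, -1, -1));
    return _mm_and_ps(_mm_load_ps(&p->x), mask);
}

// Blend form: take x, y, z from the load and w from a zero register.
SKETCH_SSE41 inline __m128 LoadFloat3A_Blend(const Float3A* p) {
    return _mm_blend_ps(_mm_setzero_ps(), _mm_load_ps(&p->x), 0b0111);
}

// Bitwise comparison of two vectors.
inline bool SameBits(__m128 a, __m128 b) {
    return std::memcmp(&a, &b, sizeof(a)) == 0;
}
```

Both functions zero the w lane regardless of what the fourth float in memory holds, so `SameBits(LoadFloat3A_Mask(&v), LoadFloat3A_Blend(&v))` should be true for any `v`. Whether the blend form is actually faster depends on the compiler and target, so it is worth inspecting the generated assembly.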

I would like to suggest the same for XMLoadFloat3, but there are edge cases where reading that extra float could cause an access violation, even though it is masked out in the blend. I would be fine with that tradeoff to replace VMOVSD -> VINSERTPS with VXORPS -> VBLENDPS, but I can imagine a general-purpose library erring on the side of caution.
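
To make the unaligned tradeoff concrete, here is a minimal sketch of both options on a plain `const float*`. The names `LoadFloat3_Split`, `LoadFloat3_OverRead`, and `SKETCH_SSE41` are hypothetical, and the split version is just one way to read exactly 12 bytes, not necessarily what the library emits. The over-reading version is only safe when the caller guarantees that 16 bytes are readable at `p`.

```cpp
#include <immintrin.h>

// GCC and Clang need SSE4.1 enabled (per function here, or with -msse4.1)
// to use _mm_blend_ps; MSVC accepts the intrinsic without extra flags.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_SSE41 __attribute__((target("sse4.1")))
#else
#define SKETCH_SSE41
#endif

// Reads exactly 12 bytes: an 8-byte load for x, y plus a 4-byte load for z.
inline __m128 LoadFloat3_Split(const float* p) {
    // 8-byte load into the low half; the upper half is zeroed.
    __m128 xy = _mm_castsi128_ps(
        _mm_loadl_epi64(reinterpret_cast<const __m128i*>(p)));
    // 4-byte load into lane 0; lanes 1..3 are zeroed.
    __m128 z = _mm_load_ss(p + 2);
    // Result lanes: { x, y, z, 0 }.
    return _mm_movelh_ps(xy, z);
}

// Reads 16 bytes and discards the fourth float in the blend.
// ASSUMPTION: p[3] is readable; otherwise the load may fault.
SKETCH_SSE41 inline __m128 LoadFloat3_OverRead(const float* p) {
    return _mm_blend_ps(_mm_setzero_ps(), _mm_loadu_ps(p), 0b0111);
}
```

The two functions return the same vector whenever the fourth float is readable. The difference only shows up when the three floats sit at the very end of a readable region, which is the edge case described above.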
