Alternative XMLoadFloat3A implementation #231

Open

Labels

optimization

MikeMarcin opened on Feb 23, 2025

For SSE targets, XMLoadFloat3A does the following:

__m128 V = _mm_load_ps(&pSource->x);
return _mm_and_ps(V, g_XMMask3);

When compiling for AVX, this generates code that looks like one of the following:

vmovups xmm2,xmmword ptr [DirectX::g_XMMask3 (07FF78A593DD0h)] 
vandps xmm3,xmm2,xmmword ptr [rcx] 
; or 
vmovups xmm0, XMMWORD PTR [rcx]
vandps xmm0, xmm0, XMMWORD PTR XMVECTORU32 const g_XMMask3

Consider doing this instead:

__m128 V = _mm_load_ps(&pSource->x);
return _mm_blend_ps(_mm_setzero_ps(), V, 0b0111);

This avoids loading g_XMMask3 from memory and generates the slightly more efficient sequence:

vxorps xmm0, xmm0, xmm0
vblendps xmm0, xmm0, XMMWORD PTR [rcx], 7
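
To make the comparison concrete, here is a minimal standalone sketch of both variants. It does not use DirectXMath itself: `Float3A`, `LoadFloat3A_Mask`, and `LoadFloat3A_Blend` are hypothetical stand-ins, and the mask is built inline rather than read from `g_XMMask3`. Since `_mm_blend_ps` is an SSE4.1 intrinsic, the sketch falls back to the mask path when SSE4.1/AVX code generation is not enabled.

```cpp
#include <cassert>
#include <cstdint>
#include <xmmintrin.h>   // SSE:  _mm_load_ps, _mm_and_ps, _mm_setzero_ps
#include <emmintrin.h>   // SSE2: _mm_set_epi32, _mm_castsi128_ps
#if defined(__SSE4_1__) || defined(__AVX__)
#include <smmintrin.h>   // SSE4.1: _mm_blend_ps
#define SKETCH_HAS_BLEND 1
#else
#define SKETCH_HAS_BLEND 0
#endif

// Hypothetical stand-in for XMFLOAT3A: three floats with 16-byte alignment,
// so the object occupies 16 bytes and a full 128-bit load stays in bounds.
struct alignas(16) Float3A { float x, y, z; };

// Mask variant: aligned 128-bit load, then AND with a mask that clears w.
inline __m128 LoadFloat3A_Mask(const Float3A* p)
{
    assert((reinterpret_cast<std::uintptr_t>(p) & 15) == 0);
    // _mm_set_epi32 takes lanes from highest to lowest: w = 0, z/y/x = all ones.
    const __m128 mask = _mm_castsi128_ps(_mm_set_epi32(0, -1, -1, -1));
    return _mm_and_ps(_mm_load_ps(&p->x), mask);
}

// Blend variant: take x, y, z from the load and w from a zeroed register.
inline __m128 LoadFloat3A_Blend(const Float3A* p)
{
    assert((reinterpret_cast<std::uintptr_t>(p) & 15) == 0);
#if SKETCH_HAS_BLEND
    const __m128 v = _mm_load_ps(&p->x);
    // Immediate 0x7 (0b0111): lanes 0..2 come from v, lane 3 from zero.
    return _mm_blend_ps(_mm_setzero_ps(), v, 0x7);
#else
    // Assumption of this sketch: without SSE4.1, reuse the mask path.
    return LoadFloat3A_Mask(p);
#endif
}
```

Both functions should return the same vector for any input, including when the padding bytes after `z` hold garbage, because the w lane is either ANDed with zero or never selected by the blend.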

I would like to suggest the same for XMLoadFloat3, but there are edge cases where reading the extra float can cause an access violation (even though that lane is masked out by the blend), for example when the 12-byte source ends right at the boundary of an unmapped page. I would be fine with that tradeoff to replace the VMOVSD -> VINSERTPS sequence with XORPS -> VBLENDPS, but I can imagine a general-purpose library erring on the side of caution.
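
The tradeoff for the unaligned case can be sketched as follows. This is an illustration only, not the library's code: `Float3`, `LoadFloat3_Safe`, and `LoadFloat3_OverRead` are hypothetical names, the safe path uses `_mm_loadl_pi` for the 64-bit load (a swapped-in equivalent of the VMOVSD-style load mentioned above), and the over-reading path uses an AND mask so that it compiles without SSE4.1.

```cpp
#include <cassert>
#include <xmmintrin.h>   // SSE:  _mm_loadl_pi, _mm_load_ss, _mm_movelh_ps, _mm_loadu_ps
#include <emmintrin.h>   // SSE2: _mm_set_epi32, _mm_castsi128_ps

// Hypothetical stand-in for XMFLOAT3: 12 bytes, no alignment guarantee.
struct Float3 { float x, y, z; };
static_assert(sizeof(Float3) == 12, "sketch assumes a tightly packed 12-byte struct");

// Safe variant: touches exactly 12 bytes (one 8-byte load plus one 4-byte load).
inline __m128 LoadFloat3_Safe(const Float3* p)
{
    assert(p != nullptr);
    // Low two lanes = {x, y}; upper two lanes stay zero.
    const __m128 xy = _mm_loadl_pi(_mm_setzero_ps(),
                                   reinterpret_cast<const __m64*>(&p->x));
    // {z, 0, 0, 0}
    const __m128 z = _mm_load_ss(&p->z);
    // Low half of xy followed by low half of z: {x, y, z, 0}
    return _mm_movelh_ps(xy, z);
}

// Over-reading variant: a single unaligned 16-byte load, then clear w.
// WARNING: this reads 4 bytes past the end of *p. It is only safe when the
// caller knows those bytes are mapped and readable (for example, when p is
// not the last element of an array).
inline __m128 LoadFloat3_OverRead(const Float3* p)
{
    assert(p != nullptr);
    const __m128 mask = _mm_castsi128_ps(_mm_set_epi32(0, -1, -1, -1));
    return _mm_and_ps(_mm_loadu_ps(&p->x), mask);
}
```

The two functions produce the same result whenever the extra four bytes are readable. The difference is only in which memory they touch, which is why the faster form is hard to justify as a default in a general-purpose library.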
