I'm especially interested in ways to further optimize a constraint algorithm. So far I've switched to SIMD sse4 vector operations which nearly doubled the performance. I'm hoping there is more performance to squeeze out that I simply don't see.
Here is the algorithm:
FVector4 rest(REST_DISTANCE); // Set the constant SIMD vectors outside the loop
FVector4 stiffness(STIFFNESS);
for (int i = 0; i < TEST_SIZE; i++)
{
    const auto& constraint = Constraints[i];
    auto& position1 = Particle[constraint.first].Position;
    auto& position2 = Particle[constraint.second].Position;
    for (int j = 0; j < ITERATION_COUNT; j++)
    {
        auto delta = position2 - position1;
        auto distance = delta.norm();
        auto correctionDistance = (distance - rest) / distance;
        auto pCorrection = correctionDistance * delta * stiffness;
        position1 += pCorrection;
        position2 -= pCorrection;
    }
}
`Constraints` is an array of a simple struct that holds two unsigned ints, `Particle` is an array of `FParticle` structs, each holding three `FVector4` fields, and `FVector4` simply wraps a `__m128` and provides operators and functions like `norm()` that use `_mm_*` intrinsics.
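For context, the layout described above might look like the following sketch. Only `Position`, `first`, and `second` appear in the algorithm; the other field names and the struct names `FParticle`/`FConstraint` beyond what's mentioned are guesses on my part:

```cpp
#include <immintrin.h>

// Minimal stand-in for the SIMD wrapper described below.
struct FVector4 { __m128 Data; };

// Three FVector4 fields; only Position is named in the algorithm,
// the other two are hypothetical placeholders.
struct FParticle
{
    FVector4 Position;
    FVector4 PrevPosition;
    FVector4 Acceleration;
};

// Two unsigned ints indexing into the particle array.
struct FConstraint
{
    unsigned int first;
    unsigned int second;
};
```

Note that under this layout each 8-byte constraint touches two 48-byte particles scattered through memory, which matters for cache behavior.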
Currently the algorithm takes 55 cycles per constraint per iteration, measured with `TEST_SIZE` of 150,000 and `ITERATION_COUNT` of 16. I know there is performance to gain from making it parallel, but I'm hoping to get more out of the single-threaded version specifically.
I can post the actual structs if needed.
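A cycles-per-iteration number like the one above can be obtained along these lines (a sketch only, using `__rdtsc`; the dummy body here is a placeholder for the real constraint loop, which is what the actual measurement times):

```cpp
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>  // __rdtsc on GCC/Clang; MSVC has it in <intrin.h>

// Time a dummy workload and report cycles per constraint per iteration.
// The body of the inner loop is a stand-in for the real SIMD constraint step.
double measureCyclesPerIteration()
{
    const int TEST_SIZE = 150000;
    const int ITERATION_COUNT = 16;
    volatile float sink = 0.f;  // keeps the dummy work from being optimized away

    const uint64_t start = __rdtsc();
    for (int i = 0; i < TEST_SIZE; ++i)
        for (int j = 0; j < ITERATION_COUNT; ++j)
            sink = sink + 1.f;  // placeholder for the constraint body
    const uint64_t cycles = __rdtsc() - start;

    return double(cycles) / TEST_SIZE / ITERATION_COUNT;
}
```

Bear in mind that `rdtsc` counts reference cycles, so frequency scaling can skew the result on some machines.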
FVector4 code:
struct FVector4
{
    FVector4()
    {}

    explicit FVector4(const __m128& InitData) : Data(InitData)
    {}

    explicit FVector4(__m128&& InitData) : Data(InitData)
    {}

    explicit FVector4(float InitValue)
    {
        Data = _mm_set_ps1(InitValue);
    }

    explicit FVector4(float X, float Y, float Z, float W = 0.f)
    {
        Data = _mm_setr_ps(X, Y, Z, W);
    }

    __forceinline float& operator[](int index)
    {
        return Data.m128_f32[index];
    }

    __forceinline const float& operator[](int index) const
    {
        return Data.m128_f32[index];
    }

    __forceinline __m128 squaredNorm() const
    {
        return _mm_dp_ps(Data, Data, 0x7F);
    }

    __forceinline __m128 norm() const
    {
        auto sqNorm = squaredNorm();
        return _mm_mul_ps(sqNorm, _mm_rsqrt_ps(sqNorm));
    }

    __forceinline __m128 normalized() const
    {
        auto dp = _mm_dp_ps(Data, Data, 0x7F);
        dp = _mm_rsqrt_ps(dp);
        return _mm_mul_ps(Data, dp);
    }

    __forceinline void normalize()
    {
        Data = normalized();
    }

    __m128 Data;
};
Operators as used in the algorithm:
__forceinline FVector4 operator-(const FVector4& a, const FVector4& b)
{
    return FVector4(_mm_sub_ps(a.Data, b.Data));
}

__forceinline FVector4 operator*(const FVector4& a, const FVector4& b)
{
    return FVector4(_mm_mul_ps(a.Data, b.Data));
}

__forceinline FVector4 operator/(const FVector4& a, const FVector4& b)
{
    return FVector4(_mm_mul_ps(a.Data, _mm_rcp_ps(b.Data)));
}

__forceinline FVector4& operator+=(FVector4& a, const FVector4& b)
{
    a.Data = _mm_add_ps(a.Data, b.Data);
    return a;
}

__forceinline FVector4& operator-=(FVector4& a, const FVector4& b)
{
    a.Data = _mm_sub_ps(a.Data, b.Data);
    return a;
}
2 Answers
I don't know much about SSE and can't offer any performance advice right now, just a few suggestions to improve your code a little. Nevertheless, it looks well written and clean.
- Since the default constructor does nothing, you can make it `= default`. That's the new syntax for such cases since C++11, and it conveys more clearly that you want a default no-op constructor.
- The move constructor is pointless. An `__m128` is a native type like an integer, so it fits in a CPU register and will always be copied. You can remove this constructor overload.
- The other constructor, which takes a `__m128` by const reference, is probably less efficient than it could be. Again, `__m128` can be passed via registers, so taking it by reference instead passes a pointer that has to be dereferenced. I bet it generates more code than if you'd just passed it by value.
- The constructor that takes the four X, Y, Z, W floats doesn't have to be `explicit`. For any constructor taking more than one argument with no default, `explicit` doesn't make sense. It is only meaningful when the constructor has a single required argument and you want to force writing something like `func(FVector4{42.0f})`. If the constructor takes more than one argument, it can't be called implicitly.
- `const float& operator[](int index) const` should instead return by value. Again, you are forcing the compiler to generate indirection code when the value can be cheaply returned in a register.
- I would also recommend `assert`ing in the array subscript operators to ensure `index >= 0 && index < 4`.
- If you decide to port this code in the future, `__forceinline` is going to be a problem: it is a Microsoft extension. I would `#define` a portable macro for it instead, or just use standard `inline`. `__forceinline` is no guarantee of inlining, despite the misleading name; it just provides a slightly stronger hint to the compiler than `inline` does.
- `__m128` is a somewhat ugly type to pass around, with those double underscores. In your place, I would declare an alias for it, perhaps `simd_t` or something. It might even be of value in the future if you decide to port `FVector4` to some other SIMD instruction set like AltiVec.
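Putting several of those suggestions together, the type might look something like this sketch. The `FORCEINLINE` macro and `simd_t` alias names are my own choices, and the `operator[]` shown is a portable by-value variant rather than the MSVC-specific `m128_f32` access:

```cpp
#include <immintrin.h>
#include <cassert>

#if defined(_MSC_VER)
    #define FORCEINLINE __forceinline
#else
    #define FORCEINLINE inline __attribute__((always_inline))
#endif

using simd_t = __m128;  // alias to ease a future port to another SIMD set

struct FVector4
{
    FVector4() = default;                       // explicit no-op default

    explicit FVector4(simd_t InitData)          // by value: fits in a register
        : Data(InitData) {}

    explicit FVector4(float InitValue)
        : Data(_mm_set_ps1(InitValue)) {}

    FVector4(float X, float Y, float Z, float W = 0.f)  // explicit not needed
        : Data(_mm_setr_ps(X, Y, Z, W)) {}

    FORCEINLINE float operator[](int index) const  // by value, with a bounds check
    {
        assert(index >= 0 && index < 4);
        alignas(16) float tmp[4];
        _mm_store_ps(tmp, Data);
        return tmp[index];
    }

    simd_t Data;
};
```

The by-value `operator[]` costs a store here; in release builds the compiler will often elide it, but if indexed access is hot you'd want to verify that in the disassembly.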
auto delta = position2 - position1;
auto distance = delta.norm();
auto correctionDistance = (distance - rest) / distance;
auto pCorrection = correctionDistance * delta * stiffness;
I would mark all of these as `const auto`, since they don't change within each loop cycle; that might enable compiler optimizations that weren't possible before, and it expresses their `const`ness to the reader.
EDIT: If you're willing to make certain assumptions about `ITERATION_COUNT`, you might be able to do a little manual loop unrolling.
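As a rough illustration of that idea, here is a scalar stand-in for the relaxation step with the inner loop unrolled by 4, assuming `ITERATION_COUNT` is a multiple of 4 (the real code would unroll the SIMD body the same way; the 1-D `relax` below only mimics its structure):

```cpp
#include <cassert>
#include <cmath>

constexpr int ITERATION_COUNT = 16;
static_assert(ITERATION_COUNT % 4 == 0, "unroll factor must divide the count");

// Scalar stand-in for one constraint relaxation step on a 1-D position pair.
static void relax(float& p1, float& p2, float rest, float stiffness)
{
    const float delta = p2 - p1;
    const float distance = std::fabs(delta);
    const float correction = (distance - rest) / distance * delta * stiffness;
    p1 += correction;
    p2 -= correction;
}

// Run ITERATION_COUNT relaxation steps, unrolled by 4, and return the
// final separation of the pair.
static float settle(float p1, float p2, float rest, float stiffness)
{
    // One loop-counter compare/increment per four relaxation steps.
    for (int j = 0; j < ITERATION_COUNT; j += 4)
    {
        relax(p1, p2, rest, stiffness);
        relax(p1, p2, rest, stiffness);
        relax(p1, p2, rest, stiffness);
        relax(p1, p2, rest, stiffness);
    }
    return p2 - p1;
}
```

Note that within a single constraint the iterations depend on each other serially, so unrolling here mainly trims loop overhead rather than exposing extra instruction-level parallelism.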
Comment: "True, I'll make those changes and try it. Yes, the iteration count will never go below 4 so I could unroll it at least that many times." – NIZGTR, Jan 6, 2016 at 18:10