I need to be able to atomically add a 32-bit floating-point value to a location in memory. Here is what I came up with. While the code is Windows-specific, I'll extend it with Linux support using `__sync_bool_compare_and_swap()`.
I've tested this code and it appears to work fine, but I'd appreciate a fresh look from another pair of eyes to make sure it is 100% safe.
I'm also very interested in any performance advice, as this code sits on a hot path in a performance-sensitive application (interactive rendering).
`atomic_float_add()`, the main function:
```cpp
__forceinline void atomic_float_add(volatile float* ptr, const float operand)
{
    assert(is_aligned(ptr, 4));

    volatile LONG* lptr = reinterpret_cast<volatile LONG*>(ptr);
    LONG lorg, lnew;

    do
    {
        const float forg = *ptr;
        const float fnew = forg + operand;
        lorg = binary_cast<LONG>(forg);
        lnew = binary_cast<LONG>(fnew);
    } while (InterlockedCompareExchange(lptr, lnew, lorg) != lorg);
}
```
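For the planned Linux port, the same loop can be sketched with the GCC/Clang builtin mentioned above (this is an assumption about the port, not code from the original post; `atomic_float_add_linux` is a hypothetical name, and `std::memcpy` stands in for `binary_cast` to keep the sketch self-contained):

```cpp
#include <cstdint>
#include <cstring>

// Sketch of the Linux counterpart: __sync_bool_compare_and_swap returns
// true if the swap took place, so the loop retries until no other thread
// modified the value in between. std::memcpy performs the float/int32_t
// bit casts with defined behavior.
inline void atomic_float_add_linux(volatile float* ptr, const float operand)
{
    volatile int32_t* lptr = reinterpret_cast<volatile int32_t*>(ptr);
    int32_t lorg, lnew;
    do
    {
        const float forg = *ptr;
        const float fnew = forg + operand;
        std::memcpy(&lorg, &forg, sizeof(lorg));
        std::memcpy(&lnew, &fnew, sizeof(lnew));
    } while (!__sync_bool_compare_and_swap(lptr, lorg, lnew));
}
```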
`binary_cast()` is implemented with a union, with the intent of obeying strict aliasing rules (union type punning is well-defined in C and supported as an extension by MSVC and GCC, though it is formally undefined behavior in standard C++):
```cpp
template <typename Target, typename Source>
inline Target binary_cast(Source s)
{
    BOOST_STATIC_ASSERT(sizeof(Target) == sizeof(Source));

    union
    {
        Source m_source;
        Target m_target;
    } u;

    u.m_source = s;
    return u.m_target;
}
```
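A strictly conforming alternative, sketched here under the hypothetical name `binary_cast_memcpy` (not part of the original code), uses `std::memcpy`, which mainstream compilers optimize down to the same single register move:

```cpp
#include <cstdint>
#include <cstring>

// memcpy between objects of equal size is always well-defined in C++,
// and compilers lower it to a plain register move at -O1 and above.
template <typename Target, typename Source>
inline Target binary_cast_memcpy(const Source s)
{
    static_assert(sizeof(Target) == sizeof(Source), "size mismatch");
    Target t;
    std::memcpy(&t, &s, sizeof(t));
    return t;
}
```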
`is_aligned()`:
```cpp
template <typename T>
inline bool is_aligned(const T ptr, const size_t alignment)
{
    assert(alignment > 0);
    assert(is_pow2(alignment));

    const uintptr_t p = (uintptr_t)ptr;
    return (p & (alignment - 1)) == 0;
}
```
`is_pow2()`:
```cpp
template <typename T>
inline bool is_pow2(const T x)
{
    return (x & (x - 1)) == 0;
}
```
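One subtlety in `is_pow2()`: it also returns true for 0, which is not a power of two. That is harmless here because `is_aligned()` asserts `alignment > 0` first, but the edge case is worth keeping in mind. A standalone copy of the one-liner makes the behavior explicit:

```cpp
// Standalone copy of is_pow2 for a quick sanity check; note that x == 0
// also satisfies (x & (x - 1)) == 0, so callers must exclude zero.
template <typename T>
inline bool is_pow2_check(const T x)
{
    return (x & (x - 1)) == 0;
}
```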
Assembly output for `atomic_float_add()` (Visual Studio 2013, full optimization):
```asm
lorg$ = 8
ptr$ = 8
lnew$ = 16
operand$ = 16
?atomic_float_add@?A0x1534477b@foundation@@YAXPECMM@Z PROC
        prefetchw BYTE PTR [rcx]
        npad    13
$LL3@atomic_flo:
        movss   xmm0, DWORD PTR [rcx]
        movss   DWORD PTR lorg$[rsp], xmm0
        addss   xmm0, xmm1
        mov     eax, DWORD PTR lorg$[rsp]
        movss   DWORD PTR lnew$[rsp], xmm0
        mov     edx, DWORD PTR lnew$[rsp]
        lock cmpxchg DWORD PTR [rcx], edx
        jne     SHORT $LL3@atomic_flo
        ret     0
?atomic_float_add@?A0x1534477b@foundation@@YAXPECMM@Z ENDP
```
EDIT: here's a bit of context on how this code is used:
*(Profile screenshot showing that `atomic_float_add()` is a hot spot.)*

As you can see, `atomic_float_add()` is taking the bulk of the time of `FilteredTile::add()`, which is itself the most expensive function of the whole program in this particular setup, accounting for nearly 27% of the overall time.
2 Answers
@EOF commented that `InterlockedCompareExchange` (and its gcc counterpart, `__sync_val_compare_and_swap`) returns the initial value of the destination address. This makes it possible to remove one memory load from the retry loop.
Here is the new version with this optimization:
```cpp
__forceinline void atomic_float_add(volatile float* ptr, const float operand)
{
    assert(is_aligned(ptr, 4));

    volatile LONG* iptr = reinterpret_cast<volatile LONG*>(ptr);
    LONG expected = *iptr;

    while (true)
    {
        const float value = binary_cast<float>(expected);
        const LONG new_value = binary_cast<LONG>(value + operand);
        const LONG actual = InterlockedCompareExchange(iptr, new_value, expected);
        if (actual == expected)
            return;
        expected = actual;
    }
}
```
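For comparison, the same optimization reads naturally with C++11's `std::atomic<float>` (a sketch only, since this changes the stored type, which may not be an option in every codebase): `compare_exchange_weak` writes the observed value back into `expected` on failure, so the loop needs no explicit reload either.

```cpp
#include <atomic>

// Sketch with std::atomic<float> (not the author's code): on failure,
// compare_exchange_weak stores the value actually found into `expected`,
// mirroring the InterlockedCompareExchange return-value trick above.
inline void atomic_float_add_std(std::atomic<float>& target, const float operand)
{
    float expected = target.load();
    while (!target.compare_exchange_weak(expected, expected + operand))
        ;  // `expected` was refreshed by the failed exchange; just retry
}
```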
Here is the corresponding assembly:
```asm
?atomic_float_add@foundation@@YAXPECMM@Z PROC
        mov     eax, DWORD PTR [rcx]
        mov     r8, rcx
        mov     DWORD PTR value2ドル[rsp], eax
        movss   xmm0, DWORD PTR value2ドル[rsp]
        addss   xmm0, xmm1
        movss   DWORD PTR new_value1ドル[rsp], xmm0
        mov     edx, DWORD PTR new_value1ドル[rsp]
        lock cmpxchg DWORD PTR [rcx], edx
        je      SHORT $LN16@atomic_float_add
        npad    13
$LL3@atomic_float_add:
        mov     DWORD PTR value2ドル[rsp], eax
        mov     edx, eax
        movss   xmm0, DWORD PTR value2ドル[rsp]
        addss   xmm0, xmm1
        movss   DWORD PTR new_value1ドル[rsp], xmm0
        mov     ecx, DWORD PTR new_value1ドル[rsp]
        lock cmpxchg DWORD PTR [r8], ecx
        cmp     eax, edx
        jne     SHORT $LL3@atomic_float_add
$LN16@atomic_float_add:
        ret     0
?atomic_float_add@foundation@@YAXPECMM@Z ENDP
```
Interestingly, the compiler decided to do one iteration before entering the loop. I'm not sure I understand the benefit of this...
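Since correctness under contention is the whole point, a quick stress test can be sketched for the Linux builtin the question mentions (assumed names `atomic_float_add_sync` and `stress_test`, not from the original post). Each thread performs many atomic adds of 1.0f; integers up to 2^24 are exact in `float`, so the final sum must match exactly if the CAS loop never loses an update:

```cpp
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

// Same CAS loop as the Linux sketch above, repeated for self-containment.
inline void atomic_float_add_sync(volatile float* ptr, const float operand)
{
    volatile int32_t* lptr = reinterpret_cast<volatile int32_t*>(ptr);
    int32_t lorg, lnew;
    do
    {
        const float forg = *ptr;
        const float fnew = forg + operand;
        std::memcpy(&lorg, &forg, sizeof(lorg));
        std::memcpy(&lnew, &fnew, sizeof(lnew));
    } while (!__sync_bool_compare_and_swap(lptr, lorg, lnew));
}

// Hammer one float from several threads and return the final value.
inline float stress_test(const int threads, const int adds_per_thread)
{
    volatile float sum = 0.0f;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&]
        {
            for (int i = 0; i < adds_per_thread; ++i)
                atomic_float_add_sync(&sum, 1.0f);
        });
    for (auto& th : pool)
        th.join();
    return sum;
}
```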
So as not to overlook the obvious, one could add short-circuit heuristics:
```cpp
__forceinline void atomic_float_add(volatile float* ptr, const float operand)
{
    if (operand == 0.0f)
        return;
    ...
```
and, at the call site:
```cpp
const size_t e = m_channel_count - 1;

if (weight == 0.0f)
{
}
else if (weight == 1.0f)
{
    memcpy(ptr, values, e * sizeof(float));
    ptr += e;
}
else
{
    for (size_t i = 0; i < e; ++i)
        *ptr++ += values[i] * weight;
}
```
The comparisons might also allow for an epsilon.
Whether this pays off depends on the frequencies of 0.0 and 1.0, which I'm afraid you have already tested.
- Hi, thanks for the input, appreciated! Unfortunately `operand` is almost never 0 or 1 so there's nothing to gain here... — François Beaune, Aug 11, 2016 at 6:53
Comments (fragments):
- "... `std::atomic<>`?"
- "... `std::atomic`. This is beyond our control. In any case, C++11's and Boost's atomics require a type change which is not acceptable in our context."
- "... `InterlockedCompareExchange`, something with weaker semantics maybe?"
- "... `InterlockedCompareExchange` returns the current value, you do not have to load from `ptr` in the loop if you reverse one of the `binary_cast`s."