I need to be able to atomically add a 32-bit floating-point value to a location in memory. Here is what I came up with. While the code is Windows-specific, I'll extend it with Linux support using `__sync_bool_compare_and_swap()`.
I've tested this code and it appears to work fine, but I'd appreciate a fresh look from another pair of eyes to make sure it is 100% safe.
I'm also very interested in any performance advice, as this code sits on a hot path in a performance-sensitive application (interactive rendering).
`atomic_float_add()`, the main function:
```cpp
__forceinline void atomic_float_add(volatile float* ptr, const float operand)
{
    assert(is_aligned(ptr, 4));

    volatile LONG* lptr = reinterpret_cast<volatile LONG*>(ptr);
    LONG lorg, lnew;

    do
    {
        const float forg = *ptr;
        const float fnew = forg + operand;
        lorg = binary_cast<LONG>(forg);
        lnew = binary_cast<LONG>(fnew);
    } while (InterlockedCompareExchange(lptr, lnew, lorg) != lorg);
}
```
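For the planned Linux port, the same loop can be sketched with the GCC/Clang builtin mentioned above (this is an assumption about the port, not code from the original post; `atomic_float_add_linux` is a hypothetical name, and `std::memcpy` stands in for `binary_cast` to keep the sketch self-contained):

```cpp
#include <cstdint>
#include <cstring>

// Sketch of the Linux counterpart: __sync_bool_compare_and_swap returns
// true if the swap took place, so the loop retries until no other thread
// modified the value in between. std::memcpy performs the float/int32_t
// bit casts with defined behavior.
inline void atomic_float_add_linux(volatile float* ptr, const float operand)
{
    volatile int32_t* lptr = reinterpret_cast<volatile int32_t*>(ptr);
    int32_t lorg, lnew;
    do
    {
        const float forg = *ptr;
        const float fnew = forg + operand;
        std::memcpy(&lorg, &forg, sizeof(lorg));
        std::memcpy(&lnew, &fnew, sizeof(lnew));
    } while (!__sync_bool_compare_and_swap(lptr, lorg, lnew));
}
```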
`binary_cast()` is implemented with a union, with the intent of obeying strict aliasing rules (union type punning is well-defined in C and supported as an extension by MSVC and GCC, though it is formally undefined behavior in standard C++):
```cpp
template <typename Target, typename Source>
inline Target binary_cast(Source s)
{
    BOOST_STATIC_ASSERT(sizeof(Target) == sizeof(Source));

    union
    {
        Source m_source;
        Target m_target;
    } u;

    u.m_source = s;
    return u.m_target;
}
```
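A strictly conforming alternative, sketched here under the hypothetical name `binary_cast_memcpy` (not part of the original code), uses `std::memcpy`, which mainstream compilers optimize down to the same single register move:

```cpp
#include <cstdint>
#include <cstring>

// memcpy between objects of equal size is always well-defined in C++,
// and compilers lower it to a plain register move at -O1 and above.
template <typename Target, typename Source>
inline Target binary_cast_memcpy(const Source s)
{
    static_assert(sizeof(Target) == sizeof(Source), "size mismatch");
    Target t;
    std::memcpy(&t, &s, sizeof(t));
    return t;
}
```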
`is_aligned()`:
```cpp
template <typename T>
inline bool is_aligned(const T ptr, const size_t alignment)
{
    assert(alignment > 0);
    assert(is_pow2(alignment));

    const uintptr_t p = (uintptr_t)ptr;
    return (p & (alignment - 1)) == 0;
}
```
`is_pow2()`:
```cpp
template <typename T>
inline bool is_pow2(const T x)
{
    return (x & (x - 1)) == 0;
}
```
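One subtlety in `is_pow2()`: it also returns true for 0, which is not a power of two. That is harmless here because `is_aligned()` asserts `alignment > 0` first, but the edge case is worth keeping in mind. A standalone copy of the one-liner makes the behavior explicit:

```cpp
// Standalone copy of is_pow2 for a quick sanity check; note that x == 0
// also satisfies (x & (x - 1)) == 0, so callers must exclude zero.
template <typename T>
inline bool is_pow2_check(const T x)
{
    return (x & (x - 1)) == 0;
}
```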
Assembly output for `atomic_float_add()` (Visual Studio 2013, full optimization):
```asm
lorg$ = 8
ptr$ = 8
lnew$ = 16
operand$ = 16
?atomic_float_add@?A0x1534477b@foundation@@YAXPECMM@Z PROC
        prefetchw BYTE PTR [rcx]
        npad    13
$LL3@atomic_flo:
        movss   xmm0, DWORD PTR [rcx]
        movss   DWORD PTR lorg$[rsp], xmm0
        addss   xmm0, xmm1
        mov     eax, DWORD PTR lorg$[rsp]
        movss   DWORD PTR lnew$[rsp], xmm0
        mov     edx, DWORD PTR lnew$[rsp]
        lock cmpxchg DWORD PTR [rcx], edx
        jne     SHORT $LL3@atomic_flo
        ret     0
?atomic_float_add@?A0x1534477b@foundation@@YAXPECMM@Z ENDP
```
EDIT: here's a bit of context on how this code is used:
*(Profile screenshot showing that `atomic_float_add()` is a hot spot.)*

As you can see, `atomic_float_add()` is taking the bulk of the time of `FilteredTile::add()`, which is itself the most expensive function of the whole program in this particular setup, accounting for nearly 27% of the overall time.
2 Answers
@EOF commented that `InterlockedCompareExchange` (and its gcc counterpart, `__sync_val_compare_and_swap`) returns the initial value of the destination address. This makes it possible to remove one memory load from the retry loop.
Here is the new version with this optimization:
```cpp
__forceinline void atomic_float_add(volatile float* ptr, const float operand)
{
    assert(is_aligned(ptr, 4));

    volatile LONG* iptr = reinterpret_cast<volatile LONG*>(ptr);
    LONG expected = *iptr;

    while (true)
    {
        const float value = binary_cast<float>(expected);
        const LONG new_value = binary_cast<LONG>(value + operand);
        const LONG actual = InterlockedCompareExchange(iptr, new_value, expected);
        if (actual == expected)
            return;
        expected = actual;
    }
}
```
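For comparison, the same optimization reads naturally with C++11's `std::atomic<float>` (a sketch only, since this changes the stored type, which may not be an option in every codebase): `compare_exchange_weak` writes the observed value back into `expected` on failure, so the loop needs no explicit reload either.

```cpp
#include <atomic>

// Sketch with std::atomic<float> (not the author's code): on failure,
// compare_exchange_weak stores the value actually found into `expected`,
// mirroring the InterlockedCompareExchange return-value trick above.
inline void atomic_float_add_std(std::atomic<float>& target, const float operand)
{
    float expected = target.load();
    while (!target.compare_exchange_weak(expected, expected + operand))
        ;  // `expected` was refreshed by the failed exchange; just retry
}
```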
Here is the corresponding assembly:
```asm
?atomic_float_add@foundation@@YAXPECMM@Z PROC
        mov     eax, DWORD PTR [rcx]
        mov     r8, rcx
        mov     DWORD PTR value2ドル[rsp], eax
        movss   xmm0, DWORD PTR value2ドル[rsp]
        addss   xmm0, xmm1
        movss   DWORD PTR new_value1ドル[rsp], xmm0
        mov     edx, DWORD PTR new_value1ドル[rsp]
        lock cmpxchg DWORD PTR [rcx], edx
        je      SHORT $LN16@atomic_float_add
        npad    13
$LL3@atomic_float_add:
        mov     DWORD PTR value2ドル[rsp], eax
        mov     edx, eax
        movss   xmm0, DWORD PTR value2ドル[rsp]
        addss   xmm0, xmm1
        movss   DWORD PTR new_value1ドル[rsp], xmm0
        mov     ecx, DWORD PTR new_value1ドル[rsp]
        lock cmpxchg DWORD PTR [r8], ecx
        cmp     eax, edx
        jne     SHORT $LL3@atomic_float_add
$LN16@atomic_float_add:
        ret     0
?atomic_float_add@foundation@@YAXPECMM@Z ENDP
```
Interestingly, the compiler decided to do one iteration before entering the loop. I'm not sure I understand the benefit of this...
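Since correctness under contention is the whole point, a quick stress test can be sketched for the Linux builtin the question mentions (assumed names `atomic_float_add_sync` and `stress_test`, not from the original post). Each thread performs many atomic adds of 1.0f; integers up to 2^24 are exact in `float`, so the final sum must match exactly if the CAS loop never loses an update:

```cpp
#include <cstdint>
#include <cstring>
#include <thread>
#include <vector>

// Same CAS loop as the Linux sketch above, repeated for self-containment.
inline void atomic_float_add_sync(volatile float* ptr, const float operand)
{
    volatile int32_t* lptr = reinterpret_cast<volatile int32_t*>(ptr);
    int32_t lorg, lnew;
    do
    {
        const float forg = *ptr;
        const float fnew = forg + operand;
        std::memcpy(&lorg, &forg, sizeof(lorg));
        std::memcpy(&lnew, &fnew, sizeof(lnew));
    } while (!__sync_bool_compare_and_swap(lptr, lorg, lnew));
}

// Hammer one float from several threads and return the final value.
inline float stress_test(const int threads, const int adds_per_thread)
{
    volatile float sum = 0.0f;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&]
        {
            for (int i = 0; i < adds_per_thread; ++i)
                atomic_float_add_sync(&sum, 1.0f);
        });
    for (auto& th : pool)
        th.join();
    return sum;
}
```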
So as not to overlook the obvious, one could add short-circuit heuristics:
```cpp
__forceinline void atomic_float_add(volatile float* ptr, const float operand)
{
    if (operand == 0.0f)
        return;
    ...
```
and, at the call site:
```cpp
const size_t e = m_channel_count - 1;

if (weight == 0.0f)
{
}
else if (weight == 1.0f)
{
    memcpy(ptr, values, e * sizeof(float));
    ptr += e;
}
else
{
    for (size_t i = 0; i < e; ++i)
        *ptr++ += values[i] * weight;
}
```
The comparisons might also allow for an epsilon.
Whether this pays off depends on the frequencies of 0.0 and 1.0, which I'm afraid you have already tested.
- Hi, thanks for the input, appreciated! Unfortunately `operand` is almost never 0 or 1 so there's nothing to gain here... — François Beaune, Aug 11, 2016 at 6:53
Comments (fragments):
- "... `std::atomic<>`?"
- "... `std::atomic`. This is beyond our control. In any case, C++11's and Boost's atomics require a type change which is not acceptable in our context."
- "... `InterlockedCompareExchange`, something with weaker semantics maybe?"
- "... `InterlockedCompareExchange` returns the current value, you do not have to load from `ptr` in the loop if you reverse one of the `binary_cast`s."