6
\$\begingroup\$

I need to be able to atomically add a 32-bit floating point value to a location in memory. Here is what I came up with. While the code is Windows-specific, I'll extend it with Linux support using __sync_bool_compare_and_swap().

While I tested this code and it appeared to work fine, I'll appreciate a fresh look from another pair of eyes to make sure this code is 100% safe.

I'm also very much interested in any performance advice as this code is used in a hot code path in a performance-sensitive application (interactive rendering).

atomic_float_add(), the main function:

__forceinline void atomic_float_add(volatile float* ptr, const float operand)
{
 assert(is_aligned(ptr, 4));
 volatile LONG* lptr = reinterpret_cast<volatile LONG*>(ptr);
 LONG lorg, lnew;
 do
 {
 const float forg = *ptr;
 const float fnew = forg + operand;
 lorg = binary_cast<LONG>(forg);
 lnew = binary_cast<LONG>(fnew);
 } while (InterlockedCompareExchange(lptr, lnew, lorg) != lorg);
}

binary_cast() is implemented in such a way that it obeys strict aliasing rules:

template <typename Target, typename Source>
inline Target binary_cast(Source s)
{
 BOOST_STATIC_ASSERT(sizeof(Target) == sizeof(Source));
 union
 {
 Source m_source;
 Target m_target;
 } u;
 u.m_source = s;
 return u.m_target;
}

is_aligned():

template <typename T>
inline bool is_aligned(const T ptr, const size_t alignment)
{
 assert(alignment > 0);
 assert(is_pow2(alignment));
 const uintptr_t p = (uintptr_t)ptr;
 return (p & (alignment - 1)) == 0;
}

is_pow2():

template <typename T>
inline bool is_pow2(const T x)
{
 return (x & (x - 1)) == 0;
}

Assembly output for atomic_float_add() (Visual Studio 2013, full optimization):

lorg$ = 8
ptr$ = 8
lnew$ = 16
operand$ = 16
?atomic_float_add@?A0x1534477b@foundation@@YAXPECMM@Z PROC
 prefetchw BYTE PTR [rcx]
 npad 13
$LL3@atomic_flo:
 movss xmm0, DWORD PTR [rcx]
 movss DWORD PTR lorg$[rsp], xmm0
 addss xmm0, xmm1
 mov eax, DWORD PTR lorg$[rsp]
 movss DWORD PTR lnew$[rsp], xmm0
 mov edx, DWORD PTR lnew$[rsp]
 lock cmpxchg DWORD PTR [rcx], edx
 jne SHORT $LL3@atomic_flo
 ret 0
?atomic_float_add@?A0x1534477b@foundation@@YAXPECMM@Z ENDP

EDIT: here's a bit of context on how this code is used:

Profile showing that atomic_float_add() is a hot spot

As you can see, atomic_float_add() is taking the bulk of the time of FilteredTile::add(), which itself is the most expensive function of the whole program in this particular setup, taking nearly 27% of the overall time.

asked Jul 25, 2016 at 13:31
\$\endgroup\$
6
  • \$\begingroup\$ Can you not use std::atomic<>? \$\endgroup\$ Commented Jul 25, 2016 at 17:31
  • \$\begingroup\$ For various reasons our code base is stuck to C++03 for at least another year, so no std::atomic. This is beyond our control. In any case, C++11's and Boost's atomics require a type change which is not acceptable in our context. \$\endgroup\$ Commented Jul 25, 2016 at 18:26
  • \$\begingroup\$ Besides trying SIMD I don't think there would be anything else to optimize. Also, ideally you would try to remove thread contention so that your work units can run in parallel without locking. \$\endgroup\$ Commented Jul 25, 2016 at 18:38
  • \$\begingroup\$ Thanks @glampert. The work unit work in parallel; this is just the place where their results are merged. Since the purpose is interactive rendering, the results need to be merged regularly to provide visual feedback. Regarding SIMD: unfortunately I'm not sure that would help since AFAIK there's no way to SIMD compare-and-exchange primitive. Regarding performance, I was wondering if for instance there were weakers variants of InterlockedCompareExchange, something with weaker semantics maybe? \$\endgroup\$ Commented Jul 25, 2016 at 18:41
  • 1
    \$\begingroup\$ Given that InterlockedCompareExchange returns the current value, you do not have to load from ptr in the loop if you reverse one of the binary_casts. \$\endgroup\$ Commented Aug 8, 2016 at 12:51

2 Answers 2

2
\$\begingroup\$

@EOF commented that InterlockedCompareExchange (and its gcc counterpart, __sync_val_compare_and_swap) returns the initial value of the destination address. This allows to remove one memory load from the retry loop.

Here is the new version with this optimization:

__forceinline void atomic_float_add(volatile float* ptr, const float operand)
{
 assert(is_aligned(ptr, 4));
 volatile LONG* iptr = reinterpret_cast<volatile LONG*>(ptr);
 LONG expected = *iptr;
 while (true)
 {
 const float value = binary_cast<float>(expected);
 const LONG new_value = binary_cast<LONG>(value + operand);
 const LONG actual = InterlockedCompareExchange(iptr, new_value, expected);
 if (actual == expected)
 return;
 expected = actual;
 }
}

Here is the corresponding assembly:

?atomic_float_add@foundation@@YAXPECMM@Z PROC
 mov eax, DWORD PTR [rcx]
 mov r8, rcx
 mov DWORD PTR value2ドル[rsp], eax
 movss xmm0, DWORD PTR value2ドル[rsp]
 addss xmm0, xmm1
 movss DWORD PTR new_value1ドル[rsp], xmm0
 mov edx, DWORD PTR new_value1ドル[rsp]
 lock cmpxchg DWORD PTR [rcx], edx
 je SHORT $LN16@atomic_float_add
 npad 13
$LL3@atomic_float_add:
 mov DWORD PTR value2ドル[rsp], eax
 mov edx, eax
 movss xmm0, DWORD PTR value2ドル[rsp]
 addss xmm0, xmm1
 movss DWORD PTR new_value1ドル[rsp], xmm0
 mov ecx, DWORD PTR new_value1ドル[rsp]
 lock cmpxchg DWORD PTR [r8], ecx
 cmp eax, edx
 jne SHORT $LL3@atomic_float_add
$LN16@atomic_float_add:
 ret 0
?atomic_float_add@foundation@@YAXPECMM@Z ENDP

Interestingly, the compiler decided to do one iteration before entering the loop. I'm not sure I understand the benefit of this...

answered Aug 8, 2016 at 18:25
\$\endgroup\$
0
\$\begingroup\$

In order to not overlook the obvious, adding heuristics:

__forceinline void atomic_float_add(volatile float* ptr, const float operand)
{
 if (operand == 0) {
 return;
 }
 ...

and

if (weight == 0) {
} else if (weight == 1) {
 memcopy(ptr, values, e * sizeof(float));
 ptr += e;
} else {
 for (size_t i = 0, e = m_channel_count - 1; i < e; ++i)
 {
 *ptr++ += values[i] * weight;
 }
}

Where the comparisons might add an epsilon.

It depends on the frequencies of 0.0 and 1.0. Which I am afraid you tested already.

answered Aug 9, 2016 at 7:03
\$\endgroup\$
1
  • \$\begingroup\$ Hi, thanks for the input, appreciated! Unfortunately operand is almost never 0 or 1 so there's nothing to gain here... \$\endgroup\$ Commented Aug 11, 2016 at 6:53

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.