I have this code that converts audio output levels to dB and stores them in an array (peak_dB[8]) for use in a real-time peak meter:
#include <emmintrin.h>
float channelsPeak[8] = { 0 };
float peak_dB[8] = { 0 };
__m128 log2_sse(__m128 x) {
// https://www.kvraudio.com/forum/viewtopic.php?p=7524831#p7524831
// 12-13ulp
const __m128 c0 = _mm_set1_ps(1.011593342e+01f);
const __m128 c1 = _mm_set1_ps(1.929443550e+01f);
const __m128 d0 = _mm_set1_ps(2.095932245e+00f);
const __m128 d1 = _mm_set1_ps(1.266638851e+01f);
const __m128 d2 = _mm_set1_ps(6.316540241e+00f);
const __m128 one = _mm_set1_ps(1.0f);
const __m128 multi = _mm_set1_ps(1.41421356237f);
const __m128i mantissa_mask = _mm_set1_epi32((1 << 23) - 1);
__m128i x_i = _mm_castps_si128(x);
__m128i spl_exp = _mm_castps_si128(_mm_mul_ps(x, multi));
spl_exp = _mm_sub_epi32(spl_exp, _mm_castps_si128(one));
spl_exp = _mm_andnot_si128(mantissa_mask, spl_exp);
__m128 spl_mantissa = _mm_castsi128_ps(_mm_sub_epi32(x_i, spl_exp));
spl_exp = _mm_srai_epi32(spl_exp, 23);
__m128 log2_exponent = _mm_cvtepi32_ps(spl_exp);
__m128 num = spl_mantissa;
num = _mm_add_ps(num, c1);
num = _mm_mul_ps(num, spl_mantissa);
num = _mm_add_ps(num, c0);
num = _mm_mul_ps(num, _mm_sub_ps(spl_mantissa, one));
__m128 denom = d2;
denom = _mm_mul_ps(denom, spl_mantissa);
denom = _mm_add_ps(denom, d1);
denom = _mm_mul_ps(denom, spl_mantissa);
denom = _mm_add_ps(denom, d0);
__m128 res = _mm_div_ps(num, denom);
res = _mm_add_ps(log2_exponent, res);
return res;
}
__m128 lin2db(__m128 x) {
const __m128 convert_10 = _mm_set1_ps(6.02059991328f);
return _mm_mul_ps(log2_sse(x), convert_10);
}
float getPeaks_dB(int cCount) { // cCount = enabled channels (1...8)
__m128 s1 = _mm_setzero_ps();
s1 = _mm_set_ps(channelsPeak[3], channelsPeak[2], channelsPeak[1], channelsPeak[0]); //channels 1-4
s1 = lin2db(s1);
_mm_store_ps(peak_dB, s1);
if(cCount > 4){
float t2[4] = { 0 };
__m128 s2 = _mm_setzero_ps();
s2 = _mm_set_ps(channelsPeak[7], channelsPeak[6], channelsPeak[5], channelsPeak[4]); // channels 5-8
s2 = lin2db(s2);
_mm_store_ps(t2, s2);
for (int i = 4; i < 8; i++){peak_dB[i] = t2[i-4];}
}
return 0;
}
Here, channelsPeak[0..7] stores the linear levels (0.0f..1.0f) of eight channels, read from the audio rendering device with a single GetChannelsPeakValues() call. peak_dB is an array of eight elements that holds the same levels in dB (for later calculations and for text output). lin2db is an approximation of 20log10(x) implemented with SSE intrinsics; it is faster than std::log10 because it trades away some accuracy.
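For reference, the constant 6.02059991328f in lin2db comes from the identity 20log10(x) = 20log10(2) * log2(x). A minimal scalar sketch of that identity follows; the name lin2db_ref is hypothetical and not part of the code above, and it uses std::log2 rather than the SSE approximation:

```cpp
#include <cmath>

// Hypothetical scalar reference for the conversion that lin2db approximates.
inline float lin2db_ref(float x) {
    // 20*log10(x) == 20*log10(2) * log2(x), and 20*log10(2) is about 6.0206.
    return 6.02059991328f * std::log2(x);
}
```

Something like this can be used to spot-check the SSE version against known levels, for example 0.5f should give roughly -6.02 dB.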
Q: Are there other (better) ways to insert data from s2 into the last four elements of peak_dB (AVX excluded)? I'm using VS2013 and, according to Compiler Explorer, the compiler seems to optimize the for-loop used in the code now. But since storing the first part of peak_dB with _mm_store_ps(peak_dB, s1) looks much simpler there, I'm wondering whether there is a way to do it without the for-loop.
1 Answer
Storing
Are there other (better) ways to insert data from s2 into the last four elements of peak_dB (AVX excluded)?
Yes, and you are already using it: _mm_store_ps. For example:
s2 = lin2db(s2);
_mm_store_ps(peak_dB + 4, s2);
If you prefer &peak_dB[4] to peak_dB + 4, that works just as well.
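_mm_store_ps takes a pointer to wherever you want to store the data; that pointer does not have to point to the start of an array. It does have to be 16-byte aligned, but peak_dB + 4 is exactly 16 bytes past peak_dB, so it is aligned whenever peak_dB is.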
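As a minimal sketch of storing at an offset: the helper name store_upper_half is hypothetical, and _mm_storeu_ps is used here instead of _mm_store_ps so that the example does not depend on the destination being 16-byte aligned.

```cpp
#include <emmintrin.h>

// Hypothetical helper, not from the question: writes the four lanes of v
// into elements 4..7 of an 8-element float array.
inline void store_upper_half(float* dst8, __m128 v) {
    // dst8 + 4 is an ordinary float*; the store fills dst8[4]..dst8[7].
    // _mm_storeu_ps has no alignment requirement. _mm_store_ps(dst8 + 4, v)
    // does the same job when dst8 is known to be 16-byte aligned.
    _mm_storeu_ps(dst8 + 4, v);
}
```

Elements 0..3 of the destination are left untouched, so no temporary array or copy loop is needed.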
Loading
Another problem here is the use of _mm_set_ps. Although it accepts runtime values as arguments, it is mostly meant for constant arguments, and good code is far from guaranteed if it is used differently. You can see the effect in the code on Godbolt. Here is the relevant part, with the instructions from lin2db that were interleaved with it removed:
movss xmm1, DWORD PTR float * channelsPeak+12
movss xmm0, DWORD PTR float * channelsPeak+8
movss xmm2, DWORD PTR float * channelsPeak+4
movss xmm4, DWORD PTR float * channelsPeak
unpcklps xmm2, xmm1
unpcklps xmm4, xmm0
unpcklps xmm4, xmm2
The same happens for the other, similar use of _mm_set_ps:
movss xmm1, DWORD PTR float * channelsPeak+28
movss xmm0, DWORD PTR float * channelsPeak+24
movss xmm2, DWORD PTR float * channelsPeak+20
movss xmm3, DWORD PTR float * channelsPeak+16
unpcklps xmm3, xmm0
unpcklps xmm2, xmm1
unpcklps xmm3, xmm2
Avoid this pattern and use _mm_load(u)_ps where possible. That's easy in this case:
s1 = _mm_loadu_ps(channelsPeak);
// later
s2 = _mm_loadu_ps(channelsPeak + 4);
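To illustrate that the two forms produce the same vector, here is a small sketch with both side by side. The helper names load4_set and load4_loadu are hypothetical and exist only for this comparison:

```cpp
#include <emmintrin.h>

// Hypothetical helper: the pattern from the question.
inline __m128 load4_set(const float* p) {
    // _mm_set_ps lists its arguments from the highest lane down to the lowest,
    // so p[0] ends up in lane 0.
    return _mm_set_ps(p[3], p[2], p[1], p[0]);
}

// Hypothetical helper: the suggested replacement.
inline __m128 load4_loadu(const float* p) {
    // One unaligned load: lane 0 = p[0], ..., lane 3 = p[3].
    return _mm_loadu_ps(p);
}
```

Both return the same lanes in the same order; the difference is only in the instructions the compiler is likely to emit.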
Zeroing
Initializing local variables of type __m128 like this is fine:
__m128 s1 = _mm_setzero_ps();
There's no serious problem with that, but it does not do anything useful in this code; it's just redundant. Of course, if the zero value is used as an input to something, then the initialization is needed. But here it can just as well be skipped by declaring the variable only when you have a value to assign to it, for example:
__m128 s1 = _mm_loadu_ps(channelsPeak);
...
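Putting the three points together, the function could look roughly like the sketch below. This is not the original code: the name peaksToDb and the convert parameter are hypothetical (convert stands in for lin2db so that the block compiles on its own), the arrays are passed as parameters instead of being globals, and unaligned loads and stores are used because the question does not state how the arrays are aligned.

```cpp
#include <emmintrin.h>

// Sketch only. `convert` is a stand-in for lin2db from the question; any
// callable taking and returning __m128 works.
template <typename Convert>
void peaksToDb(const float* channelsPeak, float* peak_dB, int cCount,
               Convert convert) {
    // Channels 1-4: one load, one conversion, one store.
    __m128 s1 = _mm_loadu_ps(channelsPeak);
    _mm_storeu_ps(peak_dB, convert(s1));

    if (cCount > 4) {
        // Channels 5-8: the same steps, offset by four elements.
        __m128 s2 = _mm_loadu_ps(channelsPeak + 4);
        _mm_storeu_ps(peak_dB + 4, convert(s2));
    }
}
```

With the question's globals, a call would be along the lines of peaksToDb(channelsPeak, peak_dB, cCount, lin2db).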