I have a c++ project written as a Matlab Mex program that is constructing an image composed of IxJ pixels. Each pixel value is calculated by finding the contributions to that pixel from all the channels (columns) in a KxL RF data matrix. The exact value of the row index 'k' of a given column 'l' that contributes to the pixel (i,j) can be a fractional value, so I need to interpolate between values in the RF data matrix. For each pixel (i,j), I calculate the row indices across all columns L that contribute to that pixel, interpolate and extract the corresponding values from the RF data matrix, sum them and store them at (i,j). Here are the code chunks that accomplish this:
Main Loop:
for(int i = 0; i < p.na; i++) {
bfm(p,i,reconFrame,
SIG(Eigen::seq(p.startSample(i)-1,p.endSample(i)-1),p.ConnMap.array()-1));
Recon += hilbert(p,reconFrame,p1,p2);
}
bfm method:
void bfm(const Parameters& p, int& idx, Eigen::Ref<Eigen::MatrixXd> bfFrame,
const Eigen::Ref<const Eigen::Matrix<short int, -1, -1>>& SIG) {
#pragma omp parallel for
for (int i = 0; i < p.nPoints; i++) {
Eigen::VectorXd idxt(p.numEl);
int col = i / p.szZ;
int row = i % p.szZ;
calcIDXT(p,p.xCoord(col),p.zCoord(row),idx,idxt);// Calculate delay indices
calcPoint(idxt,SIG,bfFrame(row,col));
}
}
CalcIDXT method:
void calcIDXT(const Parameters& p, const double& XPSF, const double& ZPSF,
const int& idx, Eigen::Ref<Eigen::VectorXd> idxt) {
double dTX = ( (sgn(-p.TXangle(idx))*p.L/2 - XPSF)*sin(-p.TXangle(idx)) + ZPSF*cos(p.TXangle(idx)))*p.fs/p.c + p.t0*p.fs;
idxt = ( (XPSF - p.ElemPos.array()).square() + ZPSF*ZPSF ).sqrt()*p.fs/p.c + dTX - 1;
idxt = ((abs(XPSF-p.ElemPos.array()) <= ZPSF*0.5/p.fnumber) && (idxt.array() <= (p.szRFframe - 2)) ).select(idxt,0.0);
}
CalcPoint method:
void calcPoint(const Eigen::Ref<const Eigen::VectorXd>& idxt,
const Eigen::Ref<const Eigen::Matrix<short int, -1, -1>>& SIG,
double& bfVal) {
for (int j = 0; j < idxt.size(); j++) {
bfVal += computePoint(idxt.data()+j,SIG.col(j).data());
}
}
ComputePoint Method:
// pIndex: pointer to the current idxt value.
// pSignsColumn: pointer to the first element in the column of SIG
inline double computePoint( const double* pIndex, const int16_t* pSignsColumn)//, const double* validIDX )
{
// Load the index value into both lanes of the vector
__m128d idx = _mm_loaddup_pd( pIndex );
// Convert into int32 with truncation; this assumes the number there ain't negative.
const int iFloor = _mm_cvttsd_si32( idx );
// Compute fractional part
idx = _mm_sub_pd( idx, _mm_floor_pd( idx ) );
// Compute interpolation coefficients, they are [ 1.0 - idx, idx ]
// _mm_set_sd copies 1.0 to lower half and zero to upper half
// _mm_addsub_pd(a,b) subtracts lower half of b from a and adds upper half of b to a
idx = _mm_addsub_pd( _mm_set_sd( 1.0 ), idx );
#define _mm_loadu_si32(p) _mm_cvtsi32_si128(*(const int*)(p))
// Load two int16_t values from sequential addresses
const __m128i signsInt = _mm_loadu_si32( pSignsColumn + iFloor );
// Upcast them to int32, then to fp64
const __m128d signs = _mm_cvtepi32_pd( _mm_cvtepi16_epi32( signsInt ) );
// Load validation index into vector
// __m128d vIdx = _mm_loaddup_pd( validIDX );
// Compute the result
// __m128d res = _mm_mul_pd( idx, signs );
// res = _mm_add_sd( res, _mm_unpackhi_pd( res, res ) );
// The above 2 lines (3 instructions) can be replaced with the following one:
const __m128d res = _mm_dp_pd( idx, signs, 0b110001 );
// It may or may not be better, the dppd instruction is not particularly fast.
return _mm_cvtsd_f64( res );
}
The main loop calculates multiple images that correspond to different chunks of the RF (SIG) matrix. CalcIDXT calculates the row indices for each channel or column of the RF matrix that correspond to a given point while calcPoint serves as a wrapper for the SIMD intrinsic instructions and iterates through each channel of data. The SIMD method actually performs the interpolation (linear interpolation in this case). There is a hilbert transform of all the data that uses the fftw library.
I'm looking for feedback on best practices and performance here. To give you some idea of the current performance I'm seeing, I used the c++ chrono library to measure the execution time of various chunks of my code. Here are the results:
Parameter initialization: 0.3419
RF Data initialization: 22.7536
bfm total runtime over all loop iterations: 349.5448
Hilbert total runtime over all loop iterations: 45.8686
Sum of above numbers: 418.5088
Total runtime of mex function according to tic/toc in matlab: 423.7401
These are averaged over 100 runs and units are in milliseconds. In this test case, the main loop has 15 iterations so each call to hilbert and bfm are the above values divided by 15. I believe I should be able to do better than this, particularly in the calls to bfm. I suspect I should be able to improve by a factor of 2 at least since a commercial software that performs a similar function is about a factor of 2 faster.
-
\$\begingroup\$ What do the "signs" look like? If it's ones and negative ones, that could be useful for perf tricks. \$\endgroup\$user555045– user5550452022年03月26日 03:46:30 +00:00Commented Mar 26, 2022 at 3:46
-
\$\begingroup\$ Those are poorly named. They aren't signs, they correspond to values from the SIG (RF data) array loaded into a vector register. The RF data array is of int16 data format and needs to be upcast. \$\endgroup\$drakon101– drakon1012022年03月26日 05:12:33 +00:00Commented Mar 26, 2022 at 5:12
1 Answer 1
dppd
the dppd instruction is not particularly fast.
Yes, there is a general principle: try to delay horizontal operations until after the loop. That's not always possible, but that principle can be applied here by having computePoint
return the product of idx
and signs
as an __m128d
without horizontally summing it, then calcPoint
can add up those vectors and only do one horizontal sum, after the loop.
Passing doubles by reference or address
Generally, when primitive types can be passed by value (even if that requires returning a new value), they should be. It's not always bad to pass them by reference, but it's rarely good. In the case of computePoint
, it makes sense to pass in pIndex
by address, because it's going to get movddup
'ed and it's better to do that directly from memory, but in the case of calcPoint
I expect that passing bfVal
by reference is not good, resulting in a load/store (vaddsd xmm0, xmm0, QWORD PTR [r12] \ vmovsd QWORD PTR [r12], xmm0
) to update the variable "at a distance" rather than summing into a register and storing it once in the end. This is also automatically fixed if dppd
is removed in the way that I described above.
SIG.col(j).data()
I seriously underestimated how bad such an innocent-looking expression could be, anything I just mentioned looks really small compared to this. After compiling the code though (with optimization level -O3
), it seems that all or most of this is associated with just that column-data access:
imul rdx, rbx
mov r8, QWORD PTR [rbp+8]
mov rax, QWORD PTR [rbp+0]
mov rdi, rsp
mov ecx, 1
lea rsi, [rax+rdx*2]
mov rdx, r8
call Eigen::MapBase<Eigen::Block<Eigen::Ref<Eigen::Matrix<short, -1, -1, 0, -1, -1> const, 0, Eigen::OuterStride<-1> > const, -1, 1, true>, 0>::MapBase(short const*, long, long) [base object constructor]
mov rdx, QWORD PTR [rbp+24]
mov QWORD PTR [rsp+24], rbp
mov QWORD PTR [rsp+32], 0
mov QWORD PTR [rsp+40], rbx
mov QWORD PTR [rsp+48], rdx
cmp QWORD PTR [rbp+16], rbx
jle .L20
mov rax, QWORD PTR [r13+0]
vmovq xmm2, QWORD PTR .LC9[rip]
mov rcx, QWORD PTR [rsp]
vmovsd xmm1, QWORD PTR [rax+rbx*8]
Horrible. I'm not sure what to do about it, maybe SIG.data() + j * SIG.rows()
works, assuming that the matrix is column-major? You probably know more about Eigen than I do though.
CalcIDXT
I don't really know what's going on there, but it may be possible to increase performance by doing all or most of that with vector intrinsics so that the resulting indexes can be fed directly into the interpolation math, rather than going through memory first.
AVX2
It seems to me that a AVX2 version of computePoint
is worth investigating, it could do twice the amount of work in (I think) less than twice the amount of time. It wouldn't reach 2x throughput overall because there would be some annoying shuffles and inserts and two independent signs would be loaded, but it could still be a win.
-
\$\begingroup\$ I would be careful with
SIG.data() + j * SIG.rows()
; Eigen allows matrices to have a different stride than the number of rows. \$\endgroup\$G. Sliepen– G. Sliepen2022年03月26日 10:20:00 +00:00Commented Mar 26, 2022 at 10:20 -
\$\begingroup\$ @G.Sliepen what do you recommend as a replacement? \$\endgroup\$user555045– user5550452022年03月26日 10:28:16 +00:00Commented Mar 26, 2022 at 10:28
-
\$\begingroup\$ I'm not sure a better replacement is possible, then again I'm not a big user of Eigen myself. \$\endgroup\$G. Sliepen– G. Sliepen2022年03月26日 10:37:59 +00:00Commented Mar 26, 2022 at 10:37
-
\$\begingroup\$ Eigen has a new interface for indexing that looks more like the way matlab does indexing. There's a description in the documentation somewhere that I'm too lazy to find right now for anyone reading this who's curious about it. However, I fortunately have the sizes of SIG stored in my 'p' data structure. I'll just try using that. \$\endgroup\$drakon101– drakon1012022年03月26日 15:59:32 +00:00Commented Mar 26, 2022 at 15:59
-
\$\begingroup\$ Thanks for all this feedback. I'll try it later and lyk how it works. \$\endgroup\$drakon101– drakon1012022年03月26日 16:00:55 +00:00Commented Mar 26, 2022 at 16:00