Fractional indexing and interpolation to reconstruct an image

Question 1

I have a c++ project written as a Matlab Mex program that is constructing an image composed of IxJ pixels. Each pixel value is calculated by finding the contributions to that pixel from all the channels (columns) in a KxL RF data matrix. The exact value of the row index 'k' of a given column 'l' that contributes to the pixel (i,j) can be a fractional value, so I need to interpolate between values in the RF data matrix. For each pixel (i,j), I calculate the row indices across all columns L that contribute to that pixel, interpolate and extract the corresponding values from the RF data matrix, sum them and store them at (i,j). Here are the code chunks that accomplish this:

Main Loop:

for(int i = 0; i < p.na; i++) {
 bfm(p,i,reconFrame,
 SIG(Eigen::seq(p.startSample(i)-1,p.endSample(i)-1),p.ConnMap.array()-1));
 Recon += hilbert(p,reconFrame,p1,p2);
 }

bfm method:

void bfm(const Parameters& p, int& idx, Eigen::Ref<Eigen::MatrixXd> bfFrame,
 const Eigen::Ref<const Eigen::Matrix<short int, -1, -1>>& SIG) {
 #pragma omp parallel for
 for (int i = 0; i < p.nPoints; i++) {
 Eigen::VectorXd idxt(p.numEl); 
 int col = i / p.szZ;
 int row = i % p.szZ;
 calcIDXT(p,p.xCoord(col),p.zCoord(row),idx,idxt);// Calculate delay indices
 calcPoint(idxt,SIG,bfFrame(row,col));
 }
}

CalcIDXT method:

void calcIDXT(const Parameters& p, const double& XPSF, const double& ZPSF, 
 const int& idx, Eigen::Ref<Eigen::VectorXd> idxt) {
 
 double dTX = ( (sgn(-p.TXangle(idx))*p.L/2 - XPSF)*sin(-p.TXangle(idx)) + ZPSF*cos(p.TXangle(idx)))*p.fs/p.c + p.t0*p.fs;
 idxt = ( (XPSF - p.ElemPos.array()).square() + ZPSF*ZPSF ).sqrt()*p.fs/p.c + dTX - 1; 
 idxt = ((abs(XPSF-p.ElemPos.array()) <= ZPSF*0.5/p.fnumber) && (idxt.array() <= (p.szRFframe - 2)) ).select(idxt,0.0);
}

CalcPoint method:

void calcPoint(const Eigen::Ref<const Eigen::VectorXd>& idxt,
 const Eigen::Ref<const Eigen::Matrix<short int, -1, -1>>& SIG,
 double& bfVal) {
 
 for (int j = 0; j < idxt.size(); j++) {
 bfVal += computePoint(idxt.data()+j,SIG.col(j).data());
 }
}

ComputePoint Method:

// pIndex: pointer to the current idxt value.
// pSignsColumn: pointer to the first element in the column of SIG
inline double computePoint( const double* pIndex, const int16_t* pSignsColumn)//, const double* validIDX )
{
 // Load the index value into both lanes of the vector
 __m128d idx = _mm_loaddup_pd( pIndex );
 // Convert into int32 with truncation; this assumes the number there ain't negative.
 const int iFloor = _mm_cvttsd_si32( idx );
 // Compute fractional part
 idx = _mm_sub_pd( idx, _mm_floor_pd( idx ) );
 // Compute interpolation coefficients, they are [ 1.0 - idx, idx ]
 // _mm_set_sd copies 1.0 to lower half and zero to upper half
 // _mm_addsub_pd(a,b) subtracts lower half of b from a and adds upper half of b to a
 idx = _mm_addsub_pd( _mm_set_sd( 1.0 ), idx );
 #define _mm_loadu_si32(p) _mm_cvtsi32_si128(*(const int*)(p))
 // Load two int16_t values from sequential addresses
 const __m128i signsInt = _mm_loadu_si32( pSignsColumn + iFloor );
 // Upcast them to int32, then to fp64
 const __m128d signs = _mm_cvtepi32_pd( _mm_cvtepi16_epi32( signsInt ) );
 // Load validation index into vector
// __m128d vIdx = _mm_loaddup_pd( validIDX );
 // Compute the result
// __m128d res = _mm_mul_pd( idx, signs );
// res = _mm_add_sd( res, _mm_unpackhi_pd( res, res ) );
 // The above 2 lines (3 instructions) can be replaced with the following one:
 const __m128d res = _mm_dp_pd( idx, signs, 0b110001 );
 // It may or may not be better, the dppd instruction is not particularly fast.
 return _mm_cvtsd_f64( res );
}

The main loop calculates multiple images that correspond to different chunks of the RF (SIG) matrix. CalcIDXT calculates the row indices for each channel or column of the RF matrix that correspond to a given point while calcPoint serves as a wrapper for the SIMD intrinsic instructions and iterates through each channel of data. The SIMD method actually performs the interpolation (linear interpolation in this case). There is a hilbert transform of all the data that uses the fftw library.

I'm looking for feedback on best practices and performance here. To give you some idea of the current performance I'm seeing, I used the c++ chrono library to measure the execution time of various chunks of my code. Here are the results:

Parameter initialization: 0.3419
RF Data initialization: 22.7536
bfm total runtime over all loop iterations: 349.5448
Hilbert total runtime over all loop iterations: 45.8686
Sum of above numbers: 418.5088
Total runtime of mex function according to tic/toc in matlab: 423.7401

These are averaged over 100 runs and units are in milliseconds. In this test case, the main loop has 15 iterations so each call to hilbert and bfm are the above values divided by 15. I believe I should be able to do better than this, particularly in the calls to bfm. I suspect I should be able to improve by a factor of 2 at least since a commercial software that performs a similar function is about a factor of 2 faster.

Question 2

What do the "signs" look like? If it's ones and negative ones, that could be useful for perf tricks.

Question 3

Those are poorly named. They aren't signs, they correspond to values from the SIG (RF data) array loaded into a vector register. The RF data array is of int16 data format and needs to be upcast.

Question 4

`dppd`

the dppd instruction is not particularly fast.

Yes, there is a general principle: try to delay horizontal operations until after the loop. That's not always possible, but that principle can be applied here by having computePoint return the product of idx and signs as an __m128d without horizontally summing it, then calcPoint can add up those vectors and only do one horizontal sum, after the loop.

Passing doubles by reference or address

Generally, when primitive types can be passed by value (even if that requires returning a new value), they should be. It's not always bad to pass them by reference, but it's rarely good. In the case of computePoint, it makes sense to pass in pIndex by address, because it's going to get movddup'ed and it's better to do that directly from memory, but in the case of calcPoint I expect that passing bfVal by reference is not good, resulting in a load/store (vaddsd xmm0, xmm0, QWORD PTR [r12] \ vmovsd QWORD PTR [r12], xmm0) to update the variable "at a distance" rather than summing into a register and storing it once in the end. This is also automatically fixed if dppd is removed in the way that I described above.

`SIG.col(j).data()`

I seriously underestimated how bad such an innocent-looking expression could be, anything I just mentioned looks really small compared to this. After compiling the code though (with optimization level -O3), it seems that all or most of this is associated with just that column-data access:

 imul rdx, rbx
 mov r8, QWORD PTR [rbp+8]
 mov rax, QWORD PTR [rbp+0]
 mov rdi, rsp
 mov ecx, 1
 lea rsi, [rax+rdx*2]
 mov rdx, r8
 call Eigen::MapBase<Eigen::Block<Eigen::Ref<Eigen::Matrix<short, -1, -1, 0, -1, -1> const, 0, Eigen::OuterStride<-1> > const, -1, 1, true>, 0>::MapBase(short const*, long, long) [base object constructor]
 mov rdx, QWORD PTR [rbp+24]
 mov QWORD PTR [rsp+24], rbp
 mov QWORD PTR [rsp+32], 0
 mov QWORD PTR [rsp+40], rbx
 mov QWORD PTR [rsp+48], rdx
 cmp QWORD PTR [rbp+16], rbx
 jle .L20
 mov rax, QWORD PTR [r13+0]
 vmovq xmm2, QWORD PTR .LC9[rip]
 mov rcx, QWORD PTR [rsp]
 vmovsd xmm1, QWORD PTR [rax+rbx*8]

Horrible. I'm not sure what to do about it, maybe SIG.data() + j * SIG.rows() works, assuming that the matrix is column-major? You probably know more about Eigen than I do though.

`CalcIDXT`

I don't really know what's going on there, but it may be possible to increase performance by doing all or most of that with vector intrinsics so that the resulting indexes can be fed directly into the interpolation math, rather than going through memory first.

AVX2

It seems to me that a AVX2 version of computePoint is worth investigating, it could do twice the amount of work in (I think) less than twice the amount of time. It wouldn't reach 2x throughput overall because there would be some annoying shuffles and inserts and two independent signs would be loaded, but it could still be a win.

Question 5

I would be careful with SIG.data() + j * SIG.rows(); Eigen allows matrices to have a different stride than the number of rows.

Question 6

@G.Sliepen what do you recommend as a replacement?

Question 7

I'm not sure a better replacement is possible, then again I'm not a big user of Eigen myself.

Question 8

Eigen has a new interface for indexing that looks more like the way matlab does indexing. There's a description in the documentation somewhere that I'm too lazy to find right now for anyone reading this who's curious about it. However, I fortunately have the sizes of SIG stored in my 'p' data structure. I'll just try using that.

Question 9

Thanks for all this feedback. I'll try it later and lyk how it works.

user555045 user555045 12.1k1 gold badge19 silver badges37 bronze badges · Accepted Answer · 2022-03-26 05:20:21Z

`dppd`

the dppd instruction is not particularly fast.

Yes, there is a general principle: try to delay horizontal operations until after the loop. That's not always possible, but that principle can be applied here by having computePoint return the product of idx and signs as an __m128d without horizontally summing it, then calcPoint can add up those vectors and only do one horizontal sum, after the loop.

Passing doubles by reference or address

Generally, when primitive types can be passed by value (even if that requires returning a new value), they should be. It's not always bad to pass them by reference, but it's rarely good. In the case of computePoint, it makes sense to pass in pIndex by address, because it's going to get movddup'ed and it's better to do that directly from memory, but in the case of calcPoint I expect that passing bfVal by reference is not good, resulting in a load/store (vaddsd xmm0, xmm0, QWORD PTR [r12] \ vmovsd QWORD PTR [r12], xmm0) to update the variable "at a distance" rather than summing into a register and storing it once in the end. This is also automatically fixed if dppd is removed in the way that I described above.

`SIG.col(j).data()`

I seriously underestimated how bad such an innocent-looking expression could be, anything I just mentioned looks really small compared to this. After compiling the code though (with optimization level -O3), it seems that all or most of this is associated with just that column-data access:

 imul rdx, rbx
 mov r8, QWORD PTR [rbp+8]
 mov rax, QWORD PTR [rbp+0]
 mov rdi, rsp
 mov ecx, 1
 lea rsi, [rax+rdx*2]
 mov rdx, r8
 call Eigen::MapBase<Eigen::Block<Eigen::Ref<Eigen::Matrix<short, -1, -1, 0, -1, -1> const, 0, Eigen::OuterStride<-1> > const, -1, 1, true>, 0>::MapBase(short const*, long, long) [base object constructor]
 mov rdx, QWORD PTR [rbp+24]
 mov QWORD PTR [rsp+24], rbp
 mov QWORD PTR [rsp+32], 0
 mov QWORD PTR [rsp+40], rbx
 mov QWORD PTR [rsp+48], rdx
 cmp QWORD PTR [rbp+16], rbx
 jle .L20
 mov rax, QWORD PTR [r13+0]
 vmovq xmm2, QWORD PTR .LC9[rip]
 mov rcx, QWORD PTR [rsp]
 vmovsd xmm1, QWORD PTR [rax+rbx*8]

Horrible. I'm not sure what to do about it, maybe SIG.data() + j * SIG.rows() works, assuming that the matrix is column-major? You probably know more about Eigen than I do though.

`CalcIDXT`

I don't really know what's going on there, but it may be possible to increase performance by doing all or most of that with vector intrinsics so that the resulting indexes can be fed directly into the interpolation math, rather than going through memory first.

AVX2

It seems to me that a AVX2 version of computePoint is worth investigating, it could do twice the amount of work in (I think) less than twice the amount of time. It wouldn't reach 2x throughput overall because there would be some annoying shuffles and inserts and two independent signs would be loaded, but it could still be a win.

I would be careful with SIG.data() + j * SIG.rows(); Eigen allows matrices to have a different stride than the number of rows.
I'm not sure a better replacement is possible, then again I'm not a big user of Eigen myself.
Eigen has a new interface for indexing that looks more like the way matlab does indexing. There's a description in the documentation somewhere that I'm too lazy to find right now for anyone reading this who's curious about it. However, I fortunately have the sizes of SIG stored in my 'p' data structure. I'll just try using that.
Thanks for all this feedback. I'll try it later and lyk how it works.

Stack Exchange Network

Fractional indexing and interpolation to reconstruct an image

1 Answer 1

`dppd`

Passing doubles by reference or address

`SIG.col(j).data()`

`CalcIDXT`

AVX2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Fractional indexing and interpolation to reconstruct an image

1 Answer 1

dppd

Passing doubles by reference or address

SIG.col(j).data()

CalcIDXT

AVX2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions

`dppd`

`SIG.col(j).data()`

`CalcIDXT`