I am using a dual channels DAQ card with data stream mode. I wrote some code for analysis/calculation and put them to the main code for operation. However, the FIFO overflow warning sign always occur once its total data reach around 4300 MS (the DAQ on board memory is 8GB). I am well-known that a complicated calculation might retard the system and cause the overflow but all of the works I wrote are necessary to my experiment which means cannot be replaced (or there is more effective code can let me get the same result). Could anyone point out if there are any ways that I can optimize my code? My computer has 64GB RAM and Intel Core i7 processor. I always turn off other unnecessary software when running the data stream code.
This is how I process the data:
- Install the FFTW source code for the Hilbert transform.
- Allocate buffer size for channel1 and channel2.
- Use
for
loop to get ch1 and ch2 data from the main buffer. - Do the
hilbert
on ch2 and calculate itsabsolute
number. - Find the max value of ch1 and
abs(hilbert(ch2))
. - Calculate
max(abs(hilbert(ch2))) / max(ch1)
. Here is the detail of the my calculation code:
float SumBufferData(void* pBuffer, uInt32 u32Size, uInt32 u32SampleBits)
{
// In this routine we sum up all the samples in the buffer. This function
// should be replaced with the user's analysys function
uInt32 i,n;
uInt8* pu8Buffer = NULL;
int16* pi16Buffer = NULL;
int64 i64Sum = 0;
float max1 = 0.0;
float max2 = 0.0;
float min2 = 0.0;
float Corrected =0.0;
float AUC = 0.0;
float* ch1Buffer = NULL;
double* ch2Buffer = NULL;
if ( 8 == u32SampleBits )
{
pu8Buffer = (uInt8 *)pBuffer;
for (i = 0; i < u32Size; i++)
{
i64Sum += pu8Buffer[i];
}
}
else
{
pi16Buffer = (int16 *)pBuffer;
fftw_complex(hilbertedch2[N]);
ch1Buffer = (float*)calloc(u32Size, sizeof(float));
ch2Buffer = (double*)calloc(u32Size, sizeof(double));
// Divide ch1 and ch2 data from pi16Buffer
for (i = 0; i < u32Size/2; i++)
{
ch1Buffer[i] += pi16Buffer[i*2];
ch2Buffer[i] += pi16Buffer[i * 2 + 1];
}
// Here hilbert on the whole ch2
hilbert(ch2Buffer, hilbertedch2);
//Find max value in each segs of ch1 and hilbertedch2
for (i = 0; i < u32Size/2; i++)
{
if (ch1Buffer[i] > max1)
max1 = ch1Buffer[i];
if (abs(hilbertedch2[i][IMAG])> max2)
max2 = abs(hilbertedch2[i][IMAG]);
}
Corrected = max2 / max1; // Calculate the signal correction
if (Corrected > 0.1) // check the threshold and calculate area under curve
AUC += Corrected;
else
AUC = 0;
}
free(ch1Buffer);
free(ch2Buffer);
return AUC;
}
And the source code from FFTW:
void hilbert(const double* in, fftw_complex* out)
{
// copy the data to the complex array
for (int i = 0; i < N; ++i) {
out[i][REAL] = in[i];
out[i][IMAG] = 0;
}
// creat a DFT plan and execute it
fftw_plan plan = fftw_plan_dft_1d(N, out, out, FFTW_FORWARD, FFTW_ESTIMATE);
fftw_execute(plan);
// destroy a plan to prevent memory leak
fftw_destroy_plan(plan);
int hN = N >> 1; // half of the length (N/2)
int numRem = hN; // the number of remaining elements
// multiply the appropriate value by 2
//(those should multiplied by 1 are left intact because they wouldn't change)
for (int i = 1; i < hN; ++i) {
out[i][REAL] *= 2;
out[i][IMAG] *= 2;
}
// if the length is even, the number of the remaining elements decrease by 1
if (N % 2 == 0)
numRem--;
else if (N > 1) {
out[hN][REAL] *= 2;
out[hN][IMAG] *= 2;
}
// set the remaining value to 0
// (multiplying by 0 gives 0, so we don't care about the multiplicands)
memset(&out[hN + 1][REAL], 0, numRem * sizeof(fftw_complex));
// creat a IDFT plan and execute it
plan = fftw_plan_dft_1d(N, out, out, FFTW_BACKWARD, FFTW_ESTIMATE);
fftw_execute(plan);
// do some cleaning
fftw_destroy_plan(plan);
fftw_cleanup();
// scale the IDFT output
for (int i = 0; i < N; ++i) {
out[i][REAL] /= N;
out[i][IMAG] /= N;
}
}
2 Answers 2
Don't allocate memory while processing the data
Reserve space for ch1Buffer
and ch2Buffer
once while your program initializes, so that there is enough to handle any value of u32Size
that you might pass to the function (since you are streaming data, I am guessing that this will be a constant anyway, and there's also the fact that the function hilbert()
doesn't take a size parameter but uses N
instead).
Calculate a plan once and reuse it
Similar to the memory allocation, you should create all the fftw_plan
s you need once, and then reuse them. You can then also use FFTW_MEASURE
to create more optimal plans. If you also need your program to start up fast, consider also using FFTW's wisdom mechanism.
double
or float
?
As chux also mentioned, you are mixing float
and double
variables. Do you need both types? If float
gives you enough precision for what you want to do, use the float
variants of the FFTW functions to get a large speed boost.
Compile with -Ofast -march=native
Use the compiler flags -Ofast -march=native
(or whatever the equivalent is for your compiler). This allows the compiler to vectorize your code using SSE instructions: -march=native
tells it to use the best vector instructions that your CPU supports, and -Ofast
tells it to cut a few corners when it comes to strict compliance to the standard, but typically it allows it to ignore things like denormal floating point numbers and other very unlikely things that might need extra instructions to handle, or that would even prevent vectorization. With these flags, GCC, Clang and ICC will all vectorize the loop that calculates the maximum absolute values.
Note that using fabs()
(or fabsf()
when using float
s) is very important to keep the code efficient. Chux already mentioned that in his answer.
Try splitting the maximum calculation loop in two
You have one loop that scans through two arrays simultaneously. That may or may not be less efficient than splitting this into two loops, one for each array. It is hard to predict which will be more efficient, this depends on how fast the CPU can crunch numbers vs. how much memory bandwidth it can sustain, how good the prefetcher of the CPU is, and how the compilers actually compile the loops. So you need to try out both variants yourself and measure the difference in performance.
Avoid scaling the result of the inverse DFT
Instead of dividing all elements of the array out
before returning from hilbert()
, don't scale the output, but instead just divide max2
by N
after calculating the maximum.
-
\$\begingroup\$ Hi, thanks for the reply. In first and the second parts, do you mean I have to set memory allocation and fftw_plan (does the create of IDFT also?)out of the "void"? \$\endgroup\$Kevin– Kevin2020年12月17日 22:03:31 +00:00Commented Dec 17, 2020 at 22:03
-
\$\begingroup\$ I'm not sure what you mean by "out of the void". The memory allocations and calls to
fftw_plan()
should be done outside the functions that are called to process the stream, so outsideSumBufferData()
andhilbert()
. \$\endgroup\$G. Sliepen– G. Sliepen2020年12月17日 22:17:37 +00:00Commented Dec 17, 2020 at 22:17 -
\$\begingroup\$ I am not really get the last part. Could you specify with code or else, thanks! \$\endgroup\$Kevin– Kevin2020年12月17日 22:57:57 +00:00Commented Dec 17, 2020 at 22:57
-
\$\begingroup\$ I tried to move
fftw_plan plan = fftw_plan_dft_1d(N, out, out, FFTW_FORWARD, FFTW_ESTIMATE);
outside to thehilbert()
but buggy, such asplan
andout
need to redefine. Could you point out how to modify once thefftw_plan plan = fftw_plan_dft_1d(N, out, out, FFTW_FORWARD, FFTW_ESTIMATE);
is outside the void? \$\endgroup\$Kevin– Kevin2020年12月22日 19:46:46 +00:00Commented Dec 22, 2020 at 19:46 -
1\$\begingroup\$ Hi, here are some of my thought after following your guide: 1.Put the memory allocation out of the function didn't make it faster. 2. But move out and reuse the
plan
did upgrade the speed. 3. Splitting to twofor
loops didn't show any difference 4. Scaling the IDFT just dividemax2
byN
also didn't show difference too. But, Thanks! \$\endgroup\$Kevin– Kevin2021年01月06日 16:16:33 +00:00Commented Jan 6, 2021 at 16:16
Could anyone point out if there are any ways that I can optimize my code?
In general, there are no better-than-linear improvements here.
*alloc()
can be expensive. Since code is calling it twice and then freeing the pointers at the same time, save one expensive *alloc()
by doing 1 rather than 2.
With VLA support, it is easy
struct {
float ch1Buffer[u32Size];
double ch2Buffer[u32Size];
} *x;
x = calloc(1, sizeof *x);
... or allocate as an array of struct
.
struct {
float ch1Buffer;
double ch2Buffer;
} *x;
x = calloc(u32Size, sizeof *x);
ch1Buffer{]
is not even needed here. Code does not use the array in an array fashion.
Strange Code
Why +=
instead of =
? Assignment just as well here.
// AUC += Corrected;
AUC = Corrected;
Integer function used for FP code
OP commented that hilbertedch2
is an array with double
abs()
is for integers.
// if (abs(hilbertedch2[i][IMAG])> max2)
// max2 = abs(hilbertedch2[i][IMAG]);
if (fabs(hilbertedch2[i][IMAG])> max2)
max2 = fabs(hilbertedch2[i][IMAG]);
I'd expect using double
for max2
.
// float max2 = 0.0;
double max2 = 0.0;
If float Corrected
remains float
, use float
constants.
// if (Corrected > 0.1)
if (Corrected > 0.1f)
Tip: enable all compiler warnings.
Improve review and maintainability
Cast not needed. Size of the referenced object, rather than trying to match the type. Easier to code right, review and maintain.
// ch1Buffer = (float*)calloc(u32Size, sizeof(float));
ch1Buffer = calloc(u32Size, sizeof *ch1Buffer);
-
\$\begingroup\$ Thanks. In the allocation part, the x need to be declared to which kind of variable first? In the AUC part: += means calculate area under curve by summing the data points. \$\endgroup\$Kevin– Kevin2020年12月17日 21:22:41 +00:00Commented Dec 17, 2020 at 21:22
-
\$\begingroup\$ Yes.
some_type *ptr = malloc(sizeof *ptr * n);
works fine. \$\endgroup\$chux– chux2020年12月17日 21:30:05 +00:00Commented Dec 17, 2020 at 21:30 -
\$\begingroup\$ @Kevin
AUC += Corrected
is only used, at most, once. It is not in a loop. It does not sum points \$\endgroup\$chux– chux2020年12月17日 21:31:51 +00:00Commented Dec 17, 2020 at 21:31 -
\$\begingroup\$ Then which is better to present to sum the data points? \$\endgroup\$Kevin– Kevin2020年12月17日 21:37:58 +00:00Commented Dec 17, 2020 at 21:37
-
\$\begingroup\$ @Kevin You need a loop to sum the data points.
for, while, do, ...
\$\endgroup\$chux– chux2020年12月17日 22:01:52 +00:00Commented Dec 17, 2020 at 22:01
hilbertedch2
? \$\endgroup\$