I have a control problem that I am simulating in C. The premise is that I have some continuous-time signal s(t), and I am given some length-150 array c = [c(0), c(dt), c(2dt), ...] of controls. The array is a discretized version of a continuous-time control c(t), but I am not actually given access to the full c(t), only to the array c with a fixed value of dt.

Given c and dt, I create an array s = [s(0), s(dt), s(2dt), ...]. I then need to compute the following function:
#include <math.h>
#include <complex.h>
#define SEQUENCE_LENGTH 150
double f(double s[], double c[], double dt) {
// s and c are both length SEQUENCE_LENGTH arrays
double complex alpha = 1. / sqrt(2) + 0.0 * I;
double complex beta = 1. / sqrt(2) + 0.0 * I;
double complex alphanew;
double complex betanew;
double x, y, norm, co, si;
for (int j = 0; j < SEQUENCE_LENGTH; j++) {
x = c[j]; y = s[j];
norm = sqrt(x*x + y*y);
co = cos(norm * dt);
si = sin(norm * dt) / norm;
alphanew = co * alpha - I * si * (y * alpha + x * beta);
betanew = co * beta - I * si * (x * alpha - y * beta);
alpha = alphanew; beta = betanew;
}
return (1 / 2.) * pow(cabs(alpha + beta), 2);
}
This function is called on the order of a hundred million times for various different c and s, and it is the largest bottleneck in my simulation, so I really need to make it as fast as possible. Does anybody have any suggestions/tips/tricks to speed up this code?

Various thoughts. Currently, I compile with gcc with the flags -Wall -Wextra -Werror -O3 -ffast-math.
- Because I know SEQUENCE_LENGTH is 150, manually unrolling part of the loop would be straightforward, but I'm sure the compiler already does this.
- I could perhaps vectorize some of the calculations in the loop, though again I assume the compiler is already doing this. Nonetheless, I would still be interested in seeing an explicit vectorization of my code (for academic purposes), but I am not very familiar with vectorization.
- Is there any way to parallelize this code, e.g. with #pragma omp parallel? I don't see an obvious way.
1 Answer
I really need to make it as fast as possible.
Does anybody have any suggestions/tips/tricks to speed up this code?
Make a test harness

Write test code that rates both the performance and the precision correctness.

There are simply too many factors for a general "fastest"; without measurement we are left with mere "faster" suggestions. Yet with a use-case-specific test harness, we can reliably improve things (simpler trig/power functions, etc.).
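A minimal sketch of such a harness, assuming f() and SEQUENCE_LENGTH from the question are in scope; the fill values, iteration count, and the omitted reference-precision check are placeholders to adapt:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    static double s[SEQUENCE_LENGTH], c[SEQUENCE_LENGTH];
    srand(42);
    for (int j = 0; j < SEQUENCE_LENGTH; j++) {
        c[j] = 2.0 * rand() / RAND_MAX - 1.0;  // placeholder data in [-1, 1]
        s[j] = 2.0 * rand() / RAND_MAX - 1.0;
    }
    const double dt = 0.01;

    clock_t t0 = clock();
    double sum = 0.0;
    for (long i = 0; i < 1000000L; i++) {
        // In a real harness, vary the inputs per call so the compiler
        // cannot hoist the computation out of the loop.
        sum += f(s, c, dt);
    }
    clock_t t1 = clock();
    printf("time: %g s  checksum: %.17g\n",
           (double) (t1 - t0) / CLOCKS_PER_SEC, sum);
    // For precision: also compare f() against a trusted reference here.
    return 0;
}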
float vs. double

Often the float functions are faster. Use float objects and float functions like sqrtf(), sinf(), powf(), ....

Do you really need double precision? What is the precision goal?
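For illustration, a sketch of a float variant of the question's function, assuming reduced precision turns out to be acceptable (the name f_float is hypothetical):

#include <math.h>
#include <complex.h>
#define SEQUENCE_LENGTH 150

float f_float(const float * restrict s, const float * restrict c, float dt) {
    // 1/sqrt(2) to float precision
    float complex alpha = 0.70710678f + 0.0f * I;
    float complex beta  = 0.70710678f + 0.0f * I;
    for (int j = 0; j < SEQUENCE_LENGTH; j++) {
        float x = c[j], y = s[j];
        float norm = sqrtf(x * x + y * y);
        float co = cosf(norm * dt);
        float si = sinf(norm * dt) / norm;  // same norm == 0 pitfall as below
        float complex alphanew = co * alpha - I * si * (y * alpha + x * beta);
        float complex betanew  = co * beta  - I * si * (x * alpha - y * beta);
        alpha = alphanew;
        beta = betanew;
    }
    float m = cabsf(alpha + beta);
    return 0.5f * m * m;  // m * m instead of powf(m, 2)
}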
Optimization level
In requesting speed, one needs to identify the required precision, as various speed optimizations trade off correctness for time simply by enabling more aggressive floating-point compile-time optimization levels.

Was -ffast-math really OK to use with an unstated precision goal?
Conversely, hypot(x, y) is often more precise (if slower) than norm = sqrt(x*x + y*y);. Is that OK? OP has left no test harness and simply has not specified.
Range of data

State the range.

When double s[] and double c[] use a subset of the double range, additional optimizations are possible (per @user555045). This should be part of the test harness.
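For example, if the argument norm * dt is known to stay within [-1, 1] (as OP's later comments suggest: dt < 0.1 and norm < 10), short polynomials could stand in for the library sin()/cos(). A sketch using truncated Taylor series; the accuracy (roughly 3e-6 for sin and 3e-7 for cos at |x| = 1) must be validated against the precision goal in the test harness:

// Sketches valid only for small arguments, here |x| <= 1.
static inline double sin_small(double x) {
    double x2 = x * x;
    // x - x^3/6 + x^5/120 - x^7/5040; truncation error ~ |x|^9/362880
    return x * (1.0 + x2 * (-1.0 / 6 + x2 * (1.0 / 120 + x2 * (-1.0 / 5040))));
}

static inline double cos_small(double x) {
    double x2 = x * x;
    // 1 - x^2/2 + x^4/24 - x^6/720 + x^8/40320; error ~ |x|^10/3628800
    return 1.0 + x2 * (-0.5 + x2 * (1.0 / 24 + x2 * (-1.0 / 720 + x2 * (1.0 / 40320))));
}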
Use restrict and const

Assuming s[] and c[] do not overlap, use restrict. This allows the compiler to make additional optimizations.
// double f(double s[], double c[], double dt) {
double f(const double * restrict s, const double * restrict c, double dt) {
Avoid loss of precision when no speed is gained

Consider 1. / sqrt(2). Why isn't it sqrt(0.5)?

Review the effect:

printf("%a\n", 1. / sqrt(2));
printf("%a\n", sqrt(0.5));

The output is not the same. If a speed difference exists, sqrt(0.5) is likely faster.
0x1.6a09e667f3bccp-1
0x1.6a09e667f3bcdp-1
Of course we could use
// double complex alpha = 1. / sqrt(2) + 0.0 * I;
double complex alpha = 0.70710678118654752440084436210485 + 0.0 * I;
The premise is that I have some continuous-time signal s(t) ...
c = [c(0), c(dt), c(2dt), ...]
Architecture re-write

Consider performing the calculation as the data comes in, doing a little bit of the long calculation during each dt.

For real applications, I have found this to be the fastest: distributing the calculation rather than doing it all after all the samples are acquired.

If one dt is sufficient time for the worst case of one iteration of the for (int j = 0; j < SEQUENCE_LENGTH; j++) loop, then it should be easy to implement. Else the code might use 2 threads: one to acquire, one to compute.
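A sketch of what the distributed form might look like; the struct and function names are hypothetical, and the state update is the body of the question's loop folded in one sample at a time:

#include <math.h>
#include <complex.h>

typedef struct {
    double complex alpha;
    double complex beta;
} State;

static void state_init(State *st) {
    st->alpha = sqrt(0.5) + 0.0 * I;
    st->beta  = sqrt(0.5) + 0.0 * I;
}

// Call once per arriving sample pair (x = c[j], y = s[j]), i.e. once per dt.
static void state_step(State *st, double x, double y, double dt) {
    double norm = sqrt(x * x + y * y);
    double co = cos(norm * dt);
    double si = sin(norm * dt) / norm;  // see the norm == 0.0 pitfall below
    double complex a = st->alpha, b = st->beta;
    st->alpha = co * a - I * si * (y * a + x * b);
    st->beta  = co * b - I * si * (x * a - y * b);
}

// After the last sample, the final answer is one cheap call away.
static double state_result(const State *st) {
    double m = cabs(st->alpha + st->beta);
    return 0.5 * m * m;
}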
Watch out for pitfalls

sin(norm * dt) / norm; is perhaps a big problem when norm == 0.0, as it is 0.0/0.0 (a NaN).

Consider safer calculation approaches and avoid code that can fail on edge cases.
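One possible guard, as a sketch: sin(norm * dt) / norm tends to dt as norm tends to 0, so a small-norm fallback avoids the 0.0/0.0 (the 1e-12 threshold is illustrative, not tuned):

double theta = norm * dt;
// sin(theta) / norm --> dt as norm --> 0
si = (norm > 1e-12) ? sin(theta) / norm : dt;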
Good luck.
- Thank you for this very nice answer! A couple of clarification questions. Regarding range of data: given assumptions on the range of the input to e.g. sin/cos, is there a standard way (e.g. a standard library function) to calculate sin/cos, or should I look up implementations and hard-code them? Regarding architecture re-write: I'm a little bit confused by what you mean by this -- could you expand upon it? (Joe, Apr 1 at 18:32)
- @Joe, the standard library does not offer limited-range trig functions, so there is no standard way. What is the limited range in your use case? This is useful to narrow the wide range of alternatives. (chux, Apr 1 at 18:35)
- Thanks! In my case, it is safe to assume that dt < 0.1 and norm < 10, so the argument of sine and cosine should always be < 1. (Joe, Apr 1 at 18:38)
- One other question, if you don't mind :) Relating to what you said about distributing the calculation: is there a way to compute, in parallel, arrays of the values of the cosine and sine, and to proceed with the for loop only as the relevant values are completed? For example, in one thread, create the array [cos(norm[0]*dt), cos(norm[1]*dt), ...], and in another thread, once the j-th index of the array is ready, proceed through the j-th iteration of the alpha, beta for loop? This is something I'd love to learn how to do if it is helpful. (Joe, Apr 1 at 18:39)
\$\begingroup\$ @Joe, architecture re-write implies that this
f()
calculation get distributed throughout the larger unposted code. Consider if onefor (int j = 0; j < SEQUENCE_LENGTH; j++)
occurred perdt
as the data arrived, then the finalf()
result woudl only take one more iteration, 150 times sooner than before. \$\endgroup\$chux– chux2025年04月01日 18:40:38 +00:00Commented Apr 1 at 18:40
- [...] sin is divided by norm, while cos is not?
- [...] sincos though, that's neat. The sine and cosine would be significant obstacles to manual vectorization as well; it's not impossible, but please give any information that might help (e.g. limits on the range of the input to sin/cos, required degree of precision...). Often when there's sin+cos in a loop they can be replaced with vector rotations, but it looks like not this time.
- [...] The c array has entries between -1 and 1. The s array will typically have entries between -1 and 1 as well, although it can in principle get much larger. Regarding precision, I would prefer double, but I guess less precision may be okay. You mentioned that sine and cosine are obstacles to manual vectorization -- can you expand on that? I don't know very much about vectorization, so I'd be curious to learn what exactly you mean here.