I have a control problem that I am simulating in C. The premise is that I have some continuous-time signal s(t), and I am given some length-150 array c = [c(0), c(dt), c(2dt), ...] of controls. The array is a discretized version of a continuous-time control c(t), but I am not actually given access to the full c(t), only to the array c with a fixed value of dt.

Given c and dt, I create an array s = [s(0), s(dt), s(2dt), ...]. I then need to compute the following function:
#include <math.h>
#include <complex.h>
#define SEQUENCE_LENGTH 150
double f(double s[], double c[], double dt) {
// s and c are both length SEQUENCE_LENGTH arrays
double complex alpha = 1. / sqrt(2) + 0.0 * I;
double complex beta = 1. / sqrt(2) + 0.0 * I;
double complex alphanew;
double complex betanew;
double x, y, norm, co, si;
for (int j = 0; j < SEQUENCE_LENGTH; j++) {
x = c[j]; y = s[j];
norm = sqrt(x*x + y*y);
co = cos(norm * dt);
si = sin(norm * dt) / norm;
alphanew = co * alpha - I * si * (y * alpha + x * beta);
betanew = co * beta - I * si * (x * alpha - y * beta);
alpha = alphanew; beta = betanew;
}
return (1 / 2.) * pow(cabs(alpha + beta), 2);
}
This function is called on the order of a hundred million times for various different c and s, and it is the largest bottleneck in my simulation, so I really need to make it as fast as possible. Does anybody have any suggestions/tips/tricks to speed up this code?

Various thoughts. Currently, I compile with gcc with the flags -Wall -Wextra -Werror -O3 -ffast-math.
- Because I know SEQUENCE_LENGTH is 150, manually unrolling part of the loop would be straightforward, but I'm sure the compiler already does this.
- I could perhaps vectorize some of the calculations in the loop, though again I assume the compiler is already doing this. Nonetheless, I would still be interested in seeing an explicit vectorization of my code (for academic purposes), but I am not very familiar with vectorization.
- Is there any way to parallelize this code, e.g. with #pragma omp parallel? I don't see an obvious way.
1 Answer
I really need to make it as fast as possible.
Does anybody have any suggestions/tips/tricks to speed up this code?
Make a test harness

Write test code that rates both the performance and the precision correctness.

There are simply too many factors for a general "fastest"; without measurement we are left with mere "faster" suggestions. Yet with a use-case-specific test harness, we can reliably improve things (simpler trig/power functions, etc.).
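A minimal sketch of such a harness, assuming f() and SEQUENCE_LENGTH from the question are in scope; the fill values, iteration count, and the omitted reference-precision check are placeholders to adapt:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    static double s[SEQUENCE_LENGTH], c[SEQUENCE_LENGTH];
    srand(42);
    for (int j = 0; j < SEQUENCE_LENGTH; j++) {
        c[j] = 2.0 * rand() / RAND_MAX - 1.0;  // placeholder data in [-1, 1]
        s[j] = 2.0 * rand() / RAND_MAX - 1.0;
    }
    const double dt = 0.01;

    clock_t t0 = clock();
    double sum = 0.0;
    for (long i = 0; i < 1000000L; i++) {
        // In a real harness, vary the inputs per call so the compiler
        // cannot hoist the computation out of the loop.
        sum += f(s, c, dt);
    }
    clock_t t1 = clock();
    printf("time: %g s  checksum: %.17g\n",
           (double) (t1 - t0) / CLOCKS_PER_SEC, sum);
    // For precision: also compare f() against a trusted reference here.
    return 0;
}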
float vs. double

Often the float functions are faster. Use float objects and float functions like sqrtf(), sinf(), powf(), ....

Do you really need double precision? What is the precision goal?
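For illustration, a sketch of a float variant of the question's function, assuming reduced precision turns out to be acceptable (the name f_float is hypothetical):

#include <math.h>
#include <complex.h>
#define SEQUENCE_LENGTH 150

float f_float(const float * restrict s, const float * restrict c, float dt) {
    // 1/sqrt(2) to float precision
    float complex alpha = 0.70710678f + 0.0f * I;
    float complex beta  = 0.70710678f + 0.0f * I;
    for (int j = 0; j < SEQUENCE_LENGTH; j++) {
        float x = c[j], y = s[j];
        float norm = sqrtf(x * x + y * y);
        float co = cosf(norm * dt);
        float si = sinf(norm * dt) / norm;  // same norm == 0 pitfall as below
        float complex alphanew = co * alpha - I * si * (y * alpha + x * beta);
        float complex betanew  = co * beta  - I * si * (x * alpha - y * beta);
        alpha = alphanew;
        beta = betanew;
    }
    float m = cabsf(alpha + beta);
    return 0.5f * m * m;  // m * m instead of powf(m, 2)
}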
Optimization level
In requesting speed, one needs to identify the required precision, as various speed optimizations trade off correctness for time simply by enabling more aggressive floating-point compile-time optimization levels.

Was -ffast-math really OK to use with an unstated precision goal?
Conversely, hypot(x, y) is often more precise (if slower) than norm = sqrt(x*x + y*y);. Is that OK? OP has left no test harness and simply has not specified.
Range of data

State the range.

When double s[] and double c[] use a subset of the double range, additional optimizations are possible (per @user555045). This should be part of the test harness.
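For example, if the argument norm * dt is known to stay within [-1, 1] (as OP's later comments suggest: dt < 0.1 and norm < 10), short polynomials could stand in for the library sin()/cos(). A sketch using truncated Taylor series; the accuracy (roughly 3e-6 for sin and 3e-7 for cos at |x| = 1) must be validated against the precision goal in the test harness:

// Sketches valid only for small arguments, here |x| <= 1.
static inline double sin_small(double x) {
    double x2 = x * x;
    // x - x^3/6 + x^5/120 - x^7/5040; truncation error ~ |x|^9/362880
    return x * (1.0 + x2 * (-1.0 / 6 + x2 * (1.0 / 120 + x2 * (-1.0 / 5040))));
}

static inline double cos_small(double x) {
    double x2 = x * x;
    // 1 - x^2/2 + x^4/24 - x^6/720 + x^8/40320; error ~ |x|^10/3628800
    return 1.0 + x2 * (-0.5 + x2 * (1.0 / 24 + x2 * (-1.0 / 720 + x2 * (1.0 / 40320))));
}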
Use restrict and const

Assuming s[] and c[] do not overlap, use restrict. This allows the compiler to make additional optimizations.
// double f(double s[], double c[], double dt) {
double f(const double * restrict s, const double * restrict c, double dt) {
Avoid loss of precision when no speed is gained

Consider 1. / sqrt(2). Why isn't it sqrt(0.5)?

Review the effect:

printf("%a\n", 1. / sqrt(2));
printf("%a\n", sqrt(0.5));

The output is not the same. If a speed difference exists, sqrt(0.5) is likely faster.
0x1.6a09e667f3bccp-1
0x1.6a09e667f3bcdp-1
Of course we could use
// double complex alpha = 1. / sqrt(2) + 0.0 * I;
double complex alpha = 0.70710678118654752440084436210485 + 0.0 * I;
The premise is that I have some continuous-time signal s(t) ...
c = [c(0), c(dt), c(2dt), ...]
Architecture re-write

Consider performing the calculation as the data comes in, doing a little bit of the long calculation during each dt.

For real applications, I have found this to be the fastest: distributing the calculation rather than doing it all after all the samples are acquired.

If one dt is sufficient time for the worst case of one iteration of the for (int j = 0; j < SEQUENCE_LENGTH; j++) loop, then it should be easy to implement. Else the code might use 2 threads: one to acquire, one to compute.
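A sketch of what the distributed form might look like; the struct and function names are hypothetical, and the state update is the body of the question's loop folded in one sample at a time:

#include <math.h>
#include <complex.h>

typedef struct {
    double complex alpha;
    double complex beta;
} State;

static void state_init(State *st) {
    st->alpha = sqrt(0.5) + 0.0 * I;
    st->beta  = sqrt(0.5) + 0.0 * I;
}

// Call once per arriving sample pair (x = c[j], y = s[j]), i.e. once per dt.
static void state_step(State *st, double x, double y, double dt) {
    double norm = sqrt(x * x + y * y);
    double co = cos(norm * dt);
    double si = sin(norm * dt) / norm;  // see the norm == 0.0 pitfall below
    double complex a = st->alpha, b = st->beta;
    st->alpha = co * a - I * si * (y * a + x * b);
    st->beta  = co * b - I * si * (x * a - y * b);
}

// After the last sample, the final answer is one cheap call away.
static double state_result(const State *st) {
    double m = cabs(st->alpha + st->beta);
    return 0.5 * m * m;
}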
Watch out for pitfalls

sin(norm * dt) / norm; is perhaps a big problem when norm == 0.0, as it is 0.0/0.0 (a NaN).

Consider safer calculation approaches and avoid code that can fail on edge cases.
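One possible guard, as a sketch: sin(norm * dt) / norm tends to dt as norm tends to 0, so a small-norm fallback avoids the 0.0/0.0 (the 1e-12 threshold is illustrative, not tuned):

double theta = norm * dt;
// sin(theta) / norm --> dt as norm --> 0
si = (norm > 1e-12) ? sin(theta) / norm : dt;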
Good luck.
- Thank you for this very nice answer! A couple of clarification questions. Regarding range of data: given assumptions on the range of the input to e.g. sin/cos, is there a standard way (e.g. a standard library function) to calculate sin/cos, or should I look up implementations and hard-code them? Regarding architecture re-write: I'm a little bit confused by what you mean by this -- could you expand upon it? (Joe, Apr 1 at 18:32)
- @Joe, the standard library does not offer limited-range trig functions, so there is no standard way. What is the limited range in your use case? This is useful to narrow the wide range of alternatives. (chux, Apr 1 at 18:35)
- Thanks! In my case, it is safe to assume that dt < 0.1 and norm < 10, so the argument of sine and cosine should always be < 1. (Joe, Apr 1 at 18:38)
- One other question, if you don't mind :) Relating to what you said about distributing the calculation: is there a way to compute, in parallel, arrays of the values of the cosine and sine, and to proceed with the for loop only as the relevant values are completed? For example, in one thread, create the array [cos(norm[0]*dt), cos(norm[1]*dt), ...], and in another thread, once the j-th index of the array is ready, proceed through the j-th iteration of the alpha, beta for loop? This is something I'd love to learn how to do if it is helpful. (Joe, Apr 1 at 18:39)
\$\begingroup\$ @Joe, architecture re-write implies that this
f()
calculation get distributed throughout the larger unposted code. Consider if onefor (int j = 0; j < SEQUENCE_LENGTH; j++)
occurred perdt
as the data arrived, then the finalf()
result woudl only take one more iteration, 150 times sooner than before. \$\endgroup\$chux– chux2025年04月01日 18:40:38 +00:00Commented Apr 1 at 18:40
- [...] sin is divided by norm, while cos is not?
- [...] sincos though, that's neat. The sine and cosine would be significant obstacles to manual vectorization as well; it's not impossible, but please give any information that might help (e.g. limits on the range of the input to sin/cos, required degree of precision...). Often when there's sin+cos in a loop they can be replaced with vector rotations, but it looks like not this time.
- [...] The c array has entries between -1 and 1. The s array will typically have entries between -1 and 1 as well, although it can in principle get much larger. Regarding precision, I would prefer double, but I guess less precision may be okay. You mentioned that sine and cosine are obstacles to manual vectorization -- can you expand on that? I don't know very much about vectorization, so I'd be curious to learn what exactly you mean here.