
[While this is a self-answered question, I will happily upvote and accept any alternative answer that either provides superior accuracy for the same amount of computational work or reduces the amount of computational work while maintaining the same level of accuracy.]

I have previously demonstrated how to compute the complementary error function erfcf() with a maximum error of less than three ulps. This can serve as a building block for additional functions, such as the CDF of the standard normal distribution, Φ(x) = ½ erfc(-√½ x), or the Gaussian Q-function, Q(x) = 1 - Φ(x) = ½ erfc(√½ x). However, for some use cases computation fully accurate to single precision is not needed, while the contribution of erfc() evaluations to total run time is not negligible.
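
For reference, both follow from erfcf() as one-line wrappers; a minimal sketch (the function names normal_cdf and gauss_q are mine):

#include <math.h>

// CDF of the standard normal distribution: Phi(x) = 1/2 erfc(-x / sqrt(2))
static float normal_cdf (float x)
{
    return 0.5f * erfcf (-x * 0.70710678f); // 1/sqrt(2)
}

// Gaussian Q-function: Q(x) = 1 - Phi(x) = 1/2 erfc(x / sqrt(2))
static float gauss_q (float x)
{
    return 0.5f * erfcf (x * 0.70710678f);
}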

The literature provides various low-accuracy approximations to the complementary error function, but they are either limited to a subset of the full input domain, or optimized for absolute error, or computationally too complex, for example by requiring multiple invocations of transcendental functions. How could one go about implementing erfcf() with high performance and with a relative accuracy of about 5 decimal digits across the entire input domain?

asked Jan 1, 2024 at 8:57
  • What is the purpose of this question and answer? If you want feedback, you need to post to codereview.stackexchange.com. Commented Jan 1, 2024 at 12:52
  • @gulpr: The purpose is to put information on record, furthering Stack Overflow’s goal of being a durable repository of programming questions and answers. Stack Overflow encourages self-answered questions. Commented Jan 1, 2024 at 13:08
  • @njuffa, are you trying to make a faster, yet perhaps less accurate, erfc() for float than the standard library erfcf(), or re-inventing one from scratch as an exercise? In other words, why would one want to use this fast_erfcf() rather than erfcf(), which is 70% faster and has 2.5x better accuracy? Commented Jan 1, 2024 at 15:29
  • Reference: wolfram erfc() Commented Jan 1, 2024 at 15:31
  • @chux-ReinstateMonica: Example benchmark: sum erfc() function values for all floats between zero and 10.5f in 1 ulp increments. Xeon W-2133, MSVC 19.27.29112, cl /W4 /O2 /arch:AVX512 /favor:INTEL64 /fp:precise. Using MSVC's built-in erfcf(): 44.5 seconds. Using fast_erfcf() with USE_FMA=0: 26.8 seconds. Using fast_erfcf() with USE_FMA=1: 20.5 seconds. This is just an example, and your use case may suggest other kinds of benchmarking. Commented Jan 1, 2024 at 20:58

5 Answers


How could one go about implementing erfcf() with high performance and with a relative accuracy of about 5 decimal digits across the entire input domain?

This answer, like OP's answer, achieves about 5 decimal digits of relative accuracy*1. Achieving at least 5 digits would likely require another term in the rational polynomial, which I doubt would slow performance much.

Special values

In addition to performance and accuracy, the results of my_erfc(x) for select values should be considered.

For x near 0.0, the result should be exactly 1.0. This follows the principle of least surprise.

Example: a user may readily rely on my_cos(0.0) == 1.0 or my_exp(0.0) == 1.0, so my_erfc(0.0) == 1.0 should hold as well.

Fortunately this is not hard to accomplish when a rational polynomial is used whose constant coefficients are 1.0. See below.

Take advantage of the linear region

About 45% of all float values are such that erfc() reduces to a simple linear expression with high accuracy.

#define TWO_OVER_ROOT_PI 1.1283791670955125738961589031215f
#define ERFC_SMALL 0.0053854f

float fast_erfcf_alt(float x) {
    if (fabsf(x) < ERFC_SMALL) {
        return 1.0f - TWO_OVER_ROOT_PI * x;
    }
    ...
}
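
The cutoff itself can be justified from the Taylor series erfc(x) = 1 - (2/√π)(x - x³/3 + …): at |x| = 0.0053854 the first neglected term, (2/(3√π))·|x|³ ≈ 5.9e-8, is already below half an ulp of 1.0f (2^-24 ≈ 6.0e-8), so the linear form is accurate essentially to the last bit (my arithmetic, for the record).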

Define performance rating

One could rate code by its worst case: what is the slowest evaluation over all float x?

I suspect OP is more interested in an assessment from x = 0.0 up to where erfc(x) becomes 0.0 (around x == 10), in a linear progression.

Sample performance results and test code follow. Of course, it is hard to beat the C library's crafted code.

0 time: 0.594 erfcf (Library function)
1 time: 0.703 fast_erfcf (OP's answer)
2 time: 0.688 fast_erfcf_alt (This answer)

#include <stdio.h>
#include <time.h>
#include <math.h>

float fast_erfcf(float x);      /* OP's answer */
float fast_erfcf_alt(float x);  /* this answer */

int main(void) {
    float (*f[3])(float) = { erfcf, fast_erfcf, fast_erfcf_alt };
    for (int i = 0; i < 3; i++) {
        clock_t c0, c1;
        volatile float sink; // keep the calls from being optimized away
        c0 = clock();
        float dx = 10.0f - nextafterf(10.0f, 0.0f); // 1 ulp at 10.0
        for (float x = 10.0f; x >= 0.0f; x -= dx) {
            sink = f[i](x);
        }
        (void) sink;
        c1 = clock();
        printf("%d time: %g\n", i, (double) (c1 - c0) / CLOCKS_PER_SEC);
    }
}

The following code is modeled after OP's self-answer and incorporates the ideas mentioned above.

#define TWO_OVER_ROOT_PI 1.1283791670955125738961589031215f
#define ERFC_SMALL 0.0053854f

float fast_erfcf_alt(float x) {
    if (fabsf(x) < ERFC_SMALL) {
        return 1.0f - TWO_OVER_ROOT_PI * x;
    }
    float a, c, e, p, q, r, s;
    a = fabsf(x);
    c = fminf(a, 10.5f);
    s = -c * c;
#if USE_BUILTIN_EXP
    e = expf(s);
#else // USE_BUILTIN_EXP
    e = my_expf(s);
#endif // USE_BUILTIN_EXP
    q = 0.374177223624056f;
    p = -5.00032254520701E-05f;
    q = q * c + 1.29051354328887f;
    p = p * c + 0.212358010453875f;
    q = q * c + 1.84437448399707f;
    p = p * c + 0.715675302663111f;
    q = q * c + 1.0f;
    p = p * c + 1.0f;
    r = e / q;
    r = r * p;
    if (x < 0.0f)
        r = 2.0f - r;
    if (isnan(x))
        r = x + x;
    return r;
}

Note: I'd consider eliminating c = fminf(a, 10.5f);.

This code's error results, using the test harness from OP's answer
(#define USE_FMA (0) and #define USE_BUILTIN_EXP (1)), are shown below.

Its relative error (1.20e-05) vs. OP's error (1.06e-05) is primarily due to using 1.0 as the constant coefficient in both polynomials p and q.

Other code could use OP's answer's p and q together with this answer's if (fabsf(x) < ERFC_SMALL) test for the best of both; a sketch follows. Additional work is needed to see how well the linear and rational-polynomial sections meet.
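
A minimal sketch of that combination (untested; the seam at ERFC_SMALL would need re-validation):

#include <math.h>

#define TWO_OVER_ROOT_PI 1.1283791670955125738961589031215f
#define ERFC_SMALL 0.0053854f

float fast_erfcf (float x); // OP's rational approximation (see OP's answer)

// hypothetical best-of-both: linear early-out plus OP's coefficients
float fast_erfcf_hybrid (float x)
{
    if (fabsf (x) < ERFC_SMALL) {
        return 1.0f - TWO_OVER_ROOT_PI * x; // linear region
    }
    return fast_erfcf (x);
}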

 * This answer
 * max rel err = 1.203212e-05 @ 9.19353199e+00
 * max abs err = 9.690295e-06 @ -7.19872937e-02
 * max ulp err = 1.877379e+02 @ 8.03564072e+00
 *
 * OP's answer
 * max rel err = 1.061202e-05 @ 4.35268253e-01
 * max abs err = 9.398669e-06 @ -9.19871107e-02
 * max ulp err = 1.755724e+02 @ 2.46583080e+00

As an extra, below is a graphical view of OP's error over the [0 ... ~10] range. I found it illuminating to see where the errors develop.

[Plot: erfc approximation error over the [0 ... ~10] range]


*1 High relative accuracy is not achievable when the result falls in the subnormal range.

answered Jan 10, 2024 at 15:29

7 Comments

I will look at the updated answer shortly. FWIW, triggered by your earlier comment about the lack of a smooth transition at zero (due to minimax approximations having error peaks at both ends, one end being at zero here), I had started to investigate what would happen to accuracy if I constrain the leading coefficients of both numerator and denominator to 1.0f. Based on your accuracy data here, it seems your approach works somewhat better, which is interesting.
@njuffa, "if I constrain the leading coefficients of both numerator and denominator to 1.0f" --> might get a like result as this answer.
My attempts to generate rational approximations incorporating that constraint have all resulted in utter failure (probably due to misconceptions on my part how to go about it; this is not something I normally do). I have verified the stated error bounds for your set of coefficients on my local system, and am presently running the set through my heuristic tuner to see how close it can move the error bounds towards those provided in my answer.
@njuffa FWIW, your question has been fun and a learning experience for me.
Thank you for the helpful answer. My approximation tuner with USE_BUILTIN_EXP==1 converged to this: /* max rel err = 1.137295e-5 @ 8.01219559; max abs err = 9.512059e-6 @ -6.88479543e-2; max ulp err = 188.2552 @ 4.55690527 */ q = 3.74177814e-1f; p = -4.98741865e-5f; q = q * c + 1.29051733e+0f; p = p * c + 2.12357596e-1f; q = q * c + 1.84437072e+0f; p = p * c + 7.15674818e-1f; q = q * c + 1.0f; p = p * c + 1.0f;

The following assumes a platform that complies with the IEEE-754 (2008) floating-point standard, on which float is mapped to the IEEE-754 binary32 format, and that uses the same endianness for 32-bit integers and float. It is further assumed that a C toolchain is available which (by setting appropriate command-line switches, if need be) preserves IEEE-754 semantics. I used the Intel C/C++ compiler with the switches -march=skylake-avx512 -O3 -fp-model=precise.

As the complementary error function is symmetric about (0,1), one can focus on function inputs in the positive half-plane. Here the function decays roughly like exp(-x²), and with float computation it underflows to zero for arguments x > 10.5. If one plots erfc(x) / exp(-x²) on [0, 10.5], the shape suggests that it is somewhat difficult to approximate with a polynomial, but should be easily approximated by a rational function, that is, the ratio of two polynomials. Some initial experiments showed that two polynomials, each of degree 3, should be sufficient to achieve five-digit accuracy.
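
One quick way to eyeball the function being approximated is to print a few samples of erfc(x) / exp(-x²) = erfc(x) · exp(x²); a throwaway check, computed in double to avoid premature underflow:

#include <stdio.h>
#include <math.h>

/* print samples of erfc(x) * exp(x*x) on [0, 10.5]; this is the function
   the rational approximation R(x) targets */
int main (void)
{
    for (double x = 0.0; x <= 10.5; x += 0.5) {
        printf ("%5.2f  %.8f\n", x, erfc (x) * exp (x * x));
    }
    return 0;
}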

While there are many tools available that can generate polynomial approximations, this is unfortunately not the case for rational approximations. I used a modification of the Remez algorithm to generate an initial minimax approximation R(x) = P(x)/Q(x) to erfc(x) / exp(-x²), but had to follow up with a fairly extensive heuristic search to arrive at an approximation that provides close to equi-oscillation of the error peaks and almost achieves a relative error of 10⁻⁵, with the remaining differences being negligible for my needs.

By computing erfc(x) = exp(-x²) · R(x), the accuracy achieved is obviously dependent on the accuracy of a platform's expf() implementation. In my experience, faithfully-rounded implementations of this function (with maximum error <= 1 ulp) are common. While the Intel compiler comes with a highly accurate math library that provides a nearly correctly-rounded implementation (maximum error very close to 0.5 ulps), I also tried my own faithfully-rounded alternative my_expf() with larger error and observed only a very minor impact on the accuracy of fast_erfcf().

#include <stdlib.h>
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

#define USE_FMA (1)
#define USE_BUILTIN_EXP (0)

#if !USE_BUILTIN_EXP
float my_expf (float a);
#endif // USE_BUILTIN_EXP

/* Fast computation of the complementary error function. For argument x > 0
   erfc(x) = exp(-x*x) * P(x) / Q(x), where P(x) and Q(x) are polynomials.
   If expf() is faithfully rounded, the following error bounds should hold:
   maximum relative error < 1.065e-5, maximum absolute error < 9.50e-6, and
   maximum ulp error < 176.5
*/
float fast_erfcf (float x)
{
    float a, c, e, p, q, r, s;
    a = fabsf (x);
    c = fminf (a, 10.5f);
    s = -c * c;
#if USE_BUILTIN_EXP
    e = expf (s);
#else // USE_BUILTIN_EXP
    e = my_expf (s);
#endif // USE_BUILTIN_EXP
#if USE_FMA
    q = 3.82346243e-1f; // 0x1.8785c6p-2
    p = -4.38094139e-5f; // -0x1.6f8000p-15
    q = fmaf (q, c, 1.30382288e+0f); // 0x1.4dc756p+0
    p = fmaf (p, c, 2.16852024e-1f); // 0x1.bc1ceap-3
    q = fmaf (q, c, 1.85278833e+0f); // 0x1.da5056p+0
    p = fmaf (p, c, 7.23953605e-1f); // 0x1.72aa0cp-1
    q = fmaf (q, c, 9.99991655e-1f); // 0x1.fffee8p-1
    p = fmaf (p, c, 1.00000000e+0f); // 0x1.000000p+0
#else // USE_FMA
    q = 3.82346272e-1f; // 0x1.8785c8p-2f
    p = -4.38243151e-5f; // -0x1.6fa000p-15
    q = q * c + 1.30382371e+0f; // 0x1.4dc764p+0
    p = p * c + 2.16852218e-1f; // 0x1.bc1d04p-3
    q = q * c + 1.85278797e+0f; // 0x1.da5050p+0
    p = p * c + 7.23953605e-1f; // 0x1.72aa0cp-1
    q = q * c + 9.99991596e-1f; // 0x1.fffee6p-1
    p = p * c + 1.00000000e+0f; // 0x1.000000p+0
#endif // USE_FMA
    r = e / q;
    r = r * p;
    if (x < 0.0f) r = 2.0f - r;
    if (isnan (x)) r = x + x;
    return r;
}
float uint32_as_float (uint32_t a)
{
    float r;
    memcpy (&r, &a, sizeof r);
    return r;
}

/* Exponential function. Maximum error 0.86565 ulps */
float my_expf (float a)
{
    float f, r, j, s, t;
    int i;
    unsigned int ia;
    // exp(a) = 2**i * exp(f); i = rintf (a / log(2))
    j = fmaf (1.442695f, a, 12582912.f); // 0x1.715476p0 // log2(e)
    j = j - 12582912.f; // 0x1.8p23 // 2**23+2**22
    f = fmaf (j, -6.93145752e-1f, a); // -0x1.62e400p-1 // log_2_hi
    f = fmaf (j, -1.42860677e-6f, f); // -0x1.7f7d1cp-20 // log_2_lo
    i = (int)j;
    // approximate r = exp(f) on interval [-log(2)/2, +log(2)/2]
    r = 1.37805939e-3f; // 0x1.694000p-10
    r = fmaf (r, f, 8.37312452e-3f); // 0x1.125edcp-7
    r = fmaf (r, f, 4.16695364e-2f); // 0x1.555b5ap-5
    r = fmaf (r, f, 1.66664720e-1f); // 0x1.555450p-3
    r = fmaf (r, f, 4.99999851e-1f); // 0x1.fffff6p-2
    r = fmaf (r, f, 1.00000000e+0f); // 0x1.000000p+0
    r = fmaf (r, f, 1.00000000e+0f); // 0x1.000000p+0
    // exp(a) = 2**i * r; scale by 2**i in two steps to avoid intermediate
    // overflow / underflow of the scale factor
    ia = (i > 0) ? 0 : 0x83000000;
    s = uint32_as_float (0x7f000000 + ia);
    t = uint32_as_float ((i << 23) - ia);
    r = r * s;
    r = r * t;
    // handle special cases: severe overflow / underflow
    if (fabsf (a) >= 104.0f) r = (a < 0) ? 0.0f : INFINITY;
    return r;
}
uint32_t float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

uint64_t double_as_uint64 (double a)
{
    uint64_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

double floatUlpErr (float res, double ref)
{
    uint64_t refi, i, j, err;
    int expoRef;

    /* ulp error cannot be computed if either operand is NaN, infinity, zero */
    if (isnan (res) || isnan (ref) || isinf (res) || isinf (ref) ||
        (res == 0.0f) || (ref == 0.0)) {
        return 0.0;
    }
    i = ((int64_t)float_as_uint32 (res)) << 32;
    expoRef = (int)(((double_as_uint64 (ref) >> 52) & 0x7ff) - 1023);
    refi = double_as_uint64 (ref);
    if (expoRef >= 129) {
        j = (refi & 0x8000000000000000ULL) | 0x7fffffffffffffffULL;
    } else if (expoRef < -126) {
        j = ((refi << 11) | 0x8000000000000000ULL) >> 8;
        j = j >> (-(expoRef + 126));
        j = j | (refi & 0x8000000000000000ULL);
    } else {
        j = ((refi << 11) & 0x7fffffffffffffffULL) >> 8;
        j = j | ((uint64_t)(expoRef + 127) << 55);
        j = j | (refi & 0x8000000000000000ULL);
    }
    err = (i < j) ? (j - i) : (i - j);
    return err / 4294967296.0;
}
int main (void)
{
    uint32_t argi, resi, refi, diff;
    float arg, res, reff, abserrloc = NAN, relerrloc = NAN, ulperrloc = NAN;
    double ref, relerr, abserr, ulperr;
    double maxabserr = 0, maxrelerr = 0, maxulperr = 0;
    argi = 0;
    do {
        arg = uint32_as_float (argi);
        ref = erfc ((double)arg);
        res = fast_erfcf (arg);
        reff = (float)ref;
        resi = float_as_uint32 (res);
        refi = float_as_uint32 (reff);
        ulperr = floatUlpErr (res, ref);
        if (ulperr > maxulperr) {
            ulperrloc = arg;
            maxulperr = ulperr;
        }
        abserr = fabs (res - ref);
        if (abserr > maxabserr) {
            abserrloc = arg;
            maxabserr = abserr;
        }
        if (fabs (ref) >= 0x1.0p-126) {
            relerr = fabs ((res - ref) / ref);
            if (relerr > maxrelerr) {
                relerrloc = arg;
                maxrelerr = relerr;
            }
        }
        diff = (resi > refi) ? (resi - refi) : (refi - resi);
        if (diff > 200) {
            printf ("diff=%u @ %15.8e : res=% 15.8e ref=% 15.8e\n",
                    diff, arg, res, ref);
            return EXIT_FAILURE;
        }
        argi++;
    } while (argi);
    printf ("max rel err = %.6e @ % 15.8e\n"
            "max abs err = %.6e @ % 15.8e\n"
            "max ulp err = %.6e @ % 15.8e\n",
            maxrelerr, relerrloc,
            maxabserr, abserrloc,
            maxulperr, ulperrloc);
    return EXIT_SUCCESS;
}
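
(An aside on the ulp-error measurement, as I read the code: floatUlpErr() widens the result's bit pattern to a 64-bit key with 32 extra fraction bits, remaps the double-precision reference into the same binary32-style key space, clamping above the float exponent range and denormalizing below it, and then divides the absolute key difference by 2^32, which yields the error in ulps of the float result.)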
answered Jan 1, 2024 at 8:57

21 Comments

Please append the output you saw. (Approximate run time useful too.)
I saw: max rel err = 1.060158e-05 @ 1.12632227e+00; max abs err = 9.462895e-06 @ -9.60391834e-02; max ulp err = 1.761760e+02 @ 1.08563566e+00. Run time was about 10 minutes.
if (isnan(x)) r = x + x; deserves a comment. What is the goal of this line? if (isnan(x)) return x; in the function top seems like a better alternative.
If you're going to rely on exp, processing the domain side looks like an easier fit: wolframalpha.com/input?i=plot+log%28erfc%28x%29%29+from+0+to+10
@chux-ReinstateMonica I am not sure why NaN-handling would require a special comment. Special cases like NaN should be rare and don't have to be handled fast. In terms of making the common path fast, it is often better to handle them at the end as a conditional overwrite, so that is the pattern I have mostly been using the past 20 years. The x+x is a common idiom that triggers an exception when an SNAN is encountered with FP exceptions unmasked, and forces SNAN to QNAN conversion when FP exceptions are masked, which is IEEE-754 compliant.

OP has provided a comment with the range used to assess performance testing: 0.0 to 10.5 in 1 ULP steps.

As about 40% of all float values, those in [-1.6e-08 ... +1.6e-08], yield a result of about 1.0, code could use the following early-out. For me, the overall performance at least doubled.

float fast_erfcf(float x) {
    if (fabsf(x) <= 1.60e-8f) { // or maybe a bit more
        return 1.0f;
    }
    ...
}
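
A quick sanity check of the 1.60e-8 cutoff (my sketch; assumes round-to-nearest-even):

#include <stdio.h>
#include <math.h>

// verify that erfc(x) rounds to exactly 1.0f at the proposed cutoff
int main (void)
{
    float x = 1.60e-8f;
    double exact = erfc ((double)x); // ~ 1 - 1.8e-8
    printf ("erfc(%g) as float: %.9g\n", x, (float)exact);
    printf ("rounds to 1.0f: %s\n", ((float)exact == 1.0f) ? "yes" : "no");
    return 0;
}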

Since such a large range of float arguments to erfc(x) can return 1.0, I recommend this up-front test.

Advantages include:

  • Much faster for about 40% of all float values.

  • erfc(x_near_0) is expected to return 1.0. OP's fast_erfcf(0.0) returns 1.00000834. Acceptable, given the large ulp error being tolerated; still, 1.0 is nice.

  • OP's fast_erfcf(small negative value) returns 0.999991536. Acceptable, given the large ulp error being tolerated, yet it is desirable to have erfc(x) >= erfc(next x towards +infinity) when possible.

Also: I suggest that performance testing for the most interesting values be conducted in the [0.1 ... 10.0] range.


This suggestion did not help OP much, so I am moving it to the bottom of this answer.

Instead of computing e^x with my_expf(float a), consider computing 2^x with, say, my_exp2f(float a). (This will impact the constants of the rational function.)

Effectively, the multiply in j = fmaf(1.442695f, a, 12582912.f); goes away; j = a + 12582912.f; suffices.

Performing exponentiation is slightly easier with base 2 than base e.

I estimate a few percent faster code (~3%), and not much change in ULP results.

True, a micro-optimization, but it is effectively free as it does not add any real complexity to the linked answer's approach. A sketch follows.
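
To make the suggestion concrete, here is a hypothetical my_exp2f() along the lines of the linked answer's my_expf(). It is an untested sketch: the leading fmaf becomes a plain add, and ln(2) is folded back into the reduced argument so that my_expf's exp(f) core polynomial can be reused; a polynomial fitted directly to 2**f would save that multiply as well.

#include <stdint.h>
#include <string.h>
#include <math.h>

static float uint32_as_float (uint32_t a) // as in the linked answer
{
    float r;
    memcpy (&r, &a, sizeof r);
    return r;
}

/* Hypothetical 2**a, structured like my_expf(); untested sketch */
float my_exp2f (float a)
{
    float f, r, j, s, t;
    int i;
    unsigned int ia;
    // 2**a = 2**i * 2**f; i = rintf (a), via the round-to-int trick
    j = a + 12582912.f; // 0x1.8p23 = 2**23 + 2**22
    j = j - 12582912.f;
    f = (a - j) * 6.93147182e-1f; // 2**(a-i) = exp(f), f in [-log(2)/2, +log(2)/2]
    i = (int)j;
    // my_expf's core polynomial for exp(f)
    r = 1.37805939e-3f;
    r = fmaf (r, f, 8.37312452e-3f);
    r = fmaf (r, f, 4.16695364e-2f);
    r = fmaf (r, f, 1.66664720e-1f);
    r = fmaf (r, f, 4.99999851e-1f);
    r = fmaf (r, f, 1.00000000e+0f);
    r = fmaf (r, f, 1.00000000e+0f);
    // scale by 2**i in two steps, as in my_expf
    ia = (i > 0) ? 0 : 0x83000000;
    s = uint32_as_float (0x7f000000 + ia);
    t = uint32_as_float ((i << 23) - ia);
    r = r * s;
    r = r * t;
    // handle severe overflow / underflow of 2**a
    if (fabsf (a) >= 150.0f) r = (a < 0.0f) ? 0.0f : INFINITY;
    return r;
}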

11 Comments

As explained in the description, my_expf() was just used to demonstrate that it does not matter much whether expf() is nearly correctly rounded or more generally faithfully rounded. For what it is worth, expf() vs exp2f() is often a wash in terms of performance, and on some platforms expf() is more highly optimized.
True about the possibilities of expf() vs. exp2f(), yet your my_expf() vs. my_exp2f() can certainly have the slightly faster my_exp2f(), as it needs one less line of code without code increases elsewhere, and thus "reduces the amount of computational work while maintaining the same level of accuracy".
If we were to use exp2f() instead of expf(), we would need a different rational approximation R(x), and it turns out that the function erfc(x)/exp2(x) that needs to be approximated is harder to approximate, requiring higher-degree polynomials, so overall it is a loss performance-wise. At least this is what I found when I tried it.
@njuffa Hmm, that was not expected. IAC, you already have a set up to assess things and I am building one up. It will take some time to validate.
I completely agree that relative accuracy is meaningless for subnormals, for the reason you state. This is why my test framework, as posted, does not record relative accuracy for those cases, and also uses ulp error as a backup. It boils down to a question of presenting the issue, and this is the best way I managed of presenting it, considering that not everybody reading the [C] tag is aware of floating-point details you and I are aware of.

I have to say that erfc(x) put up more of a fight than I was expecting, but I now finally have a solution that is faster, only slightly more complex, and 2x more accurate. It is faster, although that advantage may be lost if @njuffa switches to double-precision arithmetic internally. I'll start another thread about speed, since there are some curiosities about FMA code timings that I really don't understand.

Tests were done in the OP's test harness with a minor modification to time each run.

To obtain a solution I had to modify my version of the ARM Remez code and use a custom weighting to get the errors right down at the high end of the range. Incidentally, reducing the range slightly to 10.0 (or, looking at @chux's graph, perhaps as low as 9.5) will give full 5-figure relative accuracy.

The function I modelled as a rational P(x)/Q(x) was

 (erfc(x)/exp(-x^2)-1)/x (excluding x==0)
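
Inverting this model gives the reconstruction erfc(x) = exp(-x^2) * (1 + x * P(x)/Q(x)), which is what the code below computes.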

The formal solution obtained by ARMremez code (modified) is:

 iter = 6 maxerr = 3.28148019434200456568158011254156762066636830e-06 
 (1.0+x*(0.8991604049044439664180350803894074305+ 
 x*(0.2900325421849880048302980017673262815200+
 x*(-6.0399785602952803731803658009133843823e-06))))
 /(-0.886123058281072972284447446285224409290+
 x*(-1.58327531146239316863922987677501888438287+
 x*(-1.065217231856749807713459915046806520098096+
 x*(-0.28983820234089701058660284313744428693980))))

Coefficients computed to 80 digits and truncated to indicated accuracy.

This is the code that resulted, with coefficients computed to double-precision digits but truncated to float by the compiler (note the trailing f suffixes). I haven't been able to obtain a working pure-float implementation (although it may be possible). There may be no point, since doubles are faster!

The accuracy isn't quite what the Remez algorithm predicts but I have spent no time trying to look for a better coefficient set. The resulting C function is:

float d_fast_erfcf(float x)
// computation entirely in double precision to rule out rounding glitches
// coefficients declared as floats (remove trailing f to get doubles)
{
    double a, e, r, s, c, p, q;
    a = fabs(x);
    c = fmin(a, 10.5);
    s = -c * c;
    e = exp(s);
    p = 1.0f + c * (0.89916040490444396641803508f + c * (0.29003254218498800483029800f + c * (-6.039978560295280373180365800e-06f)));
    q = -0.88612305828107297228444744628522f + c * (-1.5832753114623931686392298767750f + c * (-1.0652172318567498077134599150f + c * (-0.2898382023408970105866028431f)));
    r = e * (1.0 + c * p / q);
    if (x < 0.0) r = 2.0 - r;
    if (isnan(x)) r = x + x;
    return (float) r;
}
// Results using MS Visual C/C++ 2022
// Pure float implementation is no good
// diff count 0 time taken 168.5
//
// max rel err = 1.28762e-05 @ 9.12920189 0x1.24226cp+3
// max abs err = 3.60758e-06 @ -0.0615047403 -0x1.f7d8cap-5
// max ulp err = 200 @ 8.77462292 0x1.18c9b6p+3
//
// float coefficients with double arithmetic is best buy
// diff count 0 time taken 108.1
// max rel err = 5.56595e-06 @ 6.25805283 0x1.9083f0p+2
// max abs err = 3.543e-06 @ -0.0637716576 -0x1.05356ep-4
// max ulp err = 128 @ 0.813420653 0x1.a078acp-1
//
// pure double coefficients and double arithmetic
// diff count 0 time taken 108.4
//
// max rel err = 5.5498e-06 @ 9.15682793 0x1.2504bcp+3
// max abs err = 3.54332e-06 @ -0.0638476089 -0x1.058512p-4
// max ulp err = 128 @ 0.813420653 0x1.a078acp-1
//
// OP original version
// diff count 0 time taken 195.34
//
// max rel err = 1.06016e-05 @ 1.12632227 0x1.2056a8p+0
// max abs err = 9.4629e-06 @ -0.0960391834 -0x1.896062p-4
// max ulp err = 192 @ 8.45171871e-06 0x1.1b97b6p-17
//
// Full disclosure quick test on Intel C++ 2023 full doubles
// diff count 0 time taken 252.868
// max rel err = 5.5498e-06 @ 9.15682793 0x1.2504bcp+3
// max abs err = nan @ -0.0638476089 -0x1.058512p-4
// max ulp err = 128 @ 0.813420653 0x1.a078acp-1
//
// I don't like the look of max abs error = nan !

The other quirk is that Intel uses concise FMA instructions whereas Microsoft uses separate multiply and add. The latter apparently runs quicker here!

I hope this provides you with a new set of coefficients to play with. I'd be interested to know what your rational Remez optimiser gets for that function.

answered Jan 12, 2024 at 18:39

13 Comments

At first look the improved accuracy derives from effectively adding one more term to the approximation, at the additional cost of one 1M + 1A?
For a pure float implementation using FMAs, my tuner converged to: /* max rel err = 9.323338e-6 @ 6.42245102; max abs err = 3.633895e-6 @ -6.50273114e-2; max ulp err = 152.7799 @ 6.42245102 */ p = -6.03997978e-6f; p = fmaf (p, c, 2.90032744e-1f); p = fmaf (p, c, 8.99160624e-1f); p = fmaf (p, c, 1.0f); q = -2.89838761e-1f; q = fmaf (q, c, -1.06521523e+0f); q = fmaf (q, c, -1.58327532e+0f); q = fmaf (q, c, -8.86123061e-1f); r = fmaf (p / q, c, 1.0f) * e; As one would expect, increasing the degree of the approximation by one improves the accuracy.
Pretty much. Anything fancier that I tried works well for modest x but goes haywire when you extend the range beyond x~5 (or vice versa). This series looked promising and fairly convergent but I can't tame it with my Remez code at present. (Maxima): (%i14) taylor((erfc(x)-1)*sqrt(%pi)/2+(exp(-x^2)-1+x^2)/2/x, x, 0, 16); (%o14)/T/ -x+(7*x^3)/12-(11*x^5)/60+(5*x^7)/112-(19*x^9)/2160+(23*x^11)/15840-(3*x^13)/14560+(31*x^15)/1209600-... BTW how do you tweak the coefficients that Remez spits out? I'm using simulated annealing but it takes too long! I'd love to have an LLL/PLSQ code for this step!
For a pure float implementation without FMAs, my tuner returned: /* max rel err = 9.721465e-6 @ 6.52076149; max abs err = 3.647192e-6 @ -6.27824590e-2; max ulp err = 156.4300 @ 3.65526581 */ p = -6.03997751e-6f; p = p * c + 2.90031374e-1f; p = p * c + 8.99160922e-1f; p = p * c + 1.0f; q = -2.89837450e-1f; q = q * c - 1.06521440e+0f; q = q * c - 1.58327568e+0f; q = q * c - 8.86122882e-1f; r = ((p / q) * c + 1) * e; Increasing the degree of the approximation adds FP operations and lengthens the dependency chain; something I do not desire. May be useful to others, so (+1).
I am also using simulated annealing in my tuner. The trick is finding the "best" cooling rates and re-heating strategy. My heuristics often work well, but when creating approximations for answers to other questions I have at times had the same issue you seem to have here: While my simulated annealing delivered "meh!" results, other people's simulated annealing provided great results.So a bit of a black art, I am afraid.

This is a bit late but rather interesting. I'm slowly getting to grips with GCC under Linux and finally got around to compiling your test harness and my various attempts together with system erfc() and erfcf() as references:

GCC -mavx2 -march=native -O2

There is a big surprise! The GCC system library erfcf() is the fastest implementation of all: an order of magnitude more accurate, too, and 25% faster than the custom code! It is about 10% faster still with -ffast-math.

I wonder how they have done that?

answered Jan 26, 2024 at 10:46

1 Comment

I don't know. I didn't find it on a first try, but then I got this link to Sun's version in erff.c. That doesn't look to me like it should be faster, but looks can be deceptive. There is a second oddity, which I will post about in a related thread. One form of my algorithm is almost as fast, but only on the MS C compiler. Intel code generation looks good and is more accurate (FMA), but slowest of the lot! My own attempt at SIMD FMA vectorisation of erfcf/exp in parallel failed miserably :(
