
Denote by RD(f) and RU(f) the computed approximations obtained by evaluating the function f in floating-point arithmetic with rounding downwards and rounding upwards, respectively.

Assume we know from rounding error analysis that

| RD(f)- f | < E, and

| RU(f)- f | < E

What is the bound for the difference between RD(f) and RU(f),

| RD(f)- RU(f) | < E, or

| RD(f)- RU(f) | < 2E ?
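
For concreteness, RD(f) and RU(f) can be obtained on an IEEE-754 machine by evaluating the same expression twice under different rounding modes with C99's fesetround. The sketch below uses an arbitrary placeholder expression f = 1/3 + 1/7 and an illustrative helper name eval_f; it assumes the compiler actually honors the dynamic rounding mode (no value-unsafe floating-point optimizations; on glibc, link with -lm).

    #include <fenv.h>
    #include <stdio.h>

    #pragma STDC FENV_ACCESS ON

    /* Evaluate a placeholder expression f = a/b + c/d under the given rounding mode.
       The operands are volatile so the divisions are performed at run time,
       after the rounding mode has been switched. */
    static double eval_f(int rounding_mode)
    {
        volatile double a = 1.0, b = 3.0, c = 1.0, d = 7.0;
        fesetround(rounding_mode);
        double f = a / b + c / d;
        fesetround(FE_TONEAREST);   /* restore the default mode */
        return f;
    }

    int main(void)
    {
        double rd = eval_f(FE_DOWNWARD);
        double ru = eval_f(FE_UPWARD);
        printf("RD(f) = %.17g\nRU(f) = %.17g\n|RD(f) - RU(f)| = %.3g\n", rd, ru, ru - rd);
        return 0;
    }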

[UPD] In addition to the comments:

Consider a "toy" decimal floating point system with p = 4 (precision, the total number of digits in the significand, including one digit to the left of the radix point) and with an unbounded exponent. For this system, the unit roundoff, u, is defined as follows:

u = 1/2 * 10^{1-4} = 0.0005 for the round-to-nearest mode,

u = 10^{1-4} = 0.001 for any of the directed rounding modes.

Let us suppose f = (1.324/1.567 + 1.641/1.878) needs to be computed in such a system.

The exact value of f is 1.7187285282921926...

The error analysis shows that

| RD(f) - f | <= E, and

| RU(f) - f | <= E,

where E = n * u * (|1.324/1.567| + |1.641/1.878|), and, as noted above, u = 0.001.

So,

E = 2 * 0.001 * 1.7187285282921926 = 0.0034374570565843852

(this estimate is a bit rough, because the value of f used here was rounded).

Let us now calculate RD(f) and RU(f):

RD(f) = RD(0.8449266113592853 + 0.8738019169329073) = RD(0.844 + 0.873) = 1.717

RU(f) = RU(0.8449266113592853 + 0.8738019169329073) = RU(0.845 + 0.874) = 1.719

So,

|RD(f) - f| = 0.0017285282921926

|RU(f) - f| = 0.0002714717078074

and

|RD(f) - RU(f)| = 0.002 < 0.0034374570565843852

From this I suppose that |RD(f) - f| = E only if |RU(f) - f| = 0, and vice versa. Thus,

|RD(f) - RU(f)| <= E.

Or is something wrong in this example?
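
One rough way to double-check these hand calculations is to mechanize the 4-significant-digit rounding on top of ordinary binary doubles. The helper below (round4 is just a made-up name) only emulates the toy decimal system approximately, since the quotients are first computed in binary double precision, so values landing exactly on a rounding boundary could in principle be perturbed; compile with -lm where needed.

    #include <math.h>
    #include <stdio.h>

    /* Round a positive x to 4 significant decimal digits, toward -infinity
       (down != 0) or toward +infinity (down == 0). Approximate emulation only:
       the scaling and comparison are done in binary double precision. */
    static double round4(double x, int down)
    {
        double e     = floor(log10(x));      /* decimal exponent of x           */
        double scale = pow(10.0, 3.0 - e);   /* shift so x has 4 integer digits */
        double m     = down ? floor(x * scale) : ceil(x * scale);
        return m / scale;
    }

    int main(void)
    {
        /* f = 1.324/1.567 + 1.641/1.878 with every operation rounded to 4 digits */
        double rd = round4(round4(1.324 / 1.567, 1) + round4(1.641 / 1.878, 1), 1);
        double ru = round4(round4(1.324 / 1.567, 0) + round4(1.641 / 1.878, 0), 0);
        printf("RD(f) = %.4g\nRU(f) = %.4g\n|RD(f) - RU(f)| = %.4g\n", rd, ru, ru - rd);
        return 0;
    }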

asked Apr 1, 2018 at 14:02
  • The error analysis in the example is not correct. The maximum error when rounding down (or up) in a division a/b where the quotient is in [1/10, 1) is u / 10, not u, since the quotients have a lower exponent than 1 does. Additionally, it appears only the two division operations have been considered, but the addition has a rounding error too, particularly since the sum has a larger exponent (0) than the two things being added (both −1). Also, one cannot simply multiply the number of operations n by the "unit roundoff" u, since the "unit roundoff" varies with the result exponent. Commented Apr 2, 2018 at 13:11
  • For these specific values, a bound on the error is u / 10 for each division and u for the addition, so E = 1.2 • u. Then the proper evaluation of RD(f) is RD(.8449 + .8738) = RD(1.7187) = 1.718, and RU(f) = RU(.8450 + .8739) = RU(1.7189) = 1.719. They happen to differ by less than E, but that is not true in general. Commented Apr 2, 2018 at 13:16
  • @EricPostpischil For error analysis, I used the following paper: "C.-P. Jeannerod and S.M. Rump. Improved error bounds for inner products in floating-point arithmetic. SIAM J. Matrix Anal. Appl." (ti3.tuhh.de/paper/rump/JeaRu13.pdf). In this paper, an error bound is given for inner products (almost identical problem). The authors define the unit roundoff, u, as 1/2 * b^{1-p} for rounding to nearest, where b is the radix (b = 10 for the decimal system). For the directed roundings, u is doubled. Here the unit roundoff is not a unit in the last place (ulp). Commented Apr 2, 2018 at 13:42
  • (a) The rounding error used in that paper for a sum of products is ((1+u)^n−1)•f, not n•u•f. (b) That is a bound on the error, not the bound on the error. For simplicity, it uses a bound on the rounding error for t as a continuous function t•(1 + δ). In fact, a better bound is fixed for a given floating-point exponent and jumps when the exponent changes. But that is harder to work with mathematically. (c) This is not really relevant to your question... Commented Apr 2, 2018 at 15:19
  • I suspect what you are getting at is that, since each rounding error occurs within an interval bounded by two representable numbers, say of length u, then if the rounding down uses up some amount x of that interval, the rounding up uses u − x, so the difference between the rounded-down and rounded-up results is at most u. That is true for one operation. But after multiple operations, the rounding-down computation may be dealing with some value td while the rounding-up computation may be dealing with some value tu, and td and tu are no longer in the same interval between representable numbers. Commented Apr 2, 2018 at 15:21

1 Answer


Let u be the difference between 1 and the next representable number greater than 1. (This is the Unit of Least Precision [ULP] of 1, the value of the least significant bit in the significand for 1 in the floating-point format.)

Consider the function f(x) = (4 − (x + 1⁄2u) − 3) / (1⁄2u). The exact mathematical value of f(1) is −1, but the computed value with rounding down is 0, and the computed value with rounding up is also 0:

  • When rounding down, 1 + 1⁄2u produces 1, then 4−1 produces 3, 3−3 produces 0, and 0 / (1⁄2u) produces 0.
  • When rounding up, 1 + 1⁄2u produces 1+u, then 4−(1+u) is exactly 3−u but must round up to 3 because 3−u is not representable (it is between 3−2u and 3, since the ULP in [2, 4) is twice the ULP in [1, 2)); then 3−3 produces 0, and 0 / (1⁄2u) produces 0.

Thus, for this function on the domain x∈{1}, we have an error bound E = 1 such that |RD(f) − f| ≤ E and |RU(f) − f| ≤ E, but |RD(f) − RU(f)| = 0.
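
This behavior can be checked directly in binary64, where DBL_EPSILON is exactly the u defined above. A minimal C99 sketch, assuming the compiler honors fesetround (no value-unsafe floating-point optimizations):

    #include <fenv.h>
    #include <float.h>
    #include <stdio.h>

    #pragma STDC FENV_ACCESS ON

    /* f(x) = (4 - (x + u/2) - 3) / (u/2), evaluated one operation at a time
       under the given rounding mode. */
    static double f(double x, int mode)
    {
        const double hu = DBL_EPSILON / 2.0;  /* u/2; DBL_EPSILON is u for binary64      */
        volatile double xv = x;               /* keep x out of compile-time folding      */
        fesetround(mode);
        volatile double t = xv + hu;          /* for x = 1: 1 (down) or 1+u (up)         */
        volatile double s = 4.0 - t;          /* 3 either way (3-u must round up to 3)   */
        volatile double r = (s - 3.0) / hu;   /* 0 either way                            */
        fesetround(FE_TONEAREST);
        return r;
    }

    int main(void)
    {
        printf("RD(f(1)) = %g\n", f(1.0, FE_DOWNWARD));  /* prints 0, exact value is -1 */
        printf("RU(f(1)) = %g\n", f(1.0, FE_UPWARD));    /* prints 0 as well            */
        return 0;
    }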

In contrast, consider the function f(x) = (x + 1⁄2u − 1) / (1⁄2u). Again the exact mathematical value of f(1) is 1, but now the computed value with rounding down is 0 and the computed value with rounding up is 2: when rounding down, 1 + 1⁄2u produces 1, so the numerator is 0 and the quotient is 0; when rounding up, 1 + 1⁄2u produces 1+u, so the numerator is exactly u and the quotient is exactly 2.

So, in this case, we have the same error bound E = 1 such that |RD(f) − f| ≤ E and |RU(f) − f| ≤ E, but now |RD(f) − RU(f)| = 2, so the best limit on |RD(f) − RU(f)| is 2E.
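
The same kind of check for the second function, again only a sketch under the same assumptions:

    #include <fenv.h>
    #include <float.h>
    #include <stdio.h>

    #pragma STDC FENV_ACCESS ON

    /* f(x) = (x + u/2 - 1) / (u/2), evaluated under the given rounding mode. */
    static double f(double x, int mode)
    {
        const double hu = DBL_EPSILON / 2.0;  /* u/2                                 */
        volatile double xv = x;
        fesetround(mode);
        volatile double t = xv + hu;          /* for x = 1: 1 (down) or 1+u (up)     */
        volatile double r = (t - 1.0) / hu;   /* exact: 0 (down) or u/(u/2) = 2 (up) */
        fesetround(FE_TONEAREST);
        return r;
    }

    int main(void)
    {
        printf("RD(f(1)) = %g\n", f(1.0, FE_DOWNWARD));  /* prints 0; exact value is 1 */
        printf("RU(f(1)) = %g\n", f(1.0, FE_UPWARD));    /* prints 2                   */
        return 0;
    }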

So, in general, given |RD(f) − f| ≤ E and |RU(f) − f| ≤ E, the best bound on |RD(f) − RU(f)| can be anywhere from 0 to 2E.

That answers the question in general. In a comment, you ask about f = a1/b1 + a2/b2 + ... + an/bn for positive ai and bi. Given the constraints, and if all the b values are representable, I think every rounding down error must have a negative (toward −∞) effect on the computed result, and every rounding up error must have a positive effect (toward +∞). (If any b value is not representable, its rounding will have the opposite effect on the final result, and the following analysis does not apply.) If E is the best (least) bound such that |RD(f) − f| < E and |RU(f) − f| < E, then it is not possible that |RD(f) − RU(f)| < E, and it is necessary that |RD(f) − RU(f)| < 2E.
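
To illustrate that case, here is a sketch (with made-up example data and an illustrative helper name sum_of_quotients) that evaluates f = a1/b1 + ... + an/bn entirely under each directed rounding mode. When all ai and bi are positive and exactly representable, every rounding acts in one direction, so RD(f) ≤ f ≤ RU(f) and RU(f) − RD(f) is the width of the resulting enclosure. As before, it assumes the compiler honors fesetround.

    #include <fenv.h>
    #include <stdio.h>

    #pragma STDC FENV_ACCESS ON

    /* Sum of quotients a[i]/b[i], every operation performed under one rounding mode. */
    static double sum_of_quotients(volatile const double *a, volatile const double *b,
                                   int n, int mode)
    {
        fesetround(mode);
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i] / b[i];
        fesetround(FE_TONEAREST);
        return s;
    }

    int main(void)
    {
        /* Made-up example data; any positive, exactly representable values will do. */
        volatile const double a[] = { 1.0, 2.0, 3.0, 4.0 };
        volatile const double b[] = { 3.0, 7.0, 11.0, 13.0 };
        const int n = 4;

        double rd = sum_of_quotients(a, b, n, FE_DOWNWARD);  /* RD(f) <= f */
        double ru = sum_of_quotients(a, b, n, FE_UPWARD);    /* f <= RU(f) */
        printf("RD(f)         = %.17g\n", rd);
        printf("RU(f)         = %.17g\n", ru);
        printf("RU(f) - RD(f) = %.3g\n", ru - rd);
        return 0;
    }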

(If you change < to ≤, then, if E is the best bound such that |RD(f) − f| ≤ E and |RU(f) − f| ≤ E, then |RD(f) − RU(f)| ≤ E is possible if and only if E is 0. Obviously this is true if E is 0, meaning the arithmetic is exact. If E is not zero, then one of the computations must have had some non-zero error, and therefore the other did too. And since the errors are necessarily monotonic as the computation progresses, the final errors must remain non-zero and of opposite signs.)

[It turns out I do not need the argument x in f(x); I could have simply used a constant function f as presented in the question. However, I wrote up the demonstration that way before I realized I did not need it.]

answered Apr 1, 2018 at 14:48

5 Comments

For example, let f = a1/b1 + a2/b2 + ... + an/bn, where ai > 0 and bi > 0 for all i = 1,2,...,n. Will the bound then be E? How can this be proved? Maybe you can suggest books that give an error analysis for floating-point intervals?
@KonstantinIsupov: What do you mean by E? Are you using it as a value fixed by the floating point type, like the "machine" epsilon? Or is it some value calculated depending on the sequence of operations?
E is some value calculated depending on the sequence of operations. Namely, I need to compute f = a1/b1 + a2/b2 + ... + an/bn (all ai > 0 and bi > 0) using rounding downwards and rounding upwards. In this case, E = n * u * f, where u is the unit roundoff. For directed roundings u = 2^{1-p}, where p is the precision.
Thank you for the detailed answer. About f = a1/b1 + a2/b2 + ... + an/bn for positive ai and bi: I think that if the rounding downwards gives a large negative error (close to the bound), then the corresponding rounding upwards error (positive) will be small, and vice versa. So, if |RD(f) − f| is close to E, then |RU(f) − f| will be close to 0, and conversely, if |RU(f) − f| is close to E, then |RD(f) − f| will be close to 0. Thus, |RD(f) − RU(f)| < E. Or is that wrong?
I added an example to the post
