How to calculate min/max values of floating point numbers?

Question 1

I'm trying to calculate the min/max, or the lowest to highest value range of a 48 bit floating point type MIL-STD-1750A (PDF) (WIKI).

Ex: How a double range is 1.7E +/- 308

I've looked around for equations, and am unsure if what I have found will work.

The first equation I found was first equation

The second was second equation

I'm not quite sure where to begin with these, if they are even correct in what I need.

Will someone impart their knowledge to me and help solve this?

Question 2

OK, so it looks like you have a 48 bit FP number implemented on 16 bit hardware. What exactly do you mean by "find the min/max value"?

Question 3

anything wrong with a<b?a:b and a>b?a:b

Question 4

I too am unclear. Are you trying to compare two numbers within a given epsilon? Are you trying to determine the possible range which might be within the margin of error of that value? Something else? We need more context.

Question 5

Sorry, I will edit to question to clarify. I would like to calculate the range in which a 48 bit FP value, formatted in MIL-STD-1750A can be, from the minimum and the maximum value.

Question 6

For 32-bit floating point, the maximum value is shown in Table III:

0.9999998 x 2^127 represented in hex as: mantissa=7FFFFF, exponent=7F.

We can decompose the mantissa/exponent into a (close) decimal value as follows:

7FFFFF <base-16> = 8,388,607 <base-10>.

There are 23 bits of significance, so we divide 8,388,607 by 2^23.

8,388,607 / 2^23 = 0.99999988079071044921875 (see Table III)

as far as the exponent:

7F <base-16> = 127 <base-10>

and now we multiply the mantissa by 2^127 (the exponent)

8,388,607 / 2^23 * 2^127 = 
8,388,607 * 2^104 = 1.7014116317805962808001687976863 * 10^38

This is the largest 32-bit floating point value because the largest mantissa is used and the largest exponent.

The 48-bit floating point adds 16 bits of lessor significance mantissa but leaves the exponent the same size. Thus, the max value would be represented in hex as

mansissa=7FFFFFFFFF, exponent=7F.

again, we can compute

7FFFFFFFFF <base-16> = 549,755,813,887 <base-10>

the max exponent is still 127, but we need to divide by [23+16=39, so:] 2^39. 127-39=88, so just multiply by 2^88:

549,755,813,887 * 2^88 =
1.7014118346015974672186595864716 * 10^38

This is the largest 48-bit floating point value because we used the largest possible mantissa and largest possible exponent.

So, the max values are:

1.7014116317805962808001687976863 * 10^38, for 32-bit, and,
1.7014118346015974672186595864716 * 10^38, for 48-bit

The max value for 48-bit is just slightly larger than for 32-bit, which stands to reason since a few bits are added to the end of the mantissa.

(To be exact the maximum number for the 48-bit format can be expressed as a binary number that consists of 39 1's followed by 88 0's.)

(The smallest is just the negative of this value. The closest to zero without being zero can also easily be computed as per above: use the smallest possible (positive) mantissa:0000001 and the smallest possible exponent: 80 in hex, or -128 in decimal)

FYI

Some floating point formats use an unrepresented hidden 1 bit in the mantissa (this allows for one extra bit of precision in the mantissa, as follows: the first binary digit of all numbers (except 0, or denormals, see below) is a 1, therefore we don't have to store that 1, and we have an extra bit of precision). This particular format doesn't seem to do this.

Other floating point formats allow denormalized mantissa, which allows representing (positive) numbers smaller than smallest the exponent, by trading bits of precision for additional (negative) powers of 2. This easy to support if it doesn't also support the hidden one bit, a bit harder if it does.

8,388,607 / 2^23 is the value you'd get with mantissa=0x7FFFFF and exponent=0x00. It is not the single bit value but rather the value with a full mantissa and a neutral, or more specifically, a zero exponent.

The reason this value is not directly 8388607, and requires division (by 2^23 and hence is less than what you might expect) is that the implied radix point is in front of the mantissa, rather than after it. So, think +/-.111111111111111111111 (a sign bit, followed by a radix point, followed by twenty-three 1-bits) for the mantissa and +/-111111111111 (no radix point here, just an integer, in this case, 127) for the exponent.

mantissa = 0x7FFFFF with exponent = 0x7F is the largest value which corresponds to 8388607 * 2 ^ 104, where the 104 comes from 127-23: again, subtracting 23 powers of two because the mantissa has the radix point at the beginning. If the radix point were at the end, then the largest value (0x7FFFFF,0x7F) would indeed be 8,388,607 * 2 ^ 127.

Among others, there are possible ways we can consider a single bit value for the mantissa. One is mantissa=0x400000, and the other is mantissa=0x000001. without considering the radix point or the exponent, the former is 4,194,304, and the latter is 1. With a zero exponent and considering the radix point, the former is 0.5 (decimal) and the latter is 0.00000011920928955078125. With a maximum (or minimum) exponent, we can compute max and min single bit values.

(Note that the latter format where the mantissa has leading zeros would be considered denormalized in some number formats, and its normalized representation would be 0x400000 with an exponent of -23).

Question 7

I have one doubt. By doing 8,388,607 / 2^23 gives the what a single bit in the mantissa can represent. So how 8,388,607 / 2^23 * 2^127 does represent the maximum value ?

Question 8

@hariprasad, I will put a postscript on the answer, as it is too hard to explain in the comment format.

Question 9

where is the minimum value? it's requested in the title!

Question 10

@CharlieParker, unlike integer representations where min and max are offset, floating point formats use sign&magnitude representations, where the magnitude is composed of an exponent and a mantissa. The max number in sign magnitude is then (1) positive (sign bit is 0) and (2) the largest possible exponent and (3) largest possible mantissa. The min number is identical: except the sign bit changes, so it is (1) negative (sign bit is 1), yet (2) the largest possible exponent, and (3) the largest possible mantissa. As I explained above, the min number is just -max, so just throw on a - sign.

Question 11

You can borrow from how the IEEE floating point is laid out for fast comparison: sign, exponent, mantissa. however in that PDF I see mantissa and exponent are reversed.

This means that to compare you'll have to first check the sign bit and if one is not the winner yet you compare the exponents and then you compare the mantissa.

If one is positive and the other is negative then the positive is the max.

If both are positive and one exponent is larger then it is the max (if both are negative then it is the min)

Similarly for mantissa.

Erik Eidt Erik Eidt 34.8k6 gold badges61 silver badges95 bronze badges · Accepted Answer · 2015-08-21 17:05:09Z

For 32-bit floating point, the maximum value is shown in Table III:

0.9999998 x 2^127 represented in hex as: mantissa=7FFFFF, exponent=7F.

We can decompose the mantissa/exponent into a (close) decimal value as follows:

7FFFFF <base-16> = 8,388,607 <base-10>.

There are 23 bits of significance, so we divide 8,388,607 by 2^23.

8,388,607 / 2^23 = 0.99999988079071044921875 (see Table III)

as far as the exponent:

7F <base-16> = 127 <base-10>

and now we multiply the mantissa by 2^127 (the exponent)

8,388,607 / 2^23 * 2^127 = 
8,388,607 * 2^104 = 1.7014116317805962808001687976863 * 10^38

This is the largest 32-bit floating point value because the largest mantissa is used and the largest exponent.

The 48-bit floating point adds 16 bits of lessor significance mantissa but leaves the exponent the same size. Thus, the max value would be represented in hex as

mansissa=7FFFFFFFFF, exponent=7F.

again, we can compute

7FFFFFFFFF <base-16> = 549,755,813,887 <base-10>

the max exponent is still 127, but we need to divide by [23+16=39, so:] 2^39. 127-39=88, so just multiply by 2^88:

549,755,813,887 * 2^88 =
1.7014118346015974672186595864716 * 10^38

This is the largest 48-bit floating point value because we used the largest possible mantissa and largest possible exponent.

So, the max values are:

1.7014116317805962808001687976863 * 10^38, for 32-bit, and,
1.7014118346015974672186595864716 * 10^38, for 48-bit

The max value for 48-bit is just slightly larger than for 32-bit, which stands to reason since a few bits are added to the end of the mantissa.

(To be exact the maximum number for the 48-bit format can be expressed as a binary number that consists of 39 1's followed by 88 0's.)

(The smallest is just the negative of this value. The closest to zero without being zero can also easily be computed as per above: use the smallest possible (positive) mantissa:0000001 and the smallest possible exponent: 80 in hex, or -128 in decimal)

FYI

Some floating point formats use an unrepresented hidden 1 bit in the mantissa (this allows for one extra bit of precision in the mantissa, as follows: the first binary digit of all numbers (except 0, or denormals, see below) is a 1, therefore we don't have to store that 1, and we have an extra bit of precision). This particular format doesn't seem to do this.

Other floating point formats allow denormalized mantissa, which allows representing (positive) numbers smaller than smallest the exponent, by trading bits of precision for additional (negative) powers of 2. This easy to support if it doesn't also support the hidden one bit, a bit harder if it does.

8,388,607 / 2^23 is the value you'd get with mantissa=0x7FFFFF and exponent=0x00. It is not the single bit value but rather the value with a full mantissa and a neutral, or more specifically, a zero exponent.

The reason this value is not directly 8388607, and requires division (by 2^23 and hence is less than what you might expect) is that the implied radix point is in front of the mantissa, rather than after it. So, think +/-.111111111111111111111 (a sign bit, followed by a radix point, followed by twenty-three 1-bits) for the mantissa and +/-111111111111 (no radix point here, just an integer, in this case, 127) for the exponent.

mantissa = 0x7FFFFF with exponent = 0x7F is the largest value which corresponds to 8388607 * 2 ^ 104, where the 104 comes from 127-23: again, subtracting 23 powers of two because the mantissa has the radix point at the beginning. If the radix point were at the end, then the largest value (0x7FFFFF,0x7F) would indeed be 8,388,607 * 2 ^ 127.

Among others, there are possible ways we can consider a single bit value for the mantissa. One is mantissa=0x400000, and the other is mantissa=0x000001. without considering the radix point or the exponent, the former is 4,194,304, and the latter is 1. With a zero exponent and considering the radix point, the former is 0.5 (decimal) and the latter is 0.00000011920928955078125. With a maximum (or minimum) exponent, we can compute max and min single bit values.

(Note that the latter format where the mantissa has leading zeros would be considered denormalized in some number formats, and its normalized representation would be 0x400000 with an exponent of -23).

I have one doubt. By doing 8,388,607 / 2^23 gives the what a single bit in the mantissa can represent. So how 8,388,607 / 2^23 * 2^127 does represent the maximum value ?
@hariprasad, I will put a postscript on the answer, as it is too hard to explain in the comment format.
@CharlieParker, unlike integer representations where min and max are offset, floating point formats use sign&magnitude representations, where the magnitude is composed of an exponent and a mantissa. The max number in sign magnitude is then (1) positive (sign bit is 0) and (2) the largest possible exponent and (3) largest possible mantissa. The min number is identical: except the sign bit changes, so it is (1) negative (sign bit is 1), yet (2) the largest possible exponent, and (3) the largest possible mantissa. As I explained above, the min number is just -max, so just throw on a - sign.

Stack Exchange Network

How to calculate min/max values of floating point numbers?

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

How to calculate min/max values of floating point numbers?

2 Answers 2

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions