I'm confused by how a computer rounds off the last digit in a floating point representation. For example, I'm told that if x = 1.24327789 is stored in a computer with a 6-digit capacity, then its floating point representation would be x = 0.124328 × 10^1, where the last digit has clearly been rounded.
My confusion is about how the computer can have the capacity to round this last digit if it doesn't have a 7-digit capacity with which to know the 'last' digit.
I probably have a half-assed way of understanding this representation, but I really have no background in CompSci.
- For calculations it is not uncommon for the CPU/FPU to use more bits internally and convert back to the width of the original operands (rounding in the process) before yielding the result. This is what 80-bit floating point is for: to prevent 64-bit calculations from losing accuracy. – Martin Maat, Aug 29, 2016 at 22:05
3 Answers
With a few odd exceptions, a floating point number is stored in binary in the format known as IEEE 754. These are most often 32-bit (single precision) and 64-bit (double precision) representations. The 32-bit representation can store approximately 7 decimal digits, but remember that the underlying representation is binary.
The representation of 1.24327789₁₀ is actually 00111111100111110010001110111011 as a single precision IEEE 754 floating point number in binary.
This is made up of three parts:
- The sign bit (0, indicating it is positive)
- The exponent (01111111, which is 127), giving 2^(127−127) = 2^0
- The mantissa (00111110010001110111011), which has an implicit leading 1.
This gives us +2^0 × 1.00111110010001110111011₂, which then gives you your number. If you look at the first couple of bits there, 1.0011111₂, you will see that this is rather close to 1.25₁₀, or 1.01₂.
On reading binary numbers past the binary (not decimal) point: just as 1001₂ represents 1×2^3 + 0×2^2 + 0×2^1 + 1×2^0, the value 1.011₂ represents 1×2^0 + 0×2^(−1) + 1×2^(−2) + 1×2^(−3), or 1 + 1/4 + 1/8 = 1.375₁₀.
Now, that conversion I did a bit above: I grabbed it from an IEEE 754 converter, because doing it by hand is tedious (it's typically a good part of an assignment at the college level).
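If you'd like to replicate such a converter yourself, a minimal Python sketch (standard library only) can pull the bit pattern apart; the slicing below assumes the usual single precision layout of 1 sign bit, 8 exponent bits, and 23 mantissa bits:

    import struct

    # Pack the number as a big-endian IEEE 754 single precision float,
    # then reinterpret the same 4 bytes as an unsigned 32-bit integer.
    bits = struct.unpack('>I', struct.pack('>f', 1.24327789))[0]
    pattern = f'{bits:032b}'

    sign = pattern[0]        # '0'
    exponent = pattern[1:9]  # '01111111', i.e. 127, biased by 127
    mantissa = pattern[9:]   # '00111110010001110111011', implicit leading 1

    print(pattern)                 # 00111111100111110010001110111011
    print(int(exponent, 2) - 127)  # 0, so the scale factor is 2^0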
Rounding is actually a big deal. As described in Lecture Notes on the Status of IEEE Standard 754 for Binary Floating-Point Arithmetic from '97, rounding issues abounded in the 70s.
The number 1.24327789 in binary is 1.0011111001000111011101011011010101100011110011111000100000111...₂

So, the leading 1 is assumed and the mantissa is the first 23 bits of that:

         1         2   |
12345678901234567890123v
0011111001000111011101011011010101100011110011111000100000111

And you see at the arrow that this number should be rounded up, which gives us 00111110010001110111011₂, the mantissa from above. And that's how it is represented and rounded. You should note that since this is rounded up, the stored value is slightly greater than the original, and closer to 1.243277907371521₁₀.
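You can watch this rounding happen with a short Python sketch: the struct round trip below yields the float32 value back as a Python double (which holds it exactly, since doubles are wider), and Decimal prints the stored value exactly:

    import struct
    from decimal import Decimal

    x = 1.24327789

    # Round-trip through IEEE 754 single precision.
    x32 = struct.unpack('>f', struct.pack('>f', x))[0]

    print(Decimal(x32))  # 1.24327790737152099609375
    print(x32 > x)       # True: the mantissa was rounded up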
- Rounding is also part of the reason why FP math circuitry is an order of magnitude more complex than the circuitry for integer math. Even with the massive improvements to speed and accuracy in the past 20 years, it is still ridiculously complicated to handle FP numbers in hardware. – user22815, Jun 5, 2015 at 4:11
- @Snowman guard digits and such are challenging and weren't consistent for a while. See also the Rounding Error section and Guard Digits. – user40980, Jun 5, 2015 at 14:57
If your computer uses IEEE 754 single precision floating point numbers (as most computers do), then it uses a representation where a number x is represented by a sign (+1 or −1), a mantissa, which is an integer ≥ 2^23 and ≤ 2^24 − 1, and a binary exponent b; the number represented is sign × mantissa × 2^b. (There are some details not mentioned here.)
A number between 1 and 2 has exponent b = −23, so that mantissa × 2^b lies between 1 and 2. Since the mantissa is an integer, such a number is a multiple of 2^−23, and you can calculate that 2^−23 is a bit over 0.000 000 119.
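To make this concrete, here is a minimal Python sketch of the decomposition (the bit masks assume the standard single precision layout, and this handles normal numbers only):

    import struct

    def decompose_float32(x):
        # Returns (sign, mantissa, b) with value == sign * mantissa * 2**b,
        # where 2**23 <= mantissa <= 2**24 - 1 (normal numbers only).
        bits = struct.unpack('>I', struct.pack('>f', x))[0]
        sign = -1 if bits >> 31 else 1
        mantissa = (bits & 0x7FFFFF) | 0x800000  # restore the implicit 1
        b = ((bits >> 23) & 0xFF) - 127 - 23     # unbias, then scale so the mantissa is an integer
        return sign, mantissa, b

    print(decompose_float32(1.24327789))  # (1, 10429371, -23)

Consecutive representable numbers between 1 and 2 differ by one unit in the mantissa, i.e. by exactly 2^−23.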
I think the other answers don't actually address the point of this question.
If you input the number 1.24327789, then at first it's simply a sequence of characters or keystrokes. In order to turn this into a numerical representation, a compiler or interpreter has to convert it. This program understands decimal representations and can produce a standard floating point binary representation. In fact, for its internal purposes, it can first convert the input to a higher precision binary representation than what can be stored later on, and then round that off for the final representation.
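As a rough Python sketch of that pipeline (the explicit double-then-float32 two-step here is only an illustration of "convert at higher precision, then round"; a real compiler or interpreter may do it differently):

    import struct
    from fractions import Fraction

    text = "1.24327789"

    exact = Fraction(text)   # the exact rational value of the decimal text
    as_double = float(text)  # parsed and rounded once to the nearest double
    as_single = struct.unpack('>f', struct.pack('>f', as_double))[0]

    print(Fraction(as_double) - exact)  # tiny: doubles have plenty of spare precision
    print(Fraction(as_single) - exact)  # larger: the float32 rounding error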
- I apologize if this sounds naïve, it kind of is, but why bother with rounding the binary representation? Why not simply truncate? – RWolfe, Jan 5, 2022 at 8:51
- To squeeze a bit more precision out of the process. – isarandi, Jan 5, 2022 at 23:16
- But isn't it the act of rounding that causes the error? I personally would think that storing the value accurately would be more important. – RWolfe, Jan 6, 2022 at 7:28
- Truncation causes larger error. Imagine (in decimal) that the exact value would be 3.68, but we can only store two significant digits. Truncation leads to 3.6, while rounding gives 3.7. Since 3.7 is closer to 3.68 than 3.6 is, we should use rounding. – isarandi, Jan 6, 2022 at 17:36
- I understand that, but I was specifically referring to the binary rounding, not decimal rounding, unless I'm misunderstanding your response? – RWolfe, Jan 7, 2022 at 1:16