I'm reading and writing half-precision floating-point numbers in C#. These are 16-bit floats, as opposed to the usual 32-bit floats and 64-bit doubles we are used to working with.
I've taken some highly tested Java code from an obvious "expert on the subject" here and modified it to work with C#. Is this correct?
// ignores the higher 16 bits
public static float toFloat( int hbits )
{
    int mant = hbits & 0x03ff;            // 10 bits mantissa
    int exp = hbits & 0x7c00;             // 5 bits exponent
    if( exp == 0x7c00 )                   // NaN/Inf
        exp = 0x3fc00;                    // -> NaN/Inf
    else if( exp != 0 )                   // normalized value
    {
        exp += 0x1c000;                   // exp - 15 + 127
        if( mant == 0 && exp > 0x1c400 )  // smooth transition
            return BitConverter.ToSingle(BitConverter.GetBytes(
                ( hbits & 0x8000 ) << 16 | exp << 13 | 0x3ff ), 0);
    }
    else if( mant != 0 )                  // && exp==0 -> subnormal
    {
        exp = 0x1c400;                    // make it normal
        do {
            mant <<= 1;                   // mantissa * 2
            exp -= 0x400;                 // decrease exp by 1
        } while( ( mant & 0x400 ) == 0 ); // while not normal
        mant &= 0x3ff;                    // discard subnormal bit
    }                                     // else +/-0 -> +/-0
    return BitConverter.ToSingle(BitConverter.GetBytes( // combine all parts
        ( hbits & 0x8000 ) << 16          // sign << ( 31 - 15 )
        | ( exp | mant ) << 13 ), 0);     // value << ( 23 - 10 )
}
// returns all higher 16 bits as 0 for all results
public static int fromFloat( float fval )
{
    int fbits = BitConverter.ToInt32(BitConverter.GetBytes(fval), 0);
    int sign = fbits >>> 16 & 0x8000;          // sign only
    int val = ( fbits & 0x7fffffff ) + 0x1000; // rounded value
    if( val >= 0x47800000 )                    // might be or become NaN/Inf
    {                                          // avoid Inf due to rounding
        if( ( fbits & 0x7fffffff ) >= 0x47800000 )
        {                                      // is or must become NaN/Inf
            if( val < 0x7f800000 )             // was value but too large
                return sign | 0x7c00;          // make it +/-Inf
            return sign | 0x7c00 |             // remains +/-Inf or NaN
                ( fbits & 0x007fffff ) >>> 13; // keep NaN (and Inf) bits
        }
        return sign | 0x7bff;                  // unrounded not quite Inf
    }
    if( val >= 0x38800000 )                    // remains normalized value
        return sign | val - 0x38000000 >>> 13; // exp - 127 + 15
    if( val < 0x33000000 )                     // too small for subnormal
        return sign;                           // becomes +/-0
    val = ( fbits & 0x7fffffff ) >>> 23;       // tmp exp for subnormal calc
    return sign | ( ( fbits & 0x7fffff | 0x800000 ) // add subnormal bit
        + ( 0x800000 >>> val - 102 )           // round depending on cut off
        >>> 126 - val );                       // div by 2^(1-(exp-127+15)) and >> 13 | exp=0
}
Comments:
- Jeroen Vannevel: Does it work? Does it do as you expect?
- Robin Rodricks: Even if it encodes and decodes correctly, I'm not an expert in floating-point numbers, so I won't know if there's a bug in there!
- dreza: Do you have any unit tests? I would start there in answering "is this correct?"
2 Answers
Is this correct?

Maybe.

For a start, the >>> operator doesn't exist in C#. Taking a guess, I replaced >>> with >> and then wrote the following 'unit test' for it:
static void assertFloat(float fval)
{
    int i = fromFloat(fval);
    float f2 = toFloat(i);
    // compare the round-tripped value; NaN is never equal to itself,
    // so it must be checked separately
    if (float.IsNaN(fval) ? !float.IsNaN(f2) : fval != f2)
        throw new ApplicationException();
}
static void Main(string[] args)
{
    assertFloat(0);
    assertFloat(1);
    assertFloat(0.5f);
    assertFloat(-0.5f);
    assertFloat(-0);
    assertFloat(float.PositiveInfinity);
    assertFloat(float.NaN);
    float big = 1024 * 1024;
    big *= big;
    assertFloat(big);
}
It failed the second test: 1.0 is round-trip-converted to 1.000122.

I am disappointed by an encoding scheme which cannot encode 1.0 exactly. However, you didn't say how exact you expect the round-trip to be, so I don't know whether it's correct.
The following is a list of some input with corresponding output:
0 -> 0
1 -> 1.000122
1.1 -> 1.099609
-1 -> -1.000122
0.5 -> 0.500061
-0.5 -> -0.500061
0.001 -> 0.001000404
5.5 -> 5.5
5.6 -> 5.601563
5.7 -> 5.699219
0 -> 0
Infinity -> Infinity
NaN -> NaN
1024 -> 1024.125
1048576 -> Infinity
So it seems approximately correct for those numbers.
I can't say whether it's the best encoding. For example, this encoding gains the ability to express decimals but loses the ability to express integers (they become approximated) and big integers (they become infinity).
A different encoding scheme could be devised (and might be more useful depending on your application) which cannot express decimals but which gains the ability to express (approximately) some numbers which are bigger-than-the-biggest integer.
There are a lot of 'magic numbers' (i.e. hard-coded constants) in the code. To inspect for correctness I would need to guess/reverse engineer the way in which you encode/use the 16 bits for your "half precision float" numbers. You could make it easier by documenting the format using comments: which bits do you use for what?
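For reference, the standard IEEE 754 binary16 layout (which the linked Java code appears to target, though that is an assumption) can be documented as named masks. The sketch below uses Python purely as an independent reference, since its struct module has a built-in binary16 codec (the 'e' format) to cross-check against:

```python
import struct

SIGN_MASK = 0x8000   # bit 15: sign
EXP_MASK  = 0x7c00   # bits 10-14: 5-bit exponent, bias 15
MANT_MASK = 0x03ff   # bits 0-9: 10-bit mantissa

def half_fields(hbits):
    """Split a 16-bit half-float pattern into (sign, exponent, mantissa)."""
    return ((hbits & SIGN_MASK) >> 15,
            (hbits & EXP_MASK) >> 10,
            hbits & MANT_MASK)

# 0x3c00 is 1.0: sign 0, biased exponent 15 (i.e. 2**0), mantissa 0
assert half_fields(0x3c00) == (0, 15, 0)
# cross-check against Python's built-in binary16 codec ('e' format)
assert struct.unpack('<e', (0x3c00).to_bytes(2, 'little'))[0] == 1.0
```

Naming the masks like this would let a reviewer check each magic number in the code against the field it is supposed to select.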
Note this comment in the post you linked to:
I see what you mean, but these NaN values won't be returned from Float.floatToIntBits, which normalizes all NaNs to 0x7fc00000. The rounded val can thus never become negative. Maybe it would be faster to use floatToRawIntBits (which does not do NaN normalization) and then deal with the overflow NaNs, i.e. by adding || val < 0 to the first branch. This would also allow preserving some of the extra NaN bits. I remember that I had planned to do this but couldn't find sufficient documentation on how to handle these bits and thus settled with normalized NaNs.
The BitConverter which you use may not (I haven't tested it) normalize NaN values in the same way.
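To illustrate why that matters (a Python sketch, since the question is about IEEE 754 bit patterns rather than any one language): many different 32-bit patterns are all NaN, and a converter that normalizes will collapse exactly the payload bits that this code's `>>> 13` branch tries to preserve:

```python
import struct

def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as an IEEE 754 single."""
    return struct.unpack('<f', bits.to_bytes(4, 'little'))[0]

quiet_nan   = 0x7fc00000  # the pattern Java's Float.floatToIntBits returns for every NaN
payload_nan = 0x7fc01234  # a NaN carrying extra payload bits
for bits in (quiet_nan, payload_nan):
    f = bits_to_float(bits)
    assert f != f  # NaN is the only value unequal to itself
```

Both patterns decode to NaN, so whether the extra bits of 0x7fc01234 survive a round trip depends entirely on whether the bit-level API normalizes them.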
The original OP linked to a spec for the half-precision floating-point format which says:

Integers between 0 and 2048 can be exactly represented

So my testing, which shows an error in round-tripping 1, suggests that this spec is NOT implemented correctly.
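Incidentally, that claim about small integers can be cross-checked against an independent binary16 implementation (here Python's struct 'e' format, used purely as a reference codec):

```python
import struct

# Per the half-precision spec, integers 0..2048 round-trip exactly.
for n in range(2049):
    half = struct.pack('<e', float(n))        # encode as binary16
    assert struct.unpack('<e', half)[0] == n  # decodes back exactly
# 2049 is the first integer that cannot be represented
# (round-to-nearest-even sends it to 2048)
assert struct.unpack('<e', struct.pack('<e', 2049.0))[0] == 2048.0
```

So a correct half-precision converter should round-trip 1 exactly; the code under review does not.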
When I test it, 1.0f is encoded as 0x3c00, which is correct according to the spec. So the bug is presumably being introduced in the toFloat method, specifically in this statement:
return BitConverter.ToSingle(BitConverter.GetBytes( ( hbits & 0x8000 ) << 16
    | exp << 13 | 0x3ff ), 0);
This is a line which the "obvious expert on the subject" said they "implemented [as a] small extension compared to the book".

In other words, you may have made a faithful translation from the Java, but the Java doesn't correctly/fully implement the spec (it tries to improve on the spec, perhaps at the cost of an inability to decode small integers exactly).
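The effect of that statement can be traced by hand for hbits = 0x3c00, i.e. 1.0 in half precision. A Python sketch of the same bit manipulation shows where the error comes from:

```python
import struct

hbits = 0x3c00                      # 1.0 as a half-precision pattern
mant  = hbits & 0x03ff              # 0: the mantissa is empty
exp   = (hbits & 0x7c00) + 0x1c000  # rebias the exponent: - 15 + 127
# mant == 0 and exp > 0x1c400, so the "smooth transition" branch fires
fbits = ((hbits & 0x8000) << 16) | (exp << 13) | 0x3ff
assert fbits == 0x3f8003ff          # "| 0x3ff" set the low mantissa bits
f = struct.unpack('<f', fbits.to_bytes(4, 'little'))[0]
assert f != 1.0                     # ~1.000122, not exactly 1.0
# without the "| 0x3ff" extension the result would be exact:
assert struct.unpack('<f', (0x3f800000).to_bytes(4, 'little'))[0] == 1.0
```

So the `| 0x3ff` in the "smooth transition" branch is the direct cause of the 1.000122 result above.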
Comments:
- Robin Rodricks: Absolutely mind-blowing review! Thanks a ton for your expertise. And yes, the "small extensions" can be disabled according to the "expert" ("You can safely remove the lines shown above if you don't want this extension."); see stackoverflow.com/a/6162687/1294758. After disabling this extension, does the accuracy improve?
- ChrisW: The accuracy improves at 1.0f: it converts to 1.0f exactly. You might like to write a set of unit tests to see how well it works at various specific values. You could post that unit-test code as a separate, follow-on question.
The XNA framework provides support for half floats. If you install it, you could copy the relevant DLL into your project.