I'm reading and writing half-precision floating-point numbers in C#. These are 16-bit floats, as opposed to the usual 32-bit floats and 64-bit doubles we are used to working with.
I've taken some highly tested Java code from an obvious "expert on the subject" here and modified it to work with C#. Is this correct?
// ignores the higher 16 bits
public static float toFloat( int hbits )
{
    int mant = hbits & 0x03ff;            // 10 bits mantissa
    int exp = hbits & 0x7c00;             // 5 bits exponent
    if( exp == 0x7c00 )                   // NaN/Inf
        exp = 0x3fc00;                    // -> NaN/Inf
    else if( exp != 0 )                   // normalized value
    {
        exp += 0x1c000;                   // exp - 15 + 127
        if( mant == 0 && exp > 0x1c400 )  // smooth transition
            return BitConverter.ToSingle(BitConverter.GetBytes(
                ( hbits & 0x8000 ) << 16 | exp << 13 | 0x3ff ), 0);
    }
    else if( mant != 0 )                  // && exp==0 -> subnormal
    {
        exp = 0x1c400;                    // make it normal
        do {
            mant <<= 1;                   // mantissa * 2
            exp -= 0x400;                 // decrease exp by 1
        } while( ( mant & 0x400 ) == 0 ); // while not normal
        mant &= 0x3ff;                    // discard subnormal bit
    }                                     // else +/-0 -> +/-0
    return BitConverter.ToSingle(BitConverter.GetBytes( // combine all parts
        ( hbits & 0x8000 ) << 16          // sign << ( 31 - 15 )
        | ( exp | mant ) << 13 ), 0);     // value << ( 23 - 10 )
}
// returns all higher 16 bits as 0 for all results
public static int fromFloat( float fval )
{
    int fbits = BitConverter.ToInt32(BitConverter.GetBytes(fval), 0);
    int sign = fbits >>> 16 & 0x8000;          // sign only
    int val = ( fbits & 0x7fffffff ) + 0x1000; // rounded value
    if( val >= 0x47800000 )                    // might be or become NaN/Inf
    {                                          // avoid Inf due to rounding
        if( ( fbits & 0x7fffffff ) >= 0x47800000 )
        {                                      // is or must become NaN/Inf
            if( val < 0x7f800000 )             // was value but too large
                return sign | 0x7c00;          // make it +/-Inf
            return sign | 0x7c00 |             // remains +/-Inf or NaN
                ( fbits & 0x007fffff ) >>> 13; // keep NaN (and Inf) bits
        }
        return sign | 0x7bff;                  // unrounded not quite Inf
    }
    if( val >= 0x38800000 )                    // remains normalized value
        return sign | val - 0x38000000 >>> 13; // exp - 127 + 15
    if( val < 0x33000000 )                     // too small for subnormal
        return sign;                           // becomes +/-0
    val = ( fbits & 0x7fffffff ) >>> 23;       // tmp exp for subnormal calc
    return sign | ( ( fbits & 0x7fffff | 0x800000 ) // add subnormal bit
        + ( 0x800000 >>> val - 102 )           // round depending on cut off
        >>> 126 - val );                       // div by 2^(1-(exp-127+15)) and >> 13 | exp=0
}
Comments:
- Jeroen Vannevel: Does it work? Does it do as you expect?
- Robin Rodricks: Even if it encodes and decodes correctly, I'm not an expert in floating-point numbers, so I won't know if there's a bug in there!
- dreza: Do you have any unit tests? I would start there in answering "is this correct?"
2 Answers
Is this correct?

Maybe.

For a start, the >>> operator doesn't exist in C#. Taking a guess, I replaced >>> with >> and then wrote the following 'unit test' for it:
static void assertFloat(float fval)
{
    int i = fromFloat(fval);
    float f2 = toFloat(i);
    // compare the round-tripped value; NaN is never equal to itself,
    // so it must be checked separately
    if (float.IsNaN(fval) ? !float.IsNaN(f2) : fval != f2)
        throw new ApplicationException();
}
static void Main(string[] args)
{
    assertFloat(0);
    assertFloat(1);
    assertFloat(0.5f);
    assertFloat(-0.5f);
    assertFloat(-0);
    assertFloat(float.PositiveInfinity);
    assertFloat(float.NaN);
    float big = 1024 * 1024;
    big *= big;
    assertFloat(big);
}
It failed the second test: 1.0 is round-trip-converted to 1.000122.

I am disappointed by an encoding scheme which cannot encode 1.0 exactly. However, you didn't say how exact you expect the round-trip to be, so I don't know whether it's correct.
The following is a list of some input with corresponding output:
0 -> 0
1 -> 1.000122
1.1 -> 1.099609
-1 -> -1.000122
0.5 -> 0.500061
-0.5 -> -0.500061
0.001 -> 0.001000404
5.5 -> 5.5
5.6 -> 5.601563
5.7 -> 5.699219
0 -> 0
Infinity -> Infinity
NaN -> NaN
1024 -> 1024.125
1048576 -> Infinity
So it seems approximately correct for those numbers.
I can't say whether it's the best encoding. For example, this encoding gains the ability to express decimals but loses the ability to express integers (they become approximated) and big integers (they become infinity).
A different encoding scheme could be devised (and might be more useful depending on your application) which cannot express decimals but which gains the ability to express (approximately) some numbers which are bigger-than-the-biggest integer.
There are a lot of 'magic numbers' (i.e. hard-coded constants) in the code. To inspect for correctness I would need to guess/reverse engineer the way in which you encode/use the 16 bits for your "half precision float" numbers. You could make it easier by documenting the format using comments: which bits do you use for what?
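For reference, the standard IEEE 754 binary16 layout (which the linked Java code appears to target, though that is an assumption) can be documented as named masks. The sketch below uses Python purely as an independent reference, since its struct module has a built-in binary16 codec (the 'e' format) to cross-check against:

```python
import struct

SIGN_MASK = 0x8000   # bit 15: sign
EXP_MASK  = 0x7c00   # bits 10-14: 5-bit exponent, bias 15
MANT_MASK = 0x03ff   # bits 0-9: 10-bit mantissa

def half_fields(hbits):
    """Split a 16-bit half-float pattern into (sign, exponent, mantissa)."""
    return ((hbits & SIGN_MASK) >> 15,
            (hbits & EXP_MASK) >> 10,
            hbits & MANT_MASK)

# 0x3c00 is 1.0: sign 0, biased exponent 15 (i.e. 2**0), mantissa 0
assert half_fields(0x3c00) == (0, 15, 0)
# cross-check against Python's built-in binary16 codec ('e' format)
assert struct.unpack('<e', (0x3c00).to_bytes(2, 'little'))[0] == 1.0
```

Naming the masks like this would let a reviewer check each magic number in the code against the field it is supposed to select.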
Note this comment in the post you linked to:
I see what you mean, but these NaN values won't be returned from Float.floatToIntBits, which normalizes all NaNs to 0x7fc00000. The rounded val can thus never become negative. Maybe it would be faster to use floatToRawIntBits (which does not do NaN normalization) and then deal with the overflow NaNs, i.e. by adding || val < 0 to the first branch. This would also allow preserving some of the extra NaN bits. I remember that I had planned to do this but couldn't find sufficient documentation on how to handle these bits and thus settled with normalized NaNs.
The BitConverter which you use may not (I haven't tested it) normalize NaN values in the same way.
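To illustrate why that matters (a Python sketch, since the question is about IEEE 754 bit patterns rather than any one language): many different 32-bit patterns are all NaN, and a converter that normalizes will collapse exactly the payload bits that this code's `>>> 13` branch tries to preserve:

```python
import struct

def bits_to_float(bits):
    """Reinterpret a 32-bit pattern as an IEEE 754 single."""
    return struct.unpack('<f', bits.to_bytes(4, 'little'))[0]

quiet_nan   = 0x7fc00000  # the pattern Java's Float.floatToIntBits returns for every NaN
payload_nan = 0x7fc01234  # a NaN carrying extra payload bits
for bits in (quiet_nan, payload_nan):
    f = bits_to_float(bits)
    assert f != f  # NaN is the only value unequal to itself
```

Both patterns decode to NaN, so whether the extra bits of 0x7fc01234 survive a round trip depends entirely on whether the bit-level API normalizes them.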
The original OP linked to a spec for the half-precision floating-point format which says:

Integers between 0 and 2048 can be exactly represented

So my testing, which shows an error in round-tripping 1, suggests that this spec is NOT implemented correctly.
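Incidentally, that claim about small integers can be cross-checked against an independent binary16 implementation (here Python's struct 'e' format, used purely as a reference codec):

```python
import struct

# Per the half-precision spec, integers 0..2048 round-trip exactly.
for n in range(2049):
    half = struct.pack('<e', float(n))        # encode as binary16
    assert struct.unpack('<e', half)[0] == n  # decodes back exactly
# 2049 is the first integer that cannot be represented
# (round-to-nearest-even sends it to 2048)
assert struct.unpack('<e', struct.pack('<e', 2049.0))[0] == 2048.0
```

So a correct half-precision converter should round-trip 1 exactly; the code under review does not.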
When I test it, 1.0f is encoded as 0x3c00, which is correct according to the spec. So the bug is presumably being introduced in the toFloat method, specifically in this statement:
return BitConverter.ToSingle(BitConverter.GetBytes( ( hbits & 0x8000 ) << 16
    | exp << 13 | 0x3ff ), 0);
This is a line which the "obvious expert on the subject" said they "implemented [as a] small extension compared to the book".

In other words, you may have made a faithful translation from the Java, but the Java doesn't correctly/fully implement the spec (it tries to improve on the spec, perhaps at the cost of an inability to decode small integers exactly).
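The effect of that statement can be traced by hand for hbits = 0x3c00, i.e. 1.0 in half precision. A Python sketch of the same bit manipulation shows where the error comes from:

```python
import struct

hbits = 0x3c00                      # 1.0 as a half-precision pattern
mant  = hbits & 0x03ff              # 0: the mantissa is empty
exp   = (hbits & 0x7c00) + 0x1c000  # rebias the exponent: - 15 + 127
# mant == 0 and exp > 0x1c400, so the "smooth transition" branch fires
fbits = ((hbits & 0x8000) << 16) | (exp << 13) | 0x3ff
assert fbits == 0x3f8003ff          # "| 0x3ff" set the low mantissa bits
f = struct.unpack('<f', fbits.to_bytes(4, 'little'))[0]
assert f != 1.0                     # ~1.000122, not exactly 1.0
# without the "| 0x3ff" extension the result would be exact:
assert struct.unpack('<f', (0x3f800000).to_bytes(4, 'little'))[0] == 1.0
```

So the `| 0x3ff` in the "smooth transition" branch is the direct cause of the 1.000122 result above.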
Comments:
- Robin Rodricks: Absolutely mind-blowing review! Thanks a ton for your expertise. And yes, the "small extensions" can be disabled according to the "expert" ("You can safely remove the lines shown above if you don't want this extension."); see stackoverflow.com/a/6162687/1294758. After disabling this extension, does the accuracy improve?
- ChrisW: The accuracy improves at 1.0f: it converts to 1.0f exactly. You might like to write a set of unit tests to see how well it works at various specific values. You could post that unit-test code as a separate, follow-on question.
The XNA framework provides support for half floats. If you install it, you could copy the relevant DLL into your project.