Addition of two IEEE754-32bit single precision floating point numbers in C++

Question 1

I made a C++ program which takes a list of triplets with S.No. from a text file, where the triplet, i.e. a,b,c correspond to a+b=c (addition done using float data type). Now I convert a and b from hex to 32-bit binary numbers and extract sign, mantissa and fractional part and then add them. Finally I convert the sum back to hexadecimal representation and compare to c.

Explanation:
[Note: '0b': binary '0x': hex]
Take the test case 4 be954bb1 c2a2c2e1 c2a3582d.

Now a = 0xbe954bb1 = 0b10111110100101010100101110110001 .
And b = 0xc2a2c2e1 = 0b11000010101000101100001011100001 .
In IEEE754 first bit is sign which is 1 for both hence both are negative. next eight bits are for exponent i.e. 0b01111101 for a and 0b10000101 which correspond to 125 and 133 in decimal. These exponents have a offset of 127 so actual exponents are 125-127=-2 and 133-127=6
Rest bits are mantissa and the actual floating point number is 1.mantissa x 2^exponent where 1.mantissa is in binary. So our numbers are 1.00101010100101110110001 x 2^-2 and 1.01000101100001011100001 x 2^6
For adding we make the exponent same (the larger one, i.e. 6), hence we have 1.00101010100101110110001 x 2^-2 + 1.01000101100001011100001 x 2^6 = 0.0000000100101010100101110110001 x 2^6 + 1.01000101100001011100001 x 2^6 = ...

Code:

Now I extracted the sign, magnitude and mantissa. Note that the above mantissa will correspond to 0b00101010100101110110001 and 0b01000101100001011100001 and that is 1395633 and 2278113 in decimal.
We are working with integers only so we will not multiply both by same factor, i.e. 2^-23 to convert to .mantissa rather we add 2^23 to both to get 9784241 and 10666721 (which is 0b1mantissa).
Just forget the actual exponents and divide the mantissa of smaller exponent by 2^difference of exponents. Hence we get 9784241/(2^8)=38220 (round to nearest even) and 10666721
We can now add them to get 10704941 = 0b101000110101100000101101. Now we have exactly 24 digits in binary, we just remove the 2^23 to get 2316333 = 0b01000110101100000101101(extra '0' at start to make 23 digits).
Now actual expoenents didn't matter so we take exponent of sum as 133 (the higher one, since we shifted the lower one's mantissa)
The sign will be negative hence sign bit is 1
The sum can now be represented as 0b11000010101000110101100000101101=わ0xc2a3582d which is the sum. I hope you can understand the rest of it, which contains minor other facts about IEEE754-single precision 32 bit floating point decimal.

#include <cstdlib>
#include <fstream>
#include <iostream>
#include <algorithm>
using namespace std;
#define SIGN 0
#define EXP 1
#define FRAC 2
#define MANTISSA_MAX ((2<<22)-1)
#define HIDDEN_MANTISSA (2 << 22)
#define max(a,b) ((a>b)?a:b)
#define min(a,b) ((a>b)?b:a)
#define zero(x) (x[EXP] == 0 && x[FRAC] == 0)
#define denorm(x) (x[EXP] == 0 && x[FRAC] != 0)
#define sign(x) ((x[SIGN]==1)?-1:1)
#define hideMantissaBit(x) (x=x&MANTISSA_MAX);
#define convertToSignificand(x) x[FRAC]=((!denorm(x))?(x[FRAC]+HIDDEN_MANTISSA):(x[FRAC]));x[EXP]+=(denorm(x))?1:0
#define shift(x,y) x[FRAC] = shiftAndRound(x[FRAC],max(y[EXP] - x[EXP], 0));
#define renormalize(x,d) x[FRAC]=(d>0)?(shiftAndRound(x[FRAC],d)):(x[FRAC]<<(-d));
int msb_length(long l) {
 int cnt = 0;
 while (l >= 1) {
 l >>= 1;
 cnt++;
 }
 return cnt;
}
long shiftAndRound(long x, int d) {
 d = min(d, msb_length(x));
 int result = x >> d;
 if (d == 0)
 return result;
 int r = (x & ((d > 1) ? (2 << (d - 2)) : 1)) >> (d - 1);
 if (r == 1) {
 int lb = result & 1;
 int s = 0;
 for (int i = 0; i < d - 1 && s == 0; i++) {
 s |= x & 1;
 x >>= 1;
 }
 if (s == 1 || lb == 1)
 result += 1;
 }
 return result;
}
long* get(long x) {
 long* arr;
 arr = new long[3];
 arr[SIGN] = (x & 0x80000000) >> 31;
 arr[EXP] = (x & 0x7F800000) >> 23;
 arr[FRAC] = (x & 0x7FFFFF);
 return arr;
}
long add(long x, long y) {
 long* a = get(x);
 long* b = get(y);
 if (zero(a))
 return y;
 if (zero(b))
 return x;
 convertToSignificand(a);
 convertToSignificand(b);
 if (a[EXP] != b[EXP]) {
 shift(a, b)
 shift(b, a)
 }
 long* sum;
 sum = new long[3];
 sum[FRAC] = sign(a) * a[FRAC] + sign(b) * b[FRAC];
 sum[SIGN] = ((sum[FRAC] >= 0) ? 0 : 1);
 sum[FRAC] = abs(sum[FRAC]);
 if (sum[FRAC] != 0) {
 sum[EXP] = max(a[EXP], b[EXP]);
 int deltaExp = msb_length(sum[FRAC]) - 24;
 sum[EXP] += deltaExp;
 renormalize(sum, deltaExp)
 if (!denorm(sum))
 hideMantissaBit(sum[FRAC])
 } else {
 sum[EXP] = 0;
 }
 return (sum[SIGN] << 31) + (sum[EXP] << 23) + sum[FRAC];
}
int main() {
 ifstream inputStream;
 inputStream.open("src/input.txt");
 int r = 0, w = 0;
 long i = 0;
 while (!inputStream.eof()) {
 long a;
 long b;
 long c;
 inputStream >> dec >> i;
 inputStream >> hex >> a;
 inputStream >> hex >> b;
 inputStream >> hex >> c;
 if (c == add(a, b)) {
 r++;
 cout << "Test " << i << " PASSED" << endl;
 } else {
 w++;
 cout << "Test " << i << " FAILED" << endl;
 }
 }
 inputStream.close();
 cout << "Total " << r << " " << "PASSED " << w << " FAILED." << endl;
}

Please provide your views on my implementation. If you need any clarification, please ask me. For more details see IEEE754 Converter and Adding Floating Point Numbers.

Sample input:

0 c005de11 42f4b8e4 42f089f3
1 c2bd4228 3fb12f62 c2ba7d6a
2 3d8c556d 3fede65f 3ff6abb6
3 c0e9454c 3ee3960f c0db0beb
4 be954bb1 c2a2c2e1 c2a3582d
5 c3d0f1de c288a91c c3f31c25
6 bf6318c3 bef260b1 bfae248e
7 3f425736 bff456f0 bf932b55
8 bf367049 c0d76ab9 c0ee38c2
9 c29910c4 c0978ddf c2a289a2
10 c09f1c01 c2678d1a c27b709a
11 407d8462 c1874c4d c14f3782
12 c1b32297 c1910602 c222144c
13 409d6975 407a73f4 410d51b8
14 43365262 bf5fa00e 433572c2
15 c1481a5d 4331b215 4325306f
16 c39bd96b c36cec1e c40927bd
17 3efe0c78 c1e306f0 c1df0ebe
18 c2996785 3df31cb1 c2992abe
19 be6877d9 bdf46d42 beb1573d
20 40bb7d87 4054ab2d 4112e98f
21 be439c4e c29a36fb c29a98c9
22 41a4b54f 40921821 41c93b57
23 bfeb3480 bf6c8ae3 c030bcf9
24 c2e65c67 bff8fc3c c2ea4058
25 41e6caec 410dd1eb 4216d9f1
26 3e1132e1 c163979c c16152d0
27 bfc2c685 c00db62a c06f196c
28 bfb1b31f c20c043a c21191d3
29 c14aa2c2 4054b2a8 c1157618
30 c3c2ca3a bead7855 c3c2f598
31 bea8ba44 c2545034 c255a1a9
32 c27a466e 403312fc c26f153e
33 c2b5fa29 c027de73 c2bb391d
34 3e3b1571 3f8f0192 3fa66440
35 41d7e271 43ba9c1a 43c81a41
36 c1fee0c1 c109e53b c221e9af
37 c17067b9 3fff215b c150838e
38 42ac2fc8 c246d968 42118628
39 bf2b141e 40eccb24 40d768a0
...

Question 2

You should provide more information about your constraints, because otherwise the clearest code would obviously be *(double*)&c = *(double*)&a + *(double*)&b; instead of all this stuff. Your test harness would also be more compelling if it compared the output of your bit-twiddling code to the naive floating-point implementation, instead of requiring the input file to contain the answer in hex.

Question 3

You also leak memory like the dickens, and your use of macros is quite unusual for C++. Consider rewriting your macros as static inline functions instead.

Question 4

@ADG float, then.

Question 5

@ADG Most C/C++ implementations use IEEE754 32-bit single precision for float and IEEE754 64-bit double precision for double - implementations that do not use these representations are very rare, even among implementations that do not fully conform to the standard for operations. And *(float*)&x isn't a cast to a float, it is a "type-pun" via a pointer cast (generally in C++ you would use reinterpret_cast). You could also convert it to float or double in a portable way with ldexp[f].

Question 6

@Quuxplusone oh sorry.

Question 7

I took the liberty of rewriting your code:

#include <fstream>
#include <iostream>
#include <sstream>
#include <random>
#define min_float 0x00000000
#define max_float 0xffffffff
 #define exponent(x) (x << 1) >> 24
 #define mantissa(x) (x << 9) >> 9
 #define sign(x) x >> 31
uint32_t add(uint32_t x, uint32_t y) {
 uint32_t result_mantissa;
 uint32_t result_exponent;
 uint32_t result_sign;
 uint32_t different_sign = sign(x) ^ sign(y); //boolean but lets not do any type casting
 // catch NaN
 if (!(exponent(x) ^ 0xFF) && mantissa(x)) return x;
 if (!(exponent(y) ^ 0xFF) && mantissa(y)) return y;
 // catch Inf
 if (!(exponent(x) ^ 0xFF) && !(exponent(y) ^ 0xFF)) {
 // both are inf
 if (different_sign)
 // Inf - Inf
 return 0x7F800000 + 1; // NaN
 else
 // both Inf or -Inf
 return x;
 }
 else if (!(exponent(x) ^ 0xFF)) return x;
 else if (!(exponent(y) ^ 0xFF)) return y;
 // both numbers are non-special
 uint32_t exp_difference;
 if (different_sign) {
 exp_difference = exponent(y) + exponent(x);
 }
 else {
 // no need to account for constant BO
 // beware of underflow
 if (exponent(x) > exponent(y)) exp_difference = exponent(x) - exponent(y);
 else exp_difference = exponent(y) - exponent(x);
 }
 bool x_bigger_abs;
 if (exponent(x) > exponent(y)) x_bigger_abs = true;
 else if (exponent(x) < exponent(y)) x_bigger_abs = false;
 else if (mantissa(x) > mantissa(y)) x_bigger_abs = true;
 else x_bigger_abs = false;
 if (!different_sign) {
 //both numbers have same sign (this is a sum)
 result_sign = sign(x);
 if (x_bigger_abs) {
 result_mantissa = (mantissa(x) << 1) + (mantissa(y) << 1) >> exp_difference;
 result_exponent = exponent(x);
 }
 else {
 result_mantissa = (mantissa(y) << 1) + ((mantissa(x) << 1) >> exp_difference);
 result_exponent = exponent(y);
 }
 if (result_mantissa << 31) result_mantissa = (result_mantissa >> 1) + 1;
 else result_mantissa = (result_mantissa >> 1);
 }
 else {
 // this actually is a subtraction
 if (x_bigger_abs) {
 result_sign = sign(x);
 result_exponent = exponent(x);
 // subtract and round to 23 bit 
 // this means making room in our 32bit representation
 result_mantissa = (mantissa(x) << 1) - ((mantissa(y) << 1) >> exp_difference );
 }
 else {
 result_sign = sign(y);
 result_exponent = exponent(y);
 // subtract and round to 23 bit 
 // this means making room in our 32bit representation
 result_mantissa = (mantissa(y) << 1) - ((mantissa(x) << 1) >> exp_difference);
 }
 if (result_mantissa << 31) result_mantissa = ((result_mantissa >> 1) + 1);
 else result_mantissa = (result_mantissa >> 1);
 // normalize mantissa
 uint32_t temp = result_mantissa << 9;
 for (uint32_t count = 0; count < 23; ++count) {
 if (!((temp << count) >> 31)) {
 result_mantissa <<= ++count; // leading 1, so shift 1 more time
 result_exponent -= count;
 break;
 }
 }
 }
 uint32_t result = result_sign << 31 | result_exponent << 23 | result_mantissa;
 return result;
}
int main() {
 std::ifstream file;
 file.open("data.txt");
 if (!file.is_open()) {
 std::cout << "Could not open file." << std::endl;
 }
 int num_passed = 0, num_failed = 0;
 for (int num_tests = 0; num_tests < 1000; num_tests++) {
 uint32_t number1 = rand() % max_float; // generate a pseudo-random pattern
 uint32_t number2 = rand() % max_float; // generate a pseudo-random pattern
 float sum = *(float*) &number1 + *(float*) &number2;
 // compare resulting binary patterns
 if (*(uint32_t*)&sum & add(number1, number2)) {
 num_passed++;
 }
 else {
 std::cout << "Expected: " << *(uint32_t*) &sum << " pattern but recieved " << add(number1, number2) << std::endl;
 num_failed++;
 }
 }
 std::cout << "RNG test -- compared to float: Total " << num_passed << " " << "PASSED " << num_failed << " FAILED." << std::endl;
 num_passed = 0;
 num_failed = 0;
 uint32_t i;
 uint32_t a, b, c;
 std::string line;
 while (getline(file, line)) {
 std::istringstream iss(line);
 //std::cout << line << endl;
 iss >> std::dec >> i;
 iss >> std::hex >> a;
 iss >> std::hex >> b;
 iss >> std::hex >> c;
 iss.clear();
 if (c & add(a, b)) {
 num_passed++;
 }
 else {
 num_failed++;
 }
 }
 std::cout << "Hex test -- compared to file: Total " << num_passed << " " << "PASSED " << num_failed << " FAILED." << std::endl;
 file.close();
}

The thoughts:

general

You are not handling the special cases of the IEEE754, which are Inf and NaN values. You will have to catch them in your add function and perform the according actions. NaN + X = NaN, Inf + -Inf = NaN, ...

main()

For what ever reason your way of reading the file caught me in an infinite loop, so I adjusted the reading.

I changed the long type variables to uint32_t. This is significant. C++ does not require any specific data model, but instead gives a set range in which a number has to be. So by using long you can end up getting a 2C, a 1C or even a uint64 with a sufficient BO, depending on your compiler (and ultimately your CPU). The tests perform accordingly:

Total 14 PASSED 26 FAILED. (using long == 2C on my machine)
Total 38 PASSED 2 FAILED. (using uint32_t)

The biggest problem comes when bitshifting (<< and >>) 2C or 1C numbers. With signed numbers the result is rarely usefull, but with unsigned numbers the result is a multiplication with 2^N. Latter is abused when shifting the mantissa to do addition / subtraction.

add

As mentioned by @Quuxplusone you are leaking memory because you never delete sum, a, b.

Same story with signed vs unsigned integers.

The code becomes a lot more readable, if you use preprocessor macros for the individual parts of float

#define exponent(x) (x << 1) >> 24
#define mantissa(x) (x << 9) >> 9
#define sign(x) x >> 31

it conveniently resolves the memory leakage and removes the need for the get function.

shiftAndRound()

Your rounding is not very clean. You know that the mantissa is at most 23 bit and you have access to 32 bit, just shift to the left by 1 to make room for the 24th bit (on which rounding depends)

...
result = result >> (d - 1)
if (result << 31) result = (result >> 1) + 1;//round up
else result = (result >> 1); //round down

Also I don't know why you round after shifting and not after the addition. Its a lot more hairy. Especially with above mentioned access to 32bit storage and 23bit variables (and unsigned integers), doing it after the addition is simply:

//calculate with a 24-bit mantissa and round down to 23 bit
result = big_mantissa << 1 + (small_mantissa << 1) >> d
if (result << 31) result = (result >> 1) + 1; //round up
else result = (result >> 1); //round down

Question 8

I believe if you keep a tolerance in error then all answers might be just off by few bits (due to rounding)

Question 9

@ADG As I mentioned in my edit, I send the answer to early. It will probably take some time until I am satisfied with the format :)

Question 10

@ADG finished. Had to find a bug in the code. For some reason c == add(a,b) does work with long but not with uint32_t, so I had to use c & add(a,b) instead. Please check if any questions remain open

Question 11

@FirefoxMetzger Thanks but the output is wrong, e.g., c005de11 +42f4b8e4 should give 42f089f3 (120.269), but your program gives 4376d698 (246.838). Worst, c&add(a,b) marks it as passed. Where is the error?

Question 12

Here's why you are going to have problems with memory leaks given the above code:

In function long* get(long x) one lines 1 and 2 you have:

long* arr;
arr = new long[3];

But you then never do delete arr; in order to deallocate the memory given by the new operator. you also do it again in function long add(long x, long y where you have on lines 13 and 14 the same thing again:

long* sum;
sum = new long[3];

Neither of the memory locations referenced by these pointers will be cleared until the program exits and the OS cleans up memory for you, since in C++ anything you explicitly allocate with new like: foo* fooptr = new foo[5]; does not get cleaned up when they go out of scope at the end of the function. C++ expects anything explicitly allocated with new to be explicitly deallocated with delete once you are done with the memory.

However I cannot see any reason why long arr[3]; isn't being used in these cases since it (for all intents and purposes) does the same thing, but with the added advantage that the memory allocated when using the [] operator will be cleaned up automatically once the function goes out of scope.

Both long* arr = new long[3]; and long arr[3] can be accessed exactly the same ways, they are functionally identical with the only difference being "who is taking care of the memory allocations?" I recommend letting the compiler take care of memory allocations for you when you are not actively passing the pointer around between different scopes, since effectively all you have done with these two instances is dropped the pointers making the memory inaccessible until program termination.

Question 13

your case will fail with number 1 = 1.9375 and number2 = 1.9375. in this case mantissa for both numbers is 7864320. output sum is 1.875 which is wrong

Question 14

@Graipher Pointing out a bug that was unknown to the author is a valid answer. Furthermore, there is no obligation to suggest how to fix the bug.

Question 15

I am facing the issue with those numbers. Can someone suggest a solution for this??

FirefoxMetzger FirefoxMetzger 1,1016 silver badges11 bronze badges · Accepted Answer · 2016-08-28 06:34:17Z

I took the liberty of rewriting your code:

#include <fstream>
#include <iostream>
#include <sstream>
#include <random>
#define min_float 0x00000000
#define max_float 0xffffffff
 #define exponent(x) (x << 1) >> 24
 #define mantissa(x) (x << 9) >> 9
 #define sign(x) x >> 31
uint32_t add(uint32_t x, uint32_t y) {
 uint32_t result_mantissa;
 uint32_t result_exponent;
 uint32_t result_sign;
 uint32_t different_sign = sign(x) ^ sign(y); //boolean but lets not do any type casting
 // catch NaN
 if (!(exponent(x) ^ 0xFF) && mantissa(x)) return x;
 if (!(exponent(y) ^ 0xFF) && mantissa(y)) return y;
 // catch Inf
 if (!(exponent(x) ^ 0xFF) && !(exponent(y) ^ 0xFF)) {
 // both are inf
 if (different_sign)
 // Inf - Inf
 return 0x7F800000 + 1; // NaN
 else
 // both Inf or -Inf
 return x;
 }
 else if (!(exponent(x) ^ 0xFF)) return x;
 else if (!(exponent(y) ^ 0xFF)) return y;
 // both numbers are non-special
 uint32_t exp_difference;
 if (different_sign) {
 exp_difference = exponent(y) + exponent(x);
 }
 else {
 // no need to account for constant BO
 // beware of underflow
 if (exponent(x) > exponent(y)) exp_difference = exponent(x) - exponent(y);
 else exp_difference = exponent(y) - exponent(x);
 }
 bool x_bigger_abs;
 if (exponent(x) > exponent(y)) x_bigger_abs = true;
 else if (exponent(x) < exponent(y)) x_bigger_abs = false;
 else if (mantissa(x) > mantissa(y)) x_bigger_abs = true;
 else x_bigger_abs = false;
 if (!different_sign) {
 //both numbers have same sign (this is a sum)
 result_sign = sign(x);
 if (x_bigger_abs) {
 result_mantissa = (mantissa(x) << 1) + (mantissa(y) << 1) >> exp_difference;
 result_exponent = exponent(x);
 }
 else {
 result_mantissa = (mantissa(y) << 1) + ((mantissa(x) << 1) >> exp_difference);
 result_exponent = exponent(y);
 }
 if (result_mantissa << 31) result_mantissa = (result_mantissa >> 1) + 1;
 else result_mantissa = (result_mantissa >> 1);
 }
 else {
 // this actually is a subtraction
 if (x_bigger_abs) {
 result_sign = sign(x);
 result_exponent = exponent(x);
 // subtract and round to 23 bit 
 // this means making room in our 32bit representation
 result_mantissa = (mantissa(x) << 1) - ((mantissa(y) << 1) >> exp_difference );
 }
 else {
 result_sign = sign(y);
 result_exponent = exponent(y);
 // subtract and round to 23 bit 
 // this means making room in our 32bit representation
 result_mantissa = (mantissa(y) << 1) - ((mantissa(x) << 1) >> exp_difference);
 }
 if (result_mantissa << 31) result_mantissa = ((result_mantissa >> 1) + 1);
 else result_mantissa = (result_mantissa >> 1);
 // normalize mantissa
 uint32_t temp = result_mantissa << 9;
 for (uint32_t count = 0; count < 23; ++count) {
 if (!((temp << count) >> 31)) {
 result_mantissa <<= ++count; // leading 1, so shift 1 more time
 result_exponent -= count;
 break;
 }
 }
 }
 uint32_t result = result_sign << 31 | result_exponent << 23 | result_mantissa;
 return result;
}
int main() {
 std::ifstream file;
 file.open("data.txt");
 if (!file.is_open()) {
 std::cout << "Could not open file." << std::endl;
 }
 int num_passed = 0, num_failed = 0;
 for (int num_tests = 0; num_tests < 1000; num_tests++) {
 uint32_t number1 = rand() % max_float; // generate a pseudo-random pattern
 uint32_t number2 = rand() % max_float; // generate a pseudo-random pattern
 float sum = *(float*) &number1 + *(float*) &number2;
 // compare resulting binary patterns
 if (*(uint32_t*)&sum & add(number1, number2)) {
 num_passed++;
 }
 else {
 std::cout << "Expected: " << *(uint32_t*) &sum << " pattern but recieved " << add(number1, number2) << std::endl;
 num_failed++;
 }
 }
 std::cout << "RNG test -- compared to float: Total " << num_passed << " " << "PASSED " << num_failed << " FAILED." << std::endl;
 num_passed = 0;
 num_failed = 0;
 uint32_t i;
 uint32_t a, b, c;
 std::string line;
 while (getline(file, line)) {
 std::istringstream iss(line);
 //std::cout << line << endl;
 iss >> std::dec >> i;
 iss >> std::hex >> a;
 iss >> std::hex >> b;
 iss >> std::hex >> c;
 iss.clear();
 if (c & add(a, b)) {
 num_passed++;
 }
 else {
 num_failed++;
 }
 }
 std::cout << "Hex test -- compared to file: Total " << num_passed << " " << "PASSED " << num_failed << " FAILED." << std::endl;
 file.close();
}

The thoughts:

general

You are not handling the special cases of the IEEE754, which are Inf and NaN values. You will have to catch them in your add function and perform the according actions. NaN + X = NaN, Inf + -Inf = NaN, ...

main()

For what ever reason your way of reading the file caught me in an infinite loop, so I adjusted the reading.

I changed the long type variables to uint32_t. This is significant. C++ does not require any specific data model, but instead gives a set range in which a number has to be. So by using long you can end up getting a 2C, a 1C or even a uint64 with a sufficient BO, depending on your compiler (and ultimately your CPU). The tests perform accordingly:

Total 14 PASSED 26 FAILED. (using long == 2C on my machine)
Total 38 PASSED 2 FAILED. (using uint32_t)

The biggest problem comes when bitshifting (<< and >>) 2C or 1C numbers. With signed numbers the result is rarely usefull, but with unsigned numbers the result is a multiplication with 2^N. Latter is abused when shifting the mantissa to do addition / subtraction.

add

As mentioned by @Quuxplusone you are leaking memory because you never delete sum, a, b.

Same story with signed vs unsigned integers.

The code becomes a lot more readable, if you use preprocessor macros for the individual parts of float

#define exponent(x) (x << 1) >> 24
#define mantissa(x) (x << 9) >> 9
#define sign(x) x >> 31

it conveniently resolves the memory leakage and removes the need for the get function.

shiftAndRound()

Your rounding is not very clean. You know that the mantissa is at most 23 bit and you have access to 32 bit, just shift to the left by 1 to make room for the 24th bit (on which rounding depends)

...
result = result >> (d - 1)
if (result << 31) result = (result >> 1) + 1;//round up
else result = (result >> 1); //round down

Also I don't know why you round after shifting and not after the addition. Its a lot more hairy. Especially with above mentioned access to 32bit storage and 23bit variables (and unsigned integers), doing it after the addition is simply:

//calculate with a 24-bit mantissa and round down to 23 bit
result = big_mantissa << 1 + (small_mantissa << 1) >> d
if (result << 31) result = (result >> 1) + 1; //round up
else result = (result >> 1); //round down

I believe if you keep a tolerance in error then all answers might be just off by few bits (due to rounding)
@ADG As I mentioned in my edit, I send the answer to early. It will probably take some time until I am satisfied with the format :)
@ADG finished. Had to find a bug in the code. For some reason c == add(a,b) does work with long but not with uint32_t, so I had to use c & add(a,b) instead. Please check if any questions remain open
@FirefoxMetzger Thanks but the output is wrong, e.g., c005de11 +42f4b8e4 should give 42f089f3 (120.269), but your program gives 4376d698 (246.838). Worst, c&add(a,b) marks it as passed. Where is the error?

Stack Exchange Network

Addition of two IEEE754-32bit single precision floating point numbers in C++

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Addition of two IEEE754-32bit single precision floating point numbers in C++

3 Answers 3

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions