I made a C++ program which takes a list of triplets with S.No. from a text file, where the triplet, i.e. a,b,c correspond to a+b=c (addition done using float data type). Now I convert a and b from hex to 32-bit binary numbers and extract sign, mantissa and fractional part and then add them. Finally I convert the sum back to hexadecimal representation and compare to c.
Explanation:
[Note: '0b': binary '0x': hex]
Take the test case 4 be954bb1 c2a2c2e1 c2a3582d
.
- Now
a = 0xbe954bb1 = 0b10111110100101010100101110110001
. - And
b = 0xc2a2c2e1 = 0b11000010101000101100001011100001
. - In IEEE754 first bit is sign which is 1 for both hence both are negative. next eight bits are for exponent i.e.
0b01111101
for a and0b10000101
which correspond to125
and133
in decimal. These exponents have a offset of 127 so actual exponents are125-127=-2
and133-127=6
- Rest bits are mantissa and the actual floating point number is
1.mantissa x 2^exponent
where1.mantissa
is in binary. So our numbers are1.00101010100101110110001 x 2^-2
and1.01000101100001011100001 x 2^6
- For adding we make the exponent same (the larger one, i.e. 6), hence we have
1.00101010100101110110001 x 2^-2 + 1.01000101100001011100001 x 2^6 = 0.0000000100101010100101110110001 x 2^6 + 1.01000101100001011100001 x 2^6 = ...
Code:
- Now I extracted the sign, magnitude and mantissa. Note that the above mantissa will correspond to
0b00101010100101110110001
and0b01000101100001011100001
and that is1395633
and2278113
in decimal. - We are working with integers only so we will not multiply both by same factor, i.e.
2^-23
to convert to.mantissa
rather we add2^23
to both to get9784241
and10666721
(which is0b1mantissa
). - Just forget the actual exponents and divide the mantissa of smaller exponent by
2^difference of exponents
. Hence we get9784241/(2^8)=38220
(round to nearest even) and10666721
- We can now add them to get
10704941 = 0b101000110101100000101101
. Now we have exactly 24 digits in binary, we just remove the2^23
to get2316333 = 0b01000110101100000101101
(extra '0' at start to make 23 digits). - Now actual expoenents didn't matter so we take exponent of sum as
133
(the higher one, since we shifted the lower one's mantissa) - The sign will be negative hence sign bit is
1
- The sum can now be represented as
0b11000010101000110101100000101101=わ0xc2a3582d
which is the sum. I hope you can understand the rest of it, which contains minor other facts about IEEE754-single precision 32 bit floating point decimal.
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <algorithm>
using namespace std;
#define SIGN 0
#define EXP 1
#define FRAC 2
#define MANTISSA_MAX ((2<<22)-1)
#define HIDDEN_MANTISSA (2 << 22)
#define max(a,b) ((a>b)?a:b)
#define min(a,b) ((a>b)?b:a)
#define zero(x) (x[EXP] == 0 && x[FRAC] == 0)
#define denorm(x) (x[EXP] == 0 && x[FRAC] != 0)
#define sign(x) ((x[SIGN]==1)?-1:1)
#define hideMantissaBit(x) (x=x&MANTISSA_MAX);
#define convertToSignificand(x) x[FRAC]=((!denorm(x))?(x[FRAC]+HIDDEN_MANTISSA):(x[FRAC]));x[EXP]+=(denorm(x))?1:0
#define shift(x,y) x[FRAC] = shiftAndRound(x[FRAC],max(y[EXP] - x[EXP], 0));
#define renormalize(x,d) x[FRAC]=(d>0)?(shiftAndRound(x[FRAC],d)):(x[FRAC]<<(-d));
int msb_length(long l) {
int cnt = 0;
while (l >= 1) {
l >>= 1;
cnt++;
}
return cnt;
}
long shiftAndRound(long x, int d) {
d = min(d, msb_length(x));
int result = x >> d;
if (d == 0)
return result;
int r = (x & ((d > 1) ? (2 << (d - 2)) : 1)) >> (d - 1);
if (r == 1) {
int lb = result & 1;
int s = 0;
for (int i = 0; i < d - 1 && s == 0; i++) {
s |= x & 1;
x >>= 1;
}
if (s == 1 || lb == 1)
result += 1;
}
return result;
}
long* get(long x) {
long* arr;
arr = new long[3];
arr[SIGN] = (x & 0x80000000) >> 31;
arr[EXP] = (x & 0x7F800000) >> 23;
arr[FRAC] = (x & 0x7FFFFF);
return arr;
}
long add(long x, long y) {
long* a = get(x);
long* b = get(y);
if (zero(a))
return y;
if (zero(b))
return x;
convertToSignificand(a);
convertToSignificand(b);
if (a[EXP] != b[EXP]) {
shift(a, b)
shift(b, a)
}
long* sum;
sum = new long[3];
sum[FRAC] = sign(a) * a[FRAC] + sign(b) * b[FRAC];
sum[SIGN] = ((sum[FRAC] >= 0) ? 0 : 1);
sum[FRAC] = abs(sum[FRAC]);
if (sum[FRAC] != 0) {
sum[EXP] = max(a[EXP], b[EXP]);
int deltaExp = msb_length(sum[FRAC]) - 24;
sum[EXP] += deltaExp;
renormalize(sum, deltaExp)
if (!denorm(sum))
hideMantissaBit(sum[FRAC])
} else {
sum[EXP] = 0;
}
return (sum[SIGN] << 31) + (sum[EXP] << 23) + sum[FRAC];
}
int main() {
ifstream inputStream;
inputStream.open("src/input.txt");
int r = 0, w = 0;
long i = 0;
while (!inputStream.eof()) {
long a;
long b;
long c;
inputStream >> dec >> i;
inputStream >> hex >> a;
inputStream >> hex >> b;
inputStream >> hex >> c;
if (c == add(a, b)) {
r++;
cout << "Test " << i << " PASSED" << endl;
} else {
w++;
cout << "Test " << i << " FAILED" << endl;
}
}
inputStream.close();
cout << "Total " << r << " " << "PASSED " << w << " FAILED." << endl;
}
Please provide your views on my implementation. If you need any clarification, please ask me. For more details see IEEE754 Converter and Adding Floating Point Numbers.
Sample input:
0 c005de11 42f4b8e4 42f089f3
1 c2bd4228 3fb12f62 c2ba7d6a
2 3d8c556d 3fede65f 3ff6abb6
3 c0e9454c 3ee3960f c0db0beb
4 be954bb1 c2a2c2e1 c2a3582d
5 c3d0f1de c288a91c c3f31c25
6 bf6318c3 bef260b1 bfae248e
7 3f425736 bff456f0 bf932b55
8 bf367049 c0d76ab9 c0ee38c2
9 c29910c4 c0978ddf c2a289a2
10 c09f1c01 c2678d1a c27b709a
11 407d8462 c1874c4d c14f3782
12 c1b32297 c1910602 c222144c
13 409d6975 407a73f4 410d51b8
14 43365262 bf5fa00e 433572c2
15 c1481a5d 4331b215 4325306f
16 c39bd96b c36cec1e c40927bd
17 3efe0c78 c1e306f0 c1df0ebe
18 c2996785 3df31cb1 c2992abe
19 be6877d9 bdf46d42 beb1573d
20 40bb7d87 4054ab2d 4112e98f
21 be439c4e c29a36fb c29a98c9
22 41a4b54f 40921821 41c93b57
23 bfeb3480 bf6c8ae3 c030bcf9
24 c2e65c67 bff8fc3c c2ea4058
25 41e6caec 410dd1eb 4216d9f1
26 3e1132e1 c163979c c16152d0
27 bfc2c685 c00db62a c06f196c
28 bfb1b31f c20c043a c21191d3
29 c14aa2c2 4054b2a8 c1157618
30 c3c2ca3a bead7855 c3c2f598
31 bea8ba44 c2545034 c255a1a9
32 c27a466e 403312fc c26f153e
33 c2b5fa29 c027de73 c2bb391d
34 3e3b1571 3f8f0192 3fa66440
35 41d7e271 43ba9c1a 43c81a41
36 c1fee0c1 c109e53b c221e9af
37 c17067b9 3fff215b c150838e
38 42ac2fc8 c246d968 42118628
39 bf2b141e 40eccb24 40d768a0
...
3 Answers 3
I took the liberty of rewriting your code:
#include <fstream>
#include <iostream>
#include <sstream>
#include <random>
#define min_float 0x00000000
#define max_float 0xffffffff
#define exponent(x) (x << 1) >> 24
#define mantissa(x) (x << 9) >> 9
#define sign(x) x >> 31
uint32_t add(uint32_t x, uint32_t y) {
uint32_t result_mantissa;
uint32_t result_exponent;
uint32_t result_sign;
uint32_t different_sign = sign(x) ^ sign(y); //boolean but lets not do any type casting
// catch NaN
if (!(exponent(x) ^ 0xFF) && mantissa(x)) return x;
if (!(exponent(y) ^ 0xFF) && mantissa(y)) return y;
// catch Inf
if (!(exponent(x) ^ 0xFF) && !(exponent(y) ^ 0xFF)) {
// both are inf
if (different_sign)
// Inf - Inf
return 0x7F800000 + 1; // NaN
else
// both Inf or -Inf
return x;
}
else if (!(exponent(x) ^ 0xFF)) return x;
else if (!(exponent(y) ^ 0xFF)) return y;
// both numbers are non-special
uint32_t exp_difference;
if (different_sign) {
exp_difference = exponent(y) + exponent(x);
}
else {
// no need to account for constant BO
// beware of underflow
if (exponent(x) > exponent(y)) exp_difference = exponent(x) - exponent(y);
else exp_difference = exponent(y) - exponent(x);
}
bool x_bigger_abs;
if (exponent(x) > exponent(y)) x_bigger_abs = true;
else if (exponent(x) < exponent(y)) x_bigger_abs = false;
else if (mantissa(x) > mantissa(y)) x_bigger_abs = true;
else x_bigger_abs = false;
if (!different_sign) {
//both numbers have same sign (this is a sum)
result_sign = sign(x);
if (x_bigger_abs) {
result_mantissa = (mantissa(x) << 1) + (mantissa(y) << 1) >> exp_difference;
result_exponent = exponent(x);
}
else {
result_mantissa = (mantissa(y) << 1) + ((mantissa(x) << 1) >> exp_difference);
result_exponent = exponent(y);
}
if (result_mantissa << 31) result_mantissa = (result_mantissa >> 1) + 1;
else result_mantissa = (result_mantissa >> 1);
}
else {
// this actually is a subtraction
if (x_bigger_abs) {
result_sign = sign(x);
result_exponent = exponent(x);
// subtract and round to 23 bit
// this means making room in our 32bit representation
result_mantissa = (mantissa(x) << 1) - ((mantissa(y) << 1) >> exp_difference );
}
else {
result_sign = sign(y);
result_exponent = exponent(y);
// subtract and round to 23 bit
// this means making room in our 32bit representation
result_mantissa = (mantissa(y) << 1) - ((mantissa(x) << 1) >> exp_difference);
}
if (result_mantissa << 31) result_mantissa = ((result_mantissa >> 1) + 1);
else result_mantissa = (result_mantissa >> 1);
// normalize mantissa
uint32_t temp = result_mantissa << 9;
for (uint32_t count = 0; count < 23; ++count) {
if (!((temp << count) >> 31)) {
result_mantissa <<= ++count; // leading 1, so shift 1 more time
result_exponent -= count;
break;
}
}
}
uint32_t result = result_sign << 31 | result_exponent << 23 | result_mantissa;
return result;
}
int main() {
std::ifstream file;
file.open("data.txt");
if (!file.is_open()) {
std::cout << "Could not open file." << std::endl;
}
int num_passed = 0, num_failed = 0;
for (int num_tests = 0; num_tests < 1000; num_tests++) {
uint32_t number1 = rand() % max_float; // generate a pseudo-random pattern
uint32_t number2 = rand() % max_float; // generate a pseudo-random pattern
float sum = *(float*) &number1 + *(float*) &number2;
// compare resulting binary patterns
if (*(uint32_t*)&sum & add(number1, number2)) {
num_passed++;
}
else {
std::cout << "Expected: " << *(uint32_t*) &sum << " pattern but recieved " << add(number1, number2) << std::endl;
num_failed++;
}
}
std::cout << "RNG test -- compared to float: Total " << num_passed << " " << "PASSED " << num_failed << " FAILED." << std::endl;
num_passed = 0;
num_failed = 0;
uint32_t i;
uint32_t a, b, c;
std::string line;
while (getline(file, line)) {
std::istringstream iss(line);
//std::cout << line << endl;
iss >> std::dec >> i;
iss >> std::hex >> a;
iss >> std::hex >> b;
iss >> std::hex >> c;
iss.clear();
if (c & add(a, b)) {
num_passed++;
}
else {
num_failed++;
}
}
std::cout << "Hex test -- compared to file: Total " << num_passed << " " << "PASSED " << num_failed << " FAILED." << std::endl;
file.close();
}
The thoughts:
general
You are not handling the special cases of the IEEE754, which are Inf and NaN values. You will have to catch them in your add
function and perform the according actions. NaN + X = NaN
, Inf + -Inf = NaN
, ...
main()
For what ever reason your way of reading the file caught me in an infinite loop, so I adjusted the reading.
I changed the long
type variables to uint32_t
. This is significant. C++ does not require any specific data model, but instead gives a set range in which a number has to be. So by using long
you can end up getting a 2C, a 1C or even a uint64 with a sufficient BO, depending on your compiler (and ultimately your CPU). The tests perform accordingly:
Total 14 PASSED 26 FAILED. (using long == 2C on my machine)
Total 38 PASSED 2 FAILED. (using uint32_t)
The biggest problem comes when bitshifting (<<
and >>
) 2C or 1C numbers. With signed numbers the result is rarely usefull, but with unsigned numbers the result is a multiplication with 2^N. Latter is abused when shifting the mantissa to do addition / subtraction.
add
As mentioned by @Quuxplusone you are leaking memory because you never delete sum
, a
, b
.
Same story with signed vs unsigned integers.
The code becomes a lot more readable, if you use preprocessor macros for the individual parts of float
#define exponent(x) (x << 1) >> 24
#define mantissa(x) (x << 9) >> 9
#define sign(x) x >> 31
it conveniently resolves the memory leakage and removes the need for the get
function.
shiftAndRound()
Your rounding is not very clean. You know that the mantissa is at most 23 bit and you have access to 32 bit, just shift to the left by 1 to make room for the 24th bit (on which rounding depends)
...
result = result >> (d - 1)
if (result << 31) result = (result >> 1) + 1;//round up
else result = (result >> 1); //round down
Also I don't know why you round after shifting and not after the addition. Its a lot more hairy. Especially with above mentioned access to 32bit storage and 23bit variables (and unsigned integers), doing it after the addition is simply:
//calculate with a 24-bit mantissa and round down to 23 bit
result = big_mantissa << 1 + (small_mantissa << 1) >> d
if (result << 31) result = (result >> 1) + 1; //round up
else result = (result >> 1); //round down
-
\$\begingroup\$ I believe if you keep a tolerance in error then all answers might be just off by few bits (due to rounding) \$\endgroup\$RE60K– RE60K2016年08月28日 06:36:26 +00:00Commented Aug 28, 2016 at 6:36
-
\$\begingroup\$ @ADG As I mentioned in my edit, I send the answer to early. It will probably take some time until I am satisfied with the format :) \$\endgroup\$FirefoxMetzger– FirefoxMetzger2016年08月28日 06:46:58 +00:00Commented Aug 28, 2016 at 6:46
-
\$\begingroup\$ @ADG finished. Had to find a bug in the code. For some reason
c == add(a,b)
does work withlong
but not withuint32_t
, so I had to usec & add(a,b)
instead. Please check if any questions remain open \$\endgroup\$FirefoxMetzger– FirefoxMetzger2016年08月28日 12:08:39 +00:00Commented Aug 28, 2016 at 12:08 -
1\$\begingroup\$ @FirefoxMetzger Thanks but the output is wrong, e.g., c005de11 +42f4b8e4 should give 42f089f3 (120.269), but your program gives 4376d698 (246.838). Worst, c&add(a,b) marks it as passed. Where is the error? \$\endgroup\$user984260– user9842602017年04月24日 10:26:52 +00:00Commented Apr 24, 2017 at 10:26
Here's why you are going to have problems with memory leaks given the above code:
In function long* get(long x)
one lines 1 and 2 you have:
long* arr;
arr = new long[3];
But you then never do delete arr;
in order to deallocate the memory given by the new
operator. you also do it again in function long add(long x, long y
where you have on lines 13 and 14 the same thing again:
long* sum;
sum = new long[3];
Neither of the memory locations referenced by these pointers will be cleared until the program exits and the OS cleans up memory for you, since in C++ anything you explicitly allocate with new
like: foo* fooptr = new foo[5];
does not get cleaned up when they go out of scope at the end of the function. C++ expects anything explicitly allocated with new
to be explicitly deallocated with delete
once you are done with the memory.
However I cannot see any reason why long arr[3];
isn't being used in these cases since it (for all intents and purposes) does the same thing, but with the added advantage that the memory allocated when using the []
operator will be cleaned up automatically once the function goes out of scope.
Both long* arr = new long[3];
and long arr[3]
can be accessed exactly the same ways, they are functionally identical with the only difference being "who is taking care of the memory allocations?" I recommend letting the compiler take care of memory allocations for you when you are not actively passing the pointer around between different scopes, since effectively all you have done with these two instances is dropped the pointers making the memory inaccessible until program termination.
your case will fail with number 1 = 1.9375 and number2 = 1.9375. in this case mantissa for both numbers is 7864320. output sum is 1.875 which is wrong
-
1\$\begingroup\$ @Graipher Pointing out a bug that was unknown to the author is a valid answer. Furthermore, there is no obligation to suggest how to fix the bug. \$\endgroup\$200_success– 200_success2017年06月30日 10:39:25 +00:00Commented Jun 30, 2017 at 10:39
-
\$\begingroup\$ I am facing the issue with those numbers. Can someone suggest a solution for this?? \$\endgroup\$sathyarokz– sathyarokz2017年06月30日 10:51:34 +00:00Commented Jun 30, 2017 at 10:51
*(double*)&c = *(double*)&a + *(double*)&b;
instead of all this stuff. Your test harness would also be more compelling if it compared the output of your bit-twiddling code to the naive floating-point implementation, instead of requiring the input file to contain the answer in hex. \$\endgroup\$static inline
functions instead. \$\endgroup\$float
and IEEE754 64-bit double precision fordouble
- implementations that do not use these representations are very rare, even among implementations that do not fully conform to the standard for operations. And*(float*)&x
isn't a cast to a float, it is a "type-pun" via a pointer cast (generally in C++ you would usereinterpret_cast
). You could also convert it to float or double in a portable way withldexp[f]
. \$\endgroup\$