I'm trying to write a function to replace the standard std::stod(...)
. I want to parse the following valid inputs:
/[+-]?\d+(?:\.\d+)?(?:e[-+]?\d+)?/g
Here's the RegExr link.
And also the words infinity
, +infinity
, -infinity
, undefined
, +undefined
and -undefined
. Note, if I remember well, infinity
and +infinity
are the same underlying values while undefined
, +undefined
and -undefined
are all different NaN
values.
So I came up with this:
#include <iostream>
#include <cmath>
#include <limits>
#include <cstring>
using namespace std;
class invalid_number { };
static bool is_digit(char c) {
// probably faster than `return c >= '0' && c <= '9';`?
switch (c) {
case '0': case '1': case '2':
case '3': case '4': case '5':
case '6': case '7': case '8':
case '9': return true;
default: return false;
}
}
double safe_pow(double a, double b) {
if (b < 0) return 1.0 / pow(a, abs(b));
else return pow(a, b);
}
// parses /[+-]?\d+(?:\.\d+)?(?:e[-+]?\d+)?/g
static double s2d(char * s) {
bool negative = (s[0] == '-');
bool positive = (s[0] == '+');
if (positive || negative) s++;
// + infinity == infinity
if (!strcmp(s, "infinity0円")) {
if (negative) return - numeric_limits<double>::infinity();
return numeric_limits<double>::infinity();
}
// + undefined != undefined != - undefined
if (!strcmp(s, "undefined0円")) {
if (negative) return - numeric_limits<double>::quiet_NaN();
if (positive) return + numeric_limits<double>::quiet_NaN();
return numeric_limits<double>::quiet_NaN();
}
if (!is_digit(* s)) throw invalid_number();
int64_t x = 0, point = 0, d = 0;
while (is_digit(* s)) {
x = (x * 10) + (* s) - '0';
s++; point++; d++;
}
if ((* s) == '.') {
s++;
if (!is_digit(* s)) throw invalid_number();
while (is_digit(* s)) {
x = x * 10 + (* s) - '0';
s++; d++;
}
} else point = 0;
if (negative) x *= -1;
uint64_t e = 0; bool ne = false;
if ((* s) == 'e') {
s++;
if ((* s) == '-') { ne = true; s++; }
else if ((* s) == '+') s++;
if (!is_digit(* s)) throw invalid_number();
while (is_digit(* s)) {
e = (e * 10) + (* s) - '0';
s++;
}
}
if (!point) return x;
point = d - point;
if (ne) point += e; else point -= e;
return x / safe_pow(10, point);
}
int main(int argc, char * argv[]) {
cout << "test number: ";
string number;
getline(cin, number);
/* test cases:
-537e+4
+20
34.3e-10
2e5
-5e3
-2
-22.27e0
*/
try {
cout << s2d((char *)number.c_str()) << endl;
} catch (invalid_number & inv) {
cout << "invalid number" << endl;
}
return 0;
}
3 Answers 3
Edge cases
Even though the mathematical quotient may be in the double
range, safe_pow(10, point)
may be outside the double
range.
An alternative scales by 5, then 2
// return x / safe_pow(10, point);
x *= pow(5, -point);
x *= pow(2, -point); // see also scalbn()
return x;
Avoid premature optimization
// probably faster than return c >= '0' && c <= '9';
is simply not supported. The switch statement could as well by 10x slower. Look to isdigit()
.
Nothing safer about safe_pow()
Instead:
double safe_pow(double a, double b) {
// if (b < 0) return 1.0 / pow(a, abs(b));
// else return pow(a, b);
return pow(a, b);
}
No overflow protection
x = (x * 10) + (* s) - '0';
may overflow on the 19th/20th digit.
Consider using double
math here instead of int64_t
.
It works now - or does it?
Also test with the string equivalent of DBL_MAX, DBL_MIN, DBL_TRUE_MIN
(or the C++ counterparts). See output of this
"1.7976931348623157e+308" "179769313486231570814527423731704356798070567525844996598917476803157260780028538760589558632766878171540458953514382464234321326889464182768467546703537516986049910576551282076245490090389328944075868508455133942304583236903222948165808559332123348274797826204144723168738177180919299881250404026184124858368.0"
"2.2250738585072014e-308" "0.00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000002225073858507201383090232717332404064219215980462331830553327416887204434813918195854283159012511020564067339731035811005152434161553460108856012385377718821130777993532002330479610147442583636071921565046942503734208375250806650616658158948720491179968591639648500635908770118304874799780887753749949451580451605050915399856582470818645113537935804992115981085766051992433352114352390148795699609591288891602992641511063466313393663477586513029371762047325631781485664350872122828637642044846811407613911477062801689853244110024161447421618567166150540154285084716752901903161322778896729707373123334086988983175067838846926092773977972858659654941091369095406136467568702398678315290680984617210924625396728515625"
"-4.9406564584124654e-324" "-0.000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000004940656458412465441765687928682213723650598026143247644255856825006755072702087518652998363616359923797965646954457177309266567103559397963987747960107818781263007131903114045278458171678489821036887186360569987307230500063874091535649843873124733972731696151400317153853980741262385655911710266585566867681870395603106249319452715914924553293054565444011274801297099995419319894090804165633245247571478690147267801593552386115501348035264934720193790268107107491703332226844753335720832431936092382893458368060106011506169809753078342277318329247904982524730776375927247874656084778203734469699533647017972677717585125660551199131504891101451037862738167250955837389733598993664809941164205702637090279242767544565229087538682506419718265533447265625"
Use const
For greater application.
// double s2d(char * s) {
double s2d(const char * s) {
IEEE 754
OP commented "IEEE 754 standard was the same for c, c++ and javascript." C++ does not specify adherence to IEEE 754. Better to avoid assuming that.
"-0.0"
Code can properly return -0.0 - good. Better to do the x *= -1
after the power-of-10 scaling though to always preserve the sign.
Curious code
Two different negation approaches. I'd expect more similar code.
if (negative) x *= -1;
...
if (ne) point += e; else point -= e;
Minor: Caseless e
I'd expect 'E'
to work as well as 'e'
.
Deeper
String to
double
is a difficult function to get high quality results without extended bit-width math. Conversion fromint64_t
todouble
, prior to the multiplication/division can lose many bits.pow()
is prone to losing about 1-1.5 bits. I'd expected overall precision to be within 2 bits of the best answer (aside from overflow issues) - not too bad for basic code.IEEE Std 754-2019 discusses how many significant decimal digits should be read before effectively assuming the rest are 0 (even if not) as
H
.H >= M + 3
whereM
is 17 fordouble
, the number needed to round-tripdouble
-> text ->double
. To encode all 20 significant decimal digit takesuint67_t
or wider. Thus a quality conversion obliges better thanint64_t
math. For now, I recommend to settle for adouble x
in the significant calculation and its inherent imprecision. Later, when ready for a better conversions, look to extended math.
-
1\$\begingroup\$ @Cristian Looked it up. The
double
significant parsing should handle at least 20 decimal digits to meet IEEE Std 754-2019 standards. \$\endgroup\$chux– chux2021年08月11日 14:03:38 +00:00Commented Aug 11, 2021 at 14:03
Identifiers
Identifiers should be meaningful. What's s
? What's d
? What's e
? Maybe you should name them str
, digits_count
(or at least n_digits
) and exponent
or something like that?
The only exceptions are i
and j
as common loop variables.
One line - one statement
Don't put several statements in a single line.
if( condition )
statement1;
else
statement2;
is much more readable.
using namespace std;
is bad
It's OKish while you're writing a code like this - 50 lines in a single file; but C++ is intended for bigger projects, and mixing names from different namespaces can be at least misleading, at most dangerous.
Use standard exceptions
There's a std::invalid_argument
exception type in <stdexcept>
. You can use it or derive your own class from it. Just invalid_number
is unclear.
is_digit
is replacing default std::isdigit
Is this intended to be so?
The same for safe_pow
and std::pow
- it works with negative values fine, your problem is somewhere else.
At this point I've run your code - and it still fails on the first example, -537e+4 should be -5370000, not -537.
-
\$\begingroup\$ oops, I missed that. Good thing you found it, so I can fix it again :) \$\endgroup\$Cristian– Cristian2021年08月11日 13:34:23 +00:00Commented Aug 11, 2021 at 13:34
-
1\$\begingroup\$ @Cristian OK to post your own answer with that functional fix (and other improvements) - or post another review . (Tip: If posting a new review, give it a week for this one to get fully vetted) \$\endgroup\$chux– chux2021年08月11日 14:09:12 +00:00Commented Aug 11, 2021 at 14:09
Don’t write using namespace std;
.
You can, however, in a CPP file (not H file) or inside a function put individual using std::string;
etc. (See SF.7.)
if (!strcmp(s, "undefined0円")) {
First, a lexical string literal already contains a terminating 0円
. If you write "x"
you get a const char unnamed_variable [2] = {'x','0円'};
Second, don't use strcmp
in C++. It is confusing and difficult to use, and very inefficient compared to comparisons that know the length up front. (To be fair, sometimes I can get the compiler to optimize it when experimenting with Compiler Explorer, sometimes it will not.) Use string_view
literals so you can write:
using namespace std::literals; // at the top of your CPP file
⋮
if (s == "undefined"sv) {
static double s2d(char * s) {
It should raise a 🚩 big 🚩red 🚩 flag 🚩 that you must cast the argument to char*
any time you want to use it, including the result of string::c_str()
and a lexical string literal. Where else does a "normal" string come from? If your function is incompatible with these, there's something fundimentally wrong with it.
The type should be const char*
to fix this problem.
But, you are not writing in C here, but C++. You want to pass in a std::string
(as seen from your test code), as well as pass in a lexical string literal (like "1.23"
). To handle both of these and more efficiently, write your functions to take a std::string_view
(by value).
With a string_view
you can still use subscript notation, or get a pointer to the underlying characters, as well as getting most of the API of the string
class.
Why is s2d
defined as static
? That means it will only be seen by this translation unit, which doesn't matter in a simple program that only has one CPP file, but in real code I expect this would be in a library you actually expect to call from somewhere.
Meanwhile, the helper function safe_pow
is not, so you're not just making everything static
.
const
Besides the missing const
on your argument type, you don't use it at all anywhere in your post. You should use it a lot. For example,
const bool negative = (s[0] == '-');
const bool positive = (s[0] == '+');
Generally, make a variable const
if you can.
As for declarations in general,
int64_t x = 0, point = 0, d = 0;
It's good to see that you didn't put this at the top of the function, but where you were ready to start using them. However, it is strongly the idiom in C++ to only make one declaration per statement, and they should be on separate lines.
Test Code
You can write your test code to not require you to type each example every time you run it! And, you can automate the testing that it got the right answer. Take a look at a testing framework such as Catch2.
Rather than prompting for input, turn your comment into useful data:
constexpr string_view testdata[]= {
"-537e+4",
"+20",
"34.3e-10",
"2e5",
"-5e3",
"-2",
"-22.27e0" };
for (auto s : testdata) {
const double result= s2d(s);
// compare result against library's parser or other data
cout << result << '\n'; // until it's automated
}
0円
in your strings? Note that in the comparisons that doesn't do anything since the compare stops at the first0円
. \$\endgroup\$"123."
to be acceptable, yet it appears/[+-]?\d+(?:\.\d+)?(?:e[-+]?\d+)?/g
does not allow. \$\endgroup\$undefined
is not certainly different from both+undefined
and-undefined
. \$\endgroup\$