I'm learning C and one of the questions I've been asked is converting a string to an integer. The code I've written supports converting from a string in any base up to 16/hex to an integer. There is no over/underflow checking and it does not support lowercase hex.
int ASCIIToInteger(char *x, int base)
{
//Each element is the ASCII for it's index.
int ASCIICodes[] = { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F' }; _Bool negative = 0;
int count = 0, output = 0;
if (x[0] == '-')
{
negative = 1;
count = 1;
}
else if (x[0] == '+')
{
count = 1;
}
do
{
for (int i = 0; i < base; i++)
{
if (ASCIICodes[i] == (int)x[count])
{
output = output * base + i;
break;
}
}
count++;
} while (x[count] != 0);
if (negative == 1)
{
return ~output + 1;
}
return output;
}
Can I please get some advice regarding any possible issues or "rookie mistakes" I may have made?
6 Answers 6
Avoid mysterious bit-shifting
return ~output + 1;
Is the same as:
return - (output + 1) + 1;
And as:
return - output;
But the last one is way more obvious than the first one.
Do not hide variable declarations
int ASCIICodes[] = { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F' }; _Bool negative = 0;
If I look at the above line, the least thing in my mind is that a variable is defined after that list.
If a user has a small screen or a big font size he will be very puzzled when he will see you use negative
without seeing its definition.
Just use a newline, they are free:
int ASCIICodes[] = { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F' };
_Bool negative = 0;
Make variables as local as possible
I used x++
to express that I will skip the first character:
if (x[0] == '-')
{
negative = true;
x++;
}
else if (x[0] == '+')
{
x++;
}
This allows me to make count
local.
Using for
loops when practical
The outer of your loops is a do
while
, but looking around I found all three components of a for
loop, just shuffled into the code.
A for
loop will give it more organization:
for (int count = 0; x[count] != 0; count++)
{
for (int i = 0; i < base; i++)
{
if (ASCIICodes[i] == x[count])
{
output = output * base + i;
break;
}
}
}
Simplify the initial conditional
Assigning a boolean to negative
directly and using a ||
conditional looks shorter and simpler:
We go from:
_Bool negative = 0;
int count = 0, output = 0;
if (x[0] == '-')
{
negative = 1;
count = 1;
}
else if (x[0] == '+')
{
count = 1;
}
to:
bool negative = x[0] == '-';
if (x[0] == '-' || x[0] == '+')
{
x++;
}
While preserving identical functionality.
it's
vs its
... How to choose between it's and its?
This is actually really easy, do you mean "it is" or not?
Use it's
when you mean it is
.
// Each element is the ASCII for it's index.
Here you should write its
because you mean possession, intended as a property of the element.
Saying that:
// Each element is the ASCII for it is index.
Makes no sense.
So the correct comment will be:
// Each element is the ASCII for its index.
-
17\$\begingroup\$ I love the grammar police in this answer. I get sick of reading that mistake over and over in forums, texts, IMs, etc. Now I know it's in one less program, thanks to you! \$\endgroup\$Patrick Roberts– Patrick Roberts2016年01月04日 04:40:05 +00:00Commented Jan 4, 2016 at 4:40
-
3\$\begingroup\$ the other reason to avoid "mysterious bit shifting" is that (AFAICR) the use of two's complement representation of negative numbers is not actually mandated by the standard. \$\endgroup\$Alnitak– Alnitak2016年01月04日 15:14:45 +00:00Commented Jan 4, 2016 at 15:14
-
\$\begingroup\$ @Alnitak kind of... non-two-complement machines are pretty arcane -> stackoverflow.com/questions/12276957/… \$\endgroup\$Caridorc– Caridorc2016年01月04日 15:16:37 +00:00Commented Jan 4, 2016 at 15:16
-
3\$\begingroup\$ Arcane, but legal ;) In any event it's better just to let the compile do the right thing and just express negation using the operator intended for that purpose. \$\endgroup\$Alnitak– Alnitak2016年01月04日 15:18:45 +00:00Commented Jan 4, 2016 at 15:18
-
\$\begingroup\$ "it's" is also a contraction of "it has" -- it's been fun \$\endgroup\$David Conrad– David Conrad2016年01月04日 20:55:57 +00:00Commented Jan 4, 2016 at 20:55
Write one statement per line
Avoid multiple statements on the same line, for example instead of:
int ASCIICodes[] = { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F' }; _Bool negative = 0;
Break that up to multiple lines:
int ASCIICodes[] = { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F' };
_Bool negative = 0;
This is because it's easier to read from top to bottom. If there are statements at the end of lines, that can be very distracting.
Eliminate pointless cast
No need to cast to int
here:
if (ASCIICodes[i] == (int)x[count])
Simplify ASCIICodes
A simpler way to define ASCIICodes
:
char ASCIICodes[] = "0123456789ABCDEF";
Simplify using a ternary operator
Instead of this:
if (negative) { return ~output + 1; } return output;
You could simplify using the ternary operator:
return negative ? -output : output;
Or perhaps instead of a bool negative
, you could use an int sign
:
int sign = 1;
int count = 0, output = 0;
if (x[0] == '-')
{
sign = -1;
count = 1;
}
// ....
return sign * output;
Use bool
from stdbool
Although using _Bool
is ok,
it would be a good idea to #include <stdbool.h>
,
which defines bool
(expands to _Bool
) and true
and false
,
for example:
#include <stdbool.h>
int ASCIIToInteger(char *x, int base)
{
// ....
bool negative = false;
// ...
if (x[0] == '-')
{
negative = true;
count = 1;
}
// ...
if (negative)
{
return ~output + 1;
}
You don't check for non-number characters. The sign is in the while-loop silently ignored, as any other symbol. What is the number of "HELLO" in hex? 14.
I'm surprised that no-one has pointed out the O(m * base) complexity caused by the inner loop that performs a linear scan over the (unnecessary) ASCIICodes
array.
This version of the outer loop removes the inner loop, supports lower cased characters, and also breaks the loop as soon as any illegal character is seen.
while ((n = x[count++])) {
int ch = toupper((int)n);
if ((ch >= '0' && ch <= '9') || (ch >= 'A' && ch <= 'F')) {
int digit = ch - '0';
if (digit > 9) {
digit = 10 + ch - 'A'; // convert 'A' to 10, etc
}
if (digit < base) {
output = output * base + digit;
} else {
break;
}
} else {
break;
}
}
Accept both cases by folding input through toupper()
.
#include <ctype.h>
//...
if (ASCIICodes[i] == toupper(x[count]))
But you can also define the set as a string and use strchr
to do the searching.
-
\$\begingroup\$ Note:
(int)
intoupper((int)x[count])
serves no purpose. \$\endgroup\$chux– chux2016年01月08日 10:48:15 +00:00Commented Jan 8, 2016 at 10:48 -
\$\begingroup\$ True, but if
toupper
is a macro, this will suppress the occasional warning about indexing with a char type. \$\endgroup\$luser droog– luser droog2016年03月05日 23:30:30 +00:00Commented Mar 5, 2016 at 23:30 -
\$\begingroup\$ The warning about "indexing with a char type" should not be suppressed by casting to
int
when usingtoupper()
. That leads to UB whenchar
is signed. Better to usetoupper((unsigned char)x[count])
. C only definestoupper()
for values in theunsigned char
range andEOF
. \$\endgroup\$chux– chux2016年03月06日 00:04:33 +00:00Commented Mar 6, 2016 at 0:04 -
\$\begingroup\$ Good point. Thanks. I have removed the cast from my answer, and will be more cautious about this in future. \$\endgroup\$luser droog– luser droog2016年03月06日 02:52:47 +00:00Commented Mar 6, 2016 at 2:52
My version:
#include <ctype.h>
int atoib(const char *x, int base)
{
unsigned s = 0, v = 0;
unsigned char c = *x++;
if (c == '-')
s++, c = *x++;
else if (c == '+')
c = *x++;
if (base <= 10) {
if (base >= 2) {
unsigned M = '0' + base - 1;
while (c >= '0' && c <= M) {
v = v * base + c - '0';
c = *x++;
}
}
} else {
if (base <= ('Z' - 'A' + 11)) {
unsigned M = 'A' + base - 11;
for (;;) {
if (c >= '0' && c <= '9') {
v = v * base + c - '0';
} else {
c = toupper(c);
if (c >= 'A' && c <= M)
v = v * base + c - 'A' + 10;
else
break;
}
c = *x++;
}
}
}
return (v ^ -s) + s;
}
Key points:
- allow base values greater than 16;
- check for valid base values (from 2 up to 36);
- allow lowercase and uppercase digits (from a|A up to z|Z);
- bail out on digits invalid for the given base and any non-digit;
- avoid terminating null checks, the digit validation does it for free;
- use arithmetic instead of lookup table to find the digit value (by the way, if you want to use a lookup table, then define it as static, otherwise it will be created on the stack and initialised on every invocation of your function);
- the funny return expression saves one branch for two's-complement integers, if you are afraid of ever running on a non-two's-complement platform or just strive for clarity, then do
return s ? -v : v;
instead;
The proper handling of cases 2 and 4 still need some improvement, perhaps with errno
assignment or by changing the interface to something akin to strtol
or, if C++
is allowed, by throwing an exception. Also it might make sense to handle integer overflow errors.
The test code:
#include <stdio.h>
int
main()
{
printf("%d %d %d %d %d %d %d %d\n",
atoib("10", 0), atoib("10", 1), atoib("10", 2), atoib("10", 9),
atoib("10", 10), atoib("10", 11), atoib("10", 36), atoib("10", 37));
printf("%d %d %d %d %d %d %d %d\n",
atoib("8", 9), atoib("9", 9), atoib("9", 10), atoib("A", 10),
atoib("A", 11), atoib("B", 11), atoib("Z", 36), atoib("[", 36));
printf("%d %d\n", atoib("-2", 10), atoib("+2", 10));
return 0;
}
strtol
(2) function? \$\endgroup\$