Implementation of itoa which allocates the string

Question 1

This is my implementation of itoa() (Integer to Alpha), which converts an integer to a string. Memory management and optimization is important. The caller is not responsible for allocation of the string. The caller is responsible for freeing the string, as my implementation does not allow to call free(). The function handles everything internally, except freeing. The function allocates memory twice, for the string to be returned and the int array where the digits are stored.

I am concerned about memory leaks with the allocation of the int array which stores the digits. Is there any way to optimize / secure the memory allocation so I do not get leaks?

The code:

#include <stdlib.h>
static int digitcount(int n)
{
 int i = 0;
 if (n == 0)
 return (1);
 while (n != 0)
 {
 n /= 10;
 i++;
 }
 return (i);
}
static int *revarray(int *a, int size)
{
 int tmp = 0;
 int i = 0;
 while (i < size / 2)
 {
 tmp = a[i];
 a[i] = a[size - i - 1];
 a[size - i - 1] = tmp;
 i++;
 }
 return (a);
}
static int *getdigits(int n, int size)
{
 int *a;
 int r = 0;
 int i = 0;
 a = malloc(size * sizeof(int));
 if (a == NULL)
 return (NULL);
 if (n == 0)
 a[i] = 0;
 while (n != 0)
 {
 r = n % 10;
 if (r < 0)
 r *= -1;
 a[i] = r;
 i++;
 n /= 10;
 }
 a = revarray(a, size);
 return (a);
}
char *itoa(int n)
{
 char *nbr;
 int *digits;
 int digits_size;
 int i;
 int j;
 digits_size = digitcount(n);
 digits = getdigits(n, digits_size);
 if (n < 0)
 digits_size++;
 nbr = malloc(digits_size + 1 * sizeof(char));
 if (nbr == NULL || digits == NULL)
 return (NULL);
 i = 0;
 j = 0;
 if (n < 0)
 nbr[i++] = '-';
 while (i < digits_size)
 {
 nbr[i] = digits[j] + '0';
 i++;
 j++;
 }
 nbr[i] = 0;
 return (nbr);
}

EDIT: For practice and understanding I wanted to implement my own itoa() instead of just calling sprintf() or using external libraries.

Question 2

Memory management and optimization is important.

I will avoid any stylistic remark and focus on this particular aspect of your query.

Memory

First of all, the first rule of optimization is to avoid allocating memory. In general, memory allocation is slow-ish, and memory de-allocation even slower. In particular, most malloc implementation will perform "consolidation" work during the call to free which can take a few milliseconds. (The average case isn't that slow, but the worst case can be bad).

As a result, not allocating is generally much faster than allocating. And if you do insist on allocating, at the very least you should only allocate once for the final result, and not allocate for any intermediate value.

Hence the ideal API:

char* my_itoa(char* buffer, int i);

Number of digits

Your method is naive, and therefore slow.

In general, division and modulo are the slowest arithmetic operations on a CPU. For x64, a 64-bits division is 30 to 90 cycles when an addition is 1 cycle and a multiplication 3 cycles. 32-bits divisions fare a bit better, but not by much.

Of course, the compiler will perform strength reduction, and replace the division/modulo by a constant with a series of additions/multiplications/shifts... but even then it'll still be a slow-ish operation. (Especially for signed int)

Luckily for you, there was a recent spree on Internet on the fastest way to compute the number of digits.

I have myself benchmarked this method, and its performance is truly good. It takes 3 CPU cycles (just like a single multiplication) regardless of the value of the integer. I would encourage simply using it:

/* Note: GCC builtin, other compilers should have an equivalent method
 to count the number of leading zeroes */
int int_log2(uint32_t x) { return 31 - __builtin_clz(x|1); }
int numberDigits(uint32_t x)
{
 static uint64_t const TABLE[] = {
 4294967296, 8589934582, 8589934582, 8589934582,
 12884901788, 12884901788, 12884901788, 17179868184,
 17179868184, 17179868184, 21474826480, 21474826480,
 21474826480, 21474826480, 25769703776, 25769703776,
 25769703776, 30063771072, 30063771072, 30063771072,
 34349738368, 34349738368, 34349738368, 34349738368,
 38554705664, 38554705664, 38554705664, 41949672960,
 41949672960, 41949672960, 42949672960, 42949672960
 };
 return (int)((x + TABLE[int_log2(x)]) >> 32);
}

Reversal of Array

This is mutually exclusive with computing the number of digits.

In order to obtain the digits, you have to use % 10 (or % 100 to get them 2 by 2), and this means getting the lowest digits first.

There are 3 strategies to deal with this:

Write the digits in reverse order in a temporary array:
- Either reverse them in place and bulk-copy them to the destination.
- Or copy them one at a time to the destination, reversing as you go.
Write the digits (as ASCII char, not int) in order in a large enough array, then bulk-copy them to the destination.
Compute the number of digits, then write the digits in order in the destination, starting from the accurate final place.

In general, the reversal strategy is the worse performance wise. Element-by-element manipulation loses out quickly as the number of elements go up.

The difference between the last two hinges on the performance of bulk-copying vs computing the number of digits, and the latter should win if well-implemented.

At the moment, you are using the worst possible combination: computing number of digits (slowly), discarding the result, and using the reversal strategy.

That's a lot of unnecessary work; performance suffers accordingly.

Performance Goal

I was, in fact, just this week benchmarking our number-formatting routines, after implementing a variant of the faster "number of digits" computation.

On my computer (4.6 GHz), formatting a 64 bits unsigned integer with the algorithm I use takes:

6 ns for 1.
14 ns for UINT64_MAX

Note: for reference sprintf takes 74 ns to do the same on my computer.

There's nothing magic there:

Compute number of digits: 1 ns/1.2 ns.
Loop until 0, % 100 at a time, writing in-place in the destination buffer.
Handle last digit, if any.

The destination buffer is user-provided, of course, to avoid any memory allocation.

I would argue this is the order of magnitude you should aim for, if performance matters to you.

Question 3

How do I print an integer in Assembly Level Programming without printf from the c library? has some links at the bottom to blogs about fast itoa. (e.g. doing x /= 100 steps and splitting that up, to gain some instruction-level parallelism.) pvk.ca/Blog/2017/12/22/… / tia.mat.br/posts/2014/06/23/integer_to_string_conversion.html

Question 4

Storing directly into the destination does nicely avoid store-forwarding stalls which costs latency. Note that out-of-order exec will overlap the num_digits latency with the multiply latency. bsr -> load -> add -> shr is at least 10 cycle latency on an Intel CPU (3 + 5 + 1 + 1), assuming L1d hit for the table. So not 1 ns, but performance of tiny sequences of instructions must be considered as latency and front-end / back-end throughput costs separately, not one-dimensional ns or cycles to add up. (I'd believe that it costs ~1 ns tput vs. a func that formats from the end of a buffer.)

Question 5

You might actually use an API like char *atoi_end(char *buff_end, unsigned x); to avoid needing to count at all (like in the asm Q&A I linked). glibc actually does that internally. But that's probably only useful if you're going to feed it to a syscall (like in the linked asm question) or a function like puts. If you were going to copy without truncating to a fixed length, you'd want to just do a normal atoi to write the digits to their final location.

Question 6

I was trying to figure out if there was some use for a rougher inexact count, but if you keep the mallocing API it probably only makes sense if the string does start at the front of the buffer. So you need an exact count, either via a tmp buffer (length for free) and memcpy or ahead of allocation. Perhaps getting some multiply latency into the pipeline before the call to malloc could help some with latency. Using the calculated exact length as a loop termination condition could allow the loop-exit branch prediction to be checked earlier than the x /= 10 dep chain..

Question 7

"fair a bit better" -> "fare a bit better" :)

Question 8

Counting the digits doubles the amount of work, and the division/modulo is indeed very hard work for the CPU.

You know the maximum size of the resulting string, because it would be the largest or most negative integer you can be passed. So use a local array as a buffer, and then copy the result to allocated memory.

You have a separate function to reverse the string. Instead, fill your buffer from the right, decrementing the pointer as you write each digit.

Instead of checking for a negative remainder each time through the loop, check for negative input once at the beginning. If it's negative, remember that and change to positive. Then when done, finish off with a - character if your is-negative flag is set.

Question 9

"If it's negative, remember that and change to positive." is a problem when n == INT_MIN, as -INT_MIN is undefined behavior with 2's compliment.

Question 10

@chux-ReinstateMonica I didn't want to digress with details. Either notice that as a special singular case, or have the base function use unsigned or wider types (I've seen C standard library in MSVC where the only implementation was the longest unsigned type and other types were thin wrappers that called it. That's slower than a modern template version, BTW)

Question 11

INT_MIN does not need to be a special case or resort to other types or suffer performance. Simply fold positive values to negative ones and use the n%10 result of [-9...0] to form the digit character.

Question 12

Until very recent standards of C++, the behavior of % for negative numbers was implementation-defined. I don't know if that's still true in C compilers. Historically, we avoided the negative value when implementing itoa etc. In any case, my comment was on the conditional branch needed to form the absolute value being in the middle of the loop.

Question 13

@PeterCordes You may enjoy this collection of such asserts: Detecting unicorn and dinosaur compilers - only cost 0ドル.05.

Question 14

Besides lots of optimizations other people have suggested, you have two memory leaks.

The first is that, just before the final return in itoa(), you need to write free(digits);

The second is in the error handling. (A common place for errors BTW.) In particular, if getdigits returns successfully, but the malloc fails, you leak digits. The fix: Test digits before the malloc and just return. Test nbr after the malloc, and free(digits); and return if it failed.

Question 15

I think this answer addresses the question better than the others, which are more focused on the implementation rather than the memory leaks the poster is concerned about.

Question 16

You don't need an array of int digits in the first place, though, and if you do, it can be a local in automatic storage because the max digit-length of an int is small. Automatic storage is fast and immune to leaks. A local char buf[sizeof(int)*CHAR_BIT / 3] or something would make even more sense (the number of bits in an int / 3 is a loose bound on the max length). (edit: chux's answer already suggests that.)

Question 17

@PeterCordes Well, I did say "lots of optimizations". I wrote my answer just to spell out the leaks in the current code. Personally, for this functionality, I would make a local buffer as a char array, flll it from the right, and return strdup() of my final pointer. (I also wouldn't write this prototype&semantics.) As for the array size, I recommend adding 2, one for the trailing nul, and one for the leading minus sign. Especially since on a 16 bit int platform, your array would be 5 characters and there could be 5 digits, and similarly for a 32 bit int platform, 10 and 10.

Question 18

You have used static for the helper functions, to give them internal linkage. That's good.

 return (1);

The parentheses don't add any value here. Just write return 1; like everyone else.

We can avoid the need to special-case n == 0 by always performing at least one division; use do/while in place of while:

int i = 0;
do {
 n /= 10;
 ++i;
} while (n);

The similar test in getdigits() can be eliminated in the same way.

int i = 0;
while (i < size / 2)
{
 ...
 i++;
}

We have a construct that expresses that better:

for (int i = 0; i < size / 2; ++i) {
 ...
}

The variable tmp doesn't need to be at function scope; it can be declared within the loop:

static int *revarray(int *a, int size)
{
 for (int i = 0; i < size / 2; ++i) {
 int tmp = a[i];
 a[i] = a[size - i - 1];
 a[size - i - 1] = tmp;
 }
 return a;
}

 a = malloc(size * sizeof(int));

A convention that's often helpful is to write the size in terms of the pointer being assigned to. That helps readers immediately see that it's consistent, without having to look for the declaration of a:

 a = malloc(sizeof *a * size);

I've written the sizeof expression first - it makes no difference here, but can avert overflow when there's a multiplication for the number of elements: size_t * int * int is safer than int * int * size_t.

 a = revarray(a, size);

That's a pointless assignment, since revarray() returns the unchanged value of a. We should probably make revarray() return void.

digits = getdigits(n, digits_size);
nbr = malloc(digits_size + 1 * sizeof(char));
if (nbr == NULL || digits == NULL)
 return (NULL);

Oops - what if digits is a valid pointer, but nbr is null? Then we leak digits. We need a call to free() in there (remember that free(NULL) is defined, so it can be an unconditional free(digits)).

Multiplying by sizeof (char) is a no-op: since sizeof yields values in units of char, it follows that sizeof (char) must be 1.

 while (i < digits_size)

Again, this should be a for loop:

 for (int i = 0, j = 0; i < digits_size; ++i, ++j)

We forgot to free(digits) before we return.

As other reviewers have pointed out, the algorithm is inefficient, with the extra measuring and copying, and the reversal of digit order. I'll not repeat those observations here.

Question 19

Leak

digits is never free'd.

If either one of digits or nbr return NULL, the other allocation is lost.

digits = getdigits(n, digits_size);
...
nbr = malloc(digits_size + 1 * sizeof(char));
if (nbr == NULL || digits == NULL)
 return (NULL);

Concept error

sizeof(char) is 1 so not much of a issue, yet sizeof(char) should multiply by the sum.

// malloc(digits_size + 1 * sizeof(char));
malloc((digits_size + 1) * sizeof(char));

Simplification

Use a do { } while();

//if (n == 0) return (1);
//while (n != 0) {
// n /= 10;
// i++;
//}
do {
 n /= 10;
 i++;
} while (n != 0);

Allocate to the refenced object

Rather than code the type, use the referenced object. Easy to code right, review and maintain.

// a = malloc(size * sizeof(int));
a = malloc(size * sizeof *a);

Good use of static functions

Use a local buffer for the reverse

No need to free it then either. INT_STRING_SIZE details

#define INT_STRING_SIZE ((sizeof(int)*CHAR_BIT - 1)*28/93 + 3)
// int *digits;
int digits[INT_STRING_SIZE];

digits is now local storage and will be reclaimed on function exit. It's OK if it is more than needed for a given n, just as long as it can handle the worst case n == INT_MIN.

Simplified approach

Is there any way to optimize / secure the memory allocation so I do not get leaks?

Yes. Form the string in a local buffer of worst case size (not an allocated one). Fill from right to left so no need for a later reverse. Now armed with the known size, allocate and strcpy().

// Pseudo code
char *itoa_alloc(int n) {
 int n_origin = n;
 char buf[INT_STRING_SIZE];
 char *p = Address of end of buf
 *p = null character;
 do 
 p--;
 *p = |n%10| + '0'; // Various ways to avoid repeated sign test exist
 n /= 10;
 } while (n != 0);
 if (n_origin < 0)
 *(--p) = '-';
 char *s = malloc(end_of buf - p + 1);
 if (s == NULL) return NULL // Handle_error
 strcpy(s, p);
 // **OR**
 memcpy(s, p, end_of buf - p + 1);
 return s;
}

With C2x, looks like strdup() may arrive and so could replace

 char *s = malloc(end_of buf - p + 1);
 if (s == NULL) return NULL // Handle_error
 strcpy(s, p);
 return s;
 // **OR** C2x or select implementations today
 return strdup(p);

Sample related code

Question 20

You have the length so you can use memcpy instead of strcpy. If you're going to str functions to favour simplicity over speed, you could just use strdup (although that POSIX, not ISO C).

Question 21

You can do unsigned absolute value safely in C with unsigned n = n<0 ? 0U - n : n;, allowing you to hoist that work out of the loop and use more-efficient unsigned division. The implicit (unsigned)n conversion is guaranteed to use modulo-reduction to the range of unsigned without signed-overflow, correctly handling INT_MIN even on 2's complement systems.

Question 22

@PeterCordes Note posted code, although C-like, is // Pseudo code. I did not want to give OP a cut/paste solution here, but a guide, to allow OP the specific coding experience. Agree about bring the test out of the loop and memcpy() and that is what is done in linked sample with return memcpy(dest, p, size_used);

Question 23

The standard functions in string.h are not using malloc and free almost all. Therefore, the interface of itoa might be changed like this:

/* NG - because itoa returns a new memory */
char *itoa(int n) { ... }
/* OK - because itoa uses an existing memory */
void itoa(char *s, int n) { ... }

Question 24

Completely rewriting OP's code does not constitute a code review.

Question 25

The suggested two parameter form is bad. What if somebody writes: char s[2]; itoa(s,1000); ? You need to either provide a length parameter or enforce a length, perhaps with a structure, like: struct itoa_param { char buf[14]; }; extern void itoa(struct itoa_param *s, int n); It also might be good to return the start of the string. This can simplify things be making the returned string not required to be the start of the buffer.

Question 26

Not so bad, I think. It's the same as char s[2]; sprintf(s, "%d", n);. Almost C standard functions are unsafe. For real safe programming, we know about the potential unsafeness of C language. In this case, we talk about memory leaks. Array boundary problem is also important but not now's topic.

Question 27

@HaruoWakakusa: That's why there is snprintf.

Matthieu M. 6,4291 gold badge22 silver badges34 bronze badges · Accepted Answer · 2021-07-14 14:05:10Z

Memory management and optimization is important.

I will avoid any stylistic remark and focus on this particular aspect of your query.

Memory

First of all, the first rule of optimization is to avoid allocating memory. In general, memory allocation is slow-ish, and memory de-allocation even slower. In particular, most malloc implementation will perform "consolidation" work during the call to free which can take a few milliseconds. (The average case isn't that slow, but the worst case can be bad).

As a result, not allocating is generally much faster than allocating. And if you do insist on allocating, at the very least you should only allocate once for the final result, and not allocate for any intermediate value.

Hence the ideal API:

char* my_itoa(char* buffer, int i);

Number of digits

Your method is naive, and therefore slow.

In general, division and modulo are the slowest arithmetic operations on a CPU. For x64, a 64-bits division is 30 to 90 cycles when an addition is 1 cycle and a multiplication 3 cycles. 32-bits divisions fare a bit better, but not by much.

Of course, the compiler will perform strength reduction, and replace the division/modulo by a constant with a series of additions/multiplications/shifts... but even then it'll still be a slow-ish operation. (Especially for signed int)

Luckily for you, there was a recent spree on Internet on the fastest way to compute the number of digits.

I have myself benchmarked this method, and its performance is truly good. It takes 3 CPU cycles (just like a single multiplication) regardless of the value of the integer. I would encourage simply using it:

/* Note: GCC builtin, other compilers should have an equivalent method
 to count the number of leading zeroes */
int int_log2(uint32_t x) { return 31 - __builtin_clz(x|1); }
int numberDigits(uint32_t x)
{
 static uint64_t const TABLE[] = {
 4294967296, 8589934582, 8589934582, 8589934582,
 12884901788, 12884901788, 12884901788, 17179868184,
 17179868184, 17179868184, 21474826480, 21474826480,
 21474826480, 21474826480, 25769703776, 25769703776,
 25769703776, 30063771072, 30063771072, 30063771072,
 34349738368, 34349738368, 34349738368, 34349738368,
 38554705664, 38554705664, 38554705664, 41949672960,
 41949672960, 41949672960, 42949672960, 42949672960
 };
 return (int)((x + TABLE[int_log2(x)]) >> 32);
}

Reversal of Array

This is mutually exclusive with computing the number of digits.

In order to obtain the digits, you have to use % 10 (or % 100 to get them 2 by 2), and this means getting the lowest digits first.

There are 3 strategies to deal with this:

Write the digits in reverse order in a temporary array:
- Either reverse them in place and bulk-copy them to the destination.
- Or copy them one at a time to the destination, reversing as you go.
Write the digits (as ASCII char, not int) in order in a large enough array, then bulk-copy them to the destination.
Compute the number of digits, then write the digits in order in the destination, starting from the accurate final place.

In general, the reversal strategy is the worse performance wise. Element-by-element manipulation loses out quickly as the number of elements go up.

The difference between the last two hinges on the performance of bulk-copying vs computing the number of digits, and the latter should win if well-implemented.

At the moment, you are using the worst possible combination: computing number of digits (slowly), discarding the result, and using the reversal strategy.

That's a lot of unnecessary work; performance suffers accordingly.

Performance Goal

I was, in fact, just this week benchmarking our number-formatting routines, after implementing a variant of the faster "number of digits" computation.

On my computer (4.6 GHz), formatting a 64 bits unsigned integer with the algorithm I use takes:

6 ns for 1.
14 ns for UINT64_MAX

Note: for reference sprintf takes 74 ns to do the same on my computer.

There's nothing magic there:

Compute number of digits: 1 ns/1.2 ns.
Loop until 0, % 100 at a time, writing in-place in the destination buffer.
Handle last digit, if any.

The destination buffer is user-provided, of course, to avoid any memory allocation.

I would argue this is the order of magnitude you should aim for, if performance matters to you.

How do I print an integer in Assembly Level Programming without printf from the c library? has some links at the bottom to blogs about fast itoa. (e.g. doing x /= 100 steps and splitting that up, to gain some instruction-level parallelism.) pvk.ca/Blog/2017/12/22/… / tia.mat.br/posts/2014/06/23/integer_to_string_conversion.html
Storing directly into the destination does nicely avoid store-forwarding stalls which costs latency. Note that out-of-order exec will overlap the num_digits latency with the multiply latency. bsr -> load -> add -> shr is at least 10 cycle latency on an Intel CPU (3 + 5 + 1 + 1), assuming L1d hit for the table. So not 1 ns, but performance of tiny sequences of instructions must be considered as latency and front-end / back-end throughput costs separately, not one-dimensional ns or cycles to add up. (I'd believe that it costs ~1 ns tput vs. a func that formats from the end of a buffer.)
You might actually use an API like char *atoi_end(char *buff_end, unsigned x); to avoid needing to count at all (like in the asm Q&A I linked). glibc actually does that internally. But that's probably only useful if you're going to feed it to a syscall (like in the linked asm question) or a function like puts. If you were going to copy without truncating to a fixed length, you'd want to just do a normal atoi to write the digits to their final location.
I was trying to figure out if there was some use for a rougher inexact count, but if you keep the mallocing API it probably only makes sense if the string does start at the front of the buffer. So you need an exact count, either via a tmp buffer (length for free) and memcpy or ahead of allocation. Perhaps getting some multiply latency into the pipeline before the call to malloc could help some with latency. Using the calculated exact length as a loop termination condition could allow the loop-exit branch prediction to be checked earlier than the x /= 10 dep chain..

Stack Exchange Network

Implementation of itoa which allocates the string

6 Answers 6

Memory

Number of digits

Reversal of Array

Performance Goal

You must log in to answer this question.

Linked

Hot Network Questions

Implementation of itoa which allocates the string

6 Answers 6

Memory

Number of digits

Reversal of Array

Performance Goal

You must log in to answer this question.

Linked

Related

Hot Network Questions