C code to convert hexadecimal string to base64

Question 1

I've written a program to convert a hex-encoded string to base64, and here is my code. My main concerns are:

Optimizations - Is my code sufficiently optimized and if any more optimization is possible.
Memory leaks - Is there any scope for memory leaks.
Readability

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>
#include <string.h>
#include <math.h>
/*
Macros for : 
1. Returning integer value of the equivalent hex digit.
2. Returning base64 digit for a decimal value.
3. Checks if two numbers are equal or not.
*/
#define return_hex(a) ((a >= 97) ? (a - 97 + 10) : (a - 48))
#define return_base64(a) (((a < 26) ? 1 : 0) == 1 ? (a + 65) : (((a < 52) ? 1 : 0) == 1) ? (a + 97 - 26) : (((a < 62) ? 1 : 0) == 1) ? (a + 48 - 52) : (((a == 62) ? 1 : 0) == 1) ? 43 : 47)
#define check_not_equal(a, b) ((a == b) ? false : true)
/*
Helper function to convert an 8 bit integer array to a base64 string.
params : unsigned 8 bit integer array, array size
returns : string
*/
char *convert_helper(uint8_t *hex_array, int n) {
 uint8_t mask1 = 252, mask2 = 240, mask3 = 192, c;
 char *base64 = malloc(sizeof(char) * n);
 int size = 0, temp = 0;
 for(int i = 0; i < n; i += 3, temp++) {
 if(check_not_equal(size++, n)) {
 c = (mask1 & hex_array[i]) >> 2;
 base64[temp * 4] = return_base64(c);
 }
 else
 break;
 if(check_not_equal(size++, n)) {
 c = (((255 - mask1) & hex_array[i]) << 4) | ((mask2 & hex_array[i + 1]) >> 4);
 base64[temp * 4 + 1] = return_base64(c);
 }
 else
 break;
 if(check_not_equal(size++, n)) {
 c = (((255 - mask2) & hex_array[i + 1]) << 2) | ((mask3 & hex_array[i + 2]) >> 6);
 base64[temp * 4 + 2] = return_base64(c);
 }
 else
 break;
 if(check_not_equal(size++, n)) {
 c = (255 - mask3) & hex_array[i + 2];
 base64[temp * 4 + 3] = return_base64(c);
 }
 else
 break;
 } 
 return base64;
}
/*
Function to convert a hexadecimal-encoded string to a base64-encoded string. 
params : hexadecimal-encoded string, string size
returns : string
*/
char *convert_to_base64(char *hex_string, int n) {
 int size = ceil((n * 4) / 6.0);
 uint8_t *hex_binary_array = malloc(sizeof(char) * size);
 for(int i = 0; i < n; i += 2) {
 if(i == n - 1) {
 hex_binary_array[i / 2] = return_hex(hex_string[i]) << 4;
 }
 else {
 uint8_t first = return_hex(hex_string[i]), second = return_hex(hex_string[i + 1]);
 hex_binary_array[i / 2] = first << 4 | second;
 }
 }
 
 char *base64_string = convert_helper(hex_binary_array, size);
 free(hex_binary_array);
 return base64_string;
}
int main()
{
 /*
 //Test for returning base64 values. 
 
 printf("%c", return_base64(4)); //returns E 
 printf("%c", return_base64(19)); //returns T
 printf("%c", return_base64(37)); //returns l
 printf("%c", return_base64(49)); //returns x
 printf("%c", return_base64(54)); //returns 2
 printf("%c", return_base64(59)); //returns 7
 printf("%c", return_base64(62)); //returns +
 printf("%c", return_base64(63)); //returns /
 */
 
 
 char *string;
 /*
 //Test 1
 string = convert_to_base64("deadbeef", 8);
 printf("%s\n", string);
 printf("%d", strcmp("3q2+7w", string));
 free(string);
 
 
 //Test 2
 string = convert_to_base64("49276d206b696c6c696e6720796f757220627261696e206c696b65206120706f69736f6e6f7573206d757368726f6f6d", 96);
 printf("%s\n", string);
 printf("%d", strcmp("SSdtIGtpbGxpbmcgeW91ciBicmFpbiBsaWtlIGEgcG9pc29ub3VzIG11c2hyb29t", string));
 free(string);
 */
 
 string = convert_to_base64("fd347a67d1e", 11);
 printf("%s\n", string);
 return 0;
}

Question 2

Don't hardcode (for readability) ASCII constants. Use 'a' not 97. Perhaps email me to [email protected] for more

Question 3

Headers

I don't see any need for <string.h>. We can drop <stdbool.h> if we rewrite check_not_equal:

#define check_not_equal(a, b) ((a) != (b))

Personally, I wouldn't use a macro for that - it just obfuscates the code.

<stdio.h> is used only by the tests. We may compile them separately if we want to use the function in other programs.

And we can eliminate the need for <math.h> using a well-known technique for integer ceiling divide:

int ceil_div(int n, int m) {
 /* assumes n+m-1 doesn't overflow */
 /* A bulletproof version is only slightly more complex. */
 return (n+m-1) / m;
}

That's likely more efficient than using floating-point, too.

Macros

Don't use macros where functions would work equally well or better. In particular, we have lots of macros here that expand their arguments multiple times, which can cause surprise when they have side-effects.

Always use all-caps names for macros, as a warning to the reader that they are not functions.

Using functions is the right thing here - any decent compiler will inline for you when that produces better code.

And what's the use of the ternary expression like (((a < 52) ? 1 : 0) == 1) ? That's simply (a < 52).

It's simpler, clearer and more efficient just to index a lookup table for each of those:

/* a in "0123456789abcdef" */
static int from_hex(char c)
{
 switch (c) {
 default: return c - '0'; /* digits are guaranteed consecutive */*/
 case 'a': case 'A': return 10;
 case 'b': case 'B': return 11;
 case 'c': case 'C': return 12;
 case 'd': case 'D': return 13;
 case 'e': case 'E': return 14;
 case 'f': case 'F': return 15;
 }
}
/* 0 <= a <= 63 */
static char to_base64(int a)
{
 return "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789/+"[a];
}

Or just declare the strings, and let the code do the indexing.

Don't use fixed-size types unnecessarily

uint8_t isn't available everywhere (notably, whenever CHAR_BIT is greater than 8). But we don't exactly 8 bits - we can use uint8_fast_t or simply unsigned char for our purposes.

Character set assumptions

The code seems to assume that the output characters are ASCII-encoded (which includes Latin-1 and UTF-8, but not EBCDIC). Worse than that, it uses decimal code-point numbers (e.g. 65) where character constants ('a') would be easier to understand.

Don't assume allocation succeeds

malloc() can return a null pointer. Dereferencing a null pointer is undefined behaviour, so just report that and exit early; the usual way is to return a null pointer to mean "could not allocate":

char *base64 = malloc(n);
if (!base64) { return base64; }

Note that multiplying by sizeof (char) is pointless, since that's 1 by definition (sizeof yields the number of char needed to represent a type).

Control flow

This structure can be simplified:

 if (size++ != n) {
 c = (mask1 & hex_array[i]) >> 2;
 base64[temp * 4] = return_base64(c);
 } else
 break;

It's good practice to use braces {...} even for single-statement blocks, and especially to be consistent both sides of else. But we don't need else here if we invert the test:

 if (size++ == n) {
 break;
 }
 c = (mask1 & hex_array[i]) >> 2;
 base64[temp * 4] = return_base64(c);

Test program

It's good that we have tests. We can make it better by having self-checking tests instead of needing to inspect program output.

We could write that with a couple of helper functions:

#include <stdio.h>
static unsigned test_base64_char(const char *file, int line,
 int n, char expected)
{
 char actual = return_base64(n);
 if (actual == expected) {
 return 0;
 }
 fprintf(stderr, "%s:%d: return_base64(%d) returned '%c'; expected '%c'\n",
 file, line, n, actual, expected);
 return 1;
}
static unsigned test_base64_str(const char *file, int line,
 const char *input, const char *expected)
{
 char *actual = convert_to_base64(input);
 if (!actual) {
 fprintf(stderr, "%s:%d: return_base64(\"%s\") returned null; expected \"%s\"\n",
 file, line, input, expected);
 return 1;
 }
 if (!strcmp(actual, expected)) {
 free(actual);
 return 0;
 }
 fprintf(stderr, "%s:%d: return_base64(\"%s\") returned \"%s\"; expected \"%s\"\n",
 file, line, input, actual, expected);
 free(actual);
 return 1;
}

Some macros to add the source location of the tests:

#define TEST_BASE64_CHAR(n, expected) \
 failures += test_base64_char(__FILE__, __LINE__, n, expected)
#define TEST_BASE64_STR(n, expected) \
 failures += test_base64_str(__FILE__, __LINE__, n, expected)

And a main() function to run them:

int main(void)
{
 unsigned failures = 0;
 TEST_BASE64_CHAR(4, 'E');
 TEST_BASE64_CHAR(19, 'T');
 TEST_BASE64_CHAR(37, 'l');
 TEST_BASE64_CHAR(49, 'x');
 TEST_BASE64_CHAR(54, '2');
 TEST_BASE64_CHAR(59, '7');
 TEST_BASE64_CHAR(62, '+');
 TEST_BASE64_CHAR(63, '/');
 TEST_BASE64_STR("deadbeef", "3q2+7w");
 TEST_BASE64_STR("49276d206b696c6c696e6720796f757220627261696e206c696b65206120706f69736f6e6f7573206d757368726f6f6d",
 "SSdtIGtpbGxpbmcgeW91ciBicmFpbiBsaWtlIGEgcG9pc29ub3VzIG11c2hyb29t");
 printf("Tests completed with %u failures.\n", failures);
 return failures > 0;
}

Temporary array

The code works in two phases, reflected in the two functions, with a helper array allocated to hold the binary representation of the hex string. But that helper is unnecessary - we should be able to work directly from the hex chars, converting each group of three into two base-64 chars.

/*
 Function to convert a hexadecimal-encoded string to a base64-encoded string.
 params : hexadecimal-encoded string
 returns : string - or a null pointer if allocation failed
The return value must be deallocated using free()
*/
char *convert_to_base64(const char *hex_string)
{
 size_t input_len = strlen(hex_string);
 size_t output_len = (input_len * 2 + 2) / 3 + 1;
 char *const out_buf = malloc(output_len);
 if (!out_buf) {
 return NULL;
 }
 char *out = out_buf;
 while (hex_string[0] && hex_string[1] && hex_string[2]) {
 /* convert three hex digits to two base-64 chars */
 int digit[3] = {
 from_hex(hex_string[0]),
 from_hex(hex_string[1]),
 from_hex(hex_string[2])
 };
 *out++ = to_base64((digit[0] << 2) + (digit[1] >> 2));
 *out++ = to_base64(((digit[1] & 3) << 4) + (digit[2]));
 hex_string += 3;
 }
 /* Now deal with leftover chars */
 if (hex_string[0] && hex_string[1]) {
 /* convert three hex digits to two base-64 chars */
 int digit[2] = {
 from_hex(hex_string[0]),
 from_hex(hex_string[1])
 };
 *out++ = to_base64((digit[0] << 2) + (digit[1] >> 2));
 *out++ = to_base64(((digit[1] & 3) << 4));
 } else if (hex_string[0]) {
 /* convert three hex digits to two base-64 chars */
 int digit = from_hex(hex_string[0]);
 *out++ = to_base64(digit << 2);
 }
 *out = '0円';
 return out_buf;
}

Output format ambiguity

There's no indication of how many blank characters were encoded, unlike standard base-64 formats, which use one or two trailing = to indicate this. That will break round-trip of inputs such as "1d", which is indistinguishable from "1d0" after encoding (both produce `"HQ" as output).

Modified code

Header file hex_to_base64.h:

#ifndef HEX_TO_BASE64_H
#define HEX_TO_BASE64_H
/*
 Function to convert a hexadecimal-encoded string to a base64-encoded string.
 params : hexadecimal-encoded string
 returns : string - or a null pointer if allocation failed
 The return value must be deallocated using free().
*/
char *hex_to_base64(const char *hex_string);
#endif

Implementation:

#include "hex_to_base64.h"
#include <stdlib.h>
#include <string.h>
/* a in "0123456789abcdef" */
static int from_hex(char c)
{
 switch (c) {
 default: return c - '0'; /* digits are guaranteed consecutive */
 case 'a': case 'A': return 10;
 case 'b': case 'B': return 11;
 case 'c': case 'C': return 12;
 case 'd': case 'D': return 13;
 case 'e': case 'E': return 14;
 case 'f': case 'F': return 15;
 }
}
char *hex_to_base64(const char *hex_string)
{
 static const char base64[] =
 "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
 size_t input_len = strlen(hex_string);
 size_t output_len = (input_len * 2 + 2) / 3 /* base64 chars */
 + 2 - (input_len - 1) % 3 /* padding '=' or '==' */
 + 1; /* string terminator */
 char *const out_buf = malloc(output_len);
 if (!out_buf) {
 return NULL;
 }
 char *out = out_buf;
 while (hex_string[0] && hex_string[1] && hex_string[2]) {
 /* convert three hex digits to two base-64 chars */
 int digit[3] = {
 from_hex(hex_string[0]),
 from_hex(hex_string[1]),
 from_hex(hex_string[2])
 };
 *out++ = base64[(digit[0] << 2) + (digit[1] >> 2)];
 *out++ = base64[((digit[1] & 3) << 4) + (digit[2])];
 hex_string += 3;
 }
 /* Now deal with leftover chars */
 if (hex_string[0] && hex_string[1]) {
 /* convert two hex digits to two base-64 chars */
 int digit[2] = {
 from_hex(hex_string[0]),
 from_hex(hex_string[1])
 };
 *out++ = base64[(digit[0] << 2) + (digit[1] >> 2)];
 *out++ = base64[((digit[1] & 3) << 4)];
 *out++ = '=';
 } else if (hex_string[0]) {
 /* convert one hex digit to one base-64 char */
 int digit = from_hex(hex_string[0]);
 *out++ = base64[digit << 2];
 *out++ = '=';
 *out++ = '=';
 }
 *out = '0円';
 return out_buf;
}

Tests:


#include <stdio.h>
static unsigned test_base64_str(const char *file, int line,
 const char *input, const char *expected)
{
 char *actual = hex_to_base64(input);
 if (!actual) {
 fprintf(stderr, "%s:%d: to_base64(\"%s\") returned null; expected \"%s\"\n",
 file, line, input, expected);
 return 1;
 }
 if (!strcmp(actual, expected)) {
 free(actual);
 return 0;
 }
 fprintf(stderr, "%s:%d: to_base64(\"%s\") returned \"%s\"; expected \"%s\"\n",
 file, line, input, actual, expected);
 free(actual);
 return 1;
}
#define TEST_BASE64_STR(n, expected) \
 failures += test_base64_str(__FILE__, __LINE__, n, expected)
int main(void)
{
 unsigned failures = 0;
 TEST_BASE64_STR("1d", "HQ=");
 TEST_BASE64_STR("1d0", "HQ");
 TEST_BASE64_STR("deadbeef", "3q2+7w=");
 TEST_BASE64_STR("49276d206b696c6c696e6720796f757220627261696e206c696b65206120706f69736f6e6f7573206d757368726f6f6d",
 "SSdtIGtpbGxpbmcgeW91ciBicmFpbiBsaWtlIGEgcG9pc29ub3VzIG11c2hyb29t");
 printf("Tests completed with %u failures.\n", failures);
 return failures > 0;
}

Use standard library for hex → integer conversion

We could use standard-library sscanf() with a format string of %3x to read the groups of three hex digits instead of coding our own from_hex(). It can also tell us (using %n) how many characters were converted. That gives a simpler implementation:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char *hex_to_base64(const char *hex_string)
{
 static const char base64[] =
 "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
 size_t input_len = strlen(hex_string);
 size_t output_len = (input_len * 2 + 2) / 3 /* base64 chars */
 + 2 - (input_len - 1) % 3 /* padding '=' or '==' */
 + 1; /* string terminator */
 char *const out_buf = malloc(output_len);
 if (!out_buf) {
 return out_buf;
 }
 unsigned int digits;
 int d_len;
 char *out = out_buf;
 while (*hex_string) {
 if (sscanf(hex_string, "%3x%n", &digits, &d_len) != 1) {
 /* parse error */
 free(out_buf);
 return NULL;
 }
 switch (d_len) {
 case 3:
 *out++ = base64[digits >> 6];
 *out++ = base64[digits & 0x3f];
 break;
 case 2:
 digits <<= 4;
 *out++ = base64[digits >> 6];
 *out++ = base64[digits & 0x3f];
 *out++ = '=';
 break;
 case 1:
 *out++ = base64[digits];
 *out++ = '=';
 *out++ = '=';
 }
 hex_string += d_len;
 }
 *out++ = '0円';
 assert(out == out_buf + output_len);
 return out_buf;
}

Question 4

Thanks a lot! Your pointers are super helpful, and I am getting to learn some really good practices. Thanks again!

Question 5

I was going to review this, but you've covered exactly everything I'd have said. Good review!

Question 6

@Cole: Digits 0-9 are guaranteed consecutive: paragraph 5.2.1.3 says, "In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous."

Question 7

@Toby My bad. EBCDIC has the letters be non sequential, not numbers. That’s what I get for commenting when I’m tired...

Question 8

Syntax words used in functions

I myself don't like the use of a syntax word in a function name like return_base64(). This is due to the fact that some older code highlighting editors may change color of the part return. Despite the fact that it is in a function definition, many older editors are unable to distinguish it.

When declaring function names avoid using any syntax words. In a function declaration, there is just one syntax word which could be used, and that is switch. Maybe also do_****.

When the programmer is really tired after working long hours he/she will only see the return and thinks it is a return statement.

Standard names in headers used in functions

And don't use function names that are too close to function names in most of the standard header files. Especially the functions that begins with str___.

It is very easy to look at the code and think that a user defined str___ function as part of <string.h>. It will be tough for the future developer to implement functions as a result of this.

Efficient expressions

Besides that, the line:

uint8_t *hex_binary_array = malloc(sizeof(char) * size);

Should be written like this:

uint8_t *hex_binary_array = malloc(size * sizeof(char));

This is due to the fact that if there is a more extensive expression inside the parenthesis, older compilers might process it more effectively.

Question 9

Allocations are actually best written with the sizeof expression first, to avoid integer overflow. For example, given unsigned rows, unsigned cols;, then int *array = malloc(rows * cols * sizeof *array); is computed with int rows * cols before promoting to size_t. Whereas int *array = malloc(sizeof *array * rows * cols); all the multiplications are size_t, reducing overflow risk.

Question 10

I agree that using return in a function name is poor - my view is that it tells us something we already know, so just acts as clutter. And that's with a capable editor that doesn't screw up syntax highlighting. A lot of authors seem to like a get_ prefix, which is similarly useless.

Question 11

Technically, the standard only reserves names that begin with str followed by a letter, but I agree that str_ is confusingly close, particularly for those (a majority, probably) that haven't memorised the exact rules!

Question 12

@Sven: More efficient how? Do you have an example of inefficient asm that an old compiler makes for some ISA? As Toby points out, there's a potential correctness difference; is your potential speedup only from assuming an int * int won't overflow an int result? (i.e. letting an x86-64 compiler use imul ecx, edx to produce a 32-bit result zero-extended to 64, instead of having to potentially sign-extend int function args to 64-bit with movsxd, and then imul rdx, rax or something? Some older CPUs have slower 64-bit imul, and it may cost extra movsxd instructions. But can overflow

Question 13

So TL:DR: are you recommending trading correctness with large sizes for a tiny bit of code-size and speed? If so, say so. If not, please explain this surprising claim.

Toby Speight 89k14 gold badges104 silver badges327 bronze badges · Answer 1 · 2022-01-22 12:06:43Z

Headers

I don't see any need for <string.h>. We can drop <stdbool.h> if we rewrite check_not_equal:

#define check_not_equal(a, b) ((a) != (b))

Personally, I wouldn't use a macro for that - it just obfuscates the code.

<stdio.h> is used only by the tests. We may compile them separately if we want to use the function in other programs.

And we can eliminate the need for <math.h> using a well-known technique for integer ceiling divide:

int ceil_div(int n, int m) {
 /* assumes n+m-1 doesn't overflow */
 /* A bulletproof version is only slightly more complex. */
 return (n+m-1) / m;
}

That's likely more efficient than using floating-point, too.

Macros

Don't use macros where functions would work equally well or better. In particular, we have lots of macros here that expand their arguments multiple times, which can cause surprise when they have side-effects.

Always use all-caps names for macros, as a warning to the reader that they are not functions.

Using functions is the right thing here - any decent compiler will inline for you when that produces better code.

And what's the use of the ternary expression like (((a < 52) ? 1 : 0) == 1) ? That's simply (a < 52).

It's simpler, clearer and more efficient just to index a lookup table for each of those:

/* a in "0123456789abcdef" */
static int from_hex(char c)
{
 switch (c) {
 default: return c - '0'; /* digits are guaranteed consecutive */*/
 case 'a': case 'A': return 10;
 case 'b': case 'B': return 11;
 case 'c': case 'C': return 12;
 case 'd': case 'D': return 13;
 case 'e': case 'E': return 14;
 case 'f': case 'F': return 15;
 }
}
/* 0 <= a <= 63 */
static char to_base64(int a)
{
 return "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789/+"[a];
}

Or just declare the strings, and let the code do the indexing.

Don't use fixed-size types unnecessarily

uint8_t isn't available everywhere (notably, whenever CHAR_BIT is greater than 8). But we don't exactly 8 bits - we can use uint8_fast_t or simply unsigned char for our purposes.

Character set assumptions

The code seems to assume that the output characters are ASCII-encoded (which includes Latin-1 and UTF-8, but not EBCDIC). Worse than that, it uses decimal code-point numbers (e.g. 65) where character constants ('a') would be easier to understand.

Don't assume allocation succeeds

malloc() can return a null pointer. Dereferencing a null pointer is undefined behaviour, so just report that and exit early; the usual way is to return a null pointer to mean "could not allocate":

char *base64 = malloc(n);
if (!base64) { return base64; }

Note that multiplying by sizeof (char) is pointless, since that's 1 by definition (sizeof yields the number of char needed to represent a type).

Control flow

This structure can be simplified:

 if (size++ != n) {
 c = (mask1 & hex_array[i]) >> 2;
 base64[temp * 4] = return_base64(c);
 } else
 break;

It's good practice to use braces {...} even for single-statement blocks, and especially to be consistent both sides of else. But we don't need else here if we invert the test:

 if (size++ == n) {
 break;
 }
 c = (mask1 & hex_array[i]) >> 2;
 base64[temp * 4] = return_base64(c);

Test program

It's good that we have tests. We can make it better by having self-checking tests instead of needing to inspect program output.

We could write that with a couple of helper functions:

#include <stdio.h>
static unsigned test_base64_char(const char *file, int line,
 int n, char expected)
{
 char actual = return_base64(n);
 if (actual == expected) {
 return 0;
 }
 fprintf(stderr, "%s:%d: return_base64(%d) returned '%c'; expected '%c'\n",
 file, line, n, actual, expected);
 return 1;
}
static unsigned test_base64_str(const char *file, int line,
 const char *input, const char *expected)
{
 char *actual = convert_to_base64(input);
 if (!actual) {
 fprintf(stderr, "%s:%d: return_base64(\"%s\") returned null; expected \"%s\"\n",
 file, line, input, expected);
 return 1;
 }
 if (!strcmp(actual, expected)) {
 free(actual);
 return 0;
 }
 fprintf(stderr, "%s:%d: return_base64(\"%s\") returned \"%s\"; expected \"%s\"\n",
 file, line, input, actual, expected);
 free(actual);
 return 1;
}

Some macros to add the source location of the tests:

#define TEST_BASE64_CHAR(n, expected) \
 failures += test_base64_char(__FILE__, __LINE__, n, expected)
#define TEST_BASE64_STR(n, expected) \
 failures += test_base64_str(__FILE__, __LINE__, n, expected)

And a main() function to run them:

int main(void)
{
 unsigned failures = 0;
 TEST_BASE64_CHAR(4, 'E');
 TEST_BASE64_CHAR(19, 'T');
 TEST_BASE64_CHAR(37, 'l');
 TEST_BASE64_CHAR(49, 'x');
 TEST_BASE64_CHAR(54, '2');
 TEST_BASE64_CHAR(59, '7');
 TEST_BASE64_CHAR(62, '+');
 TEST_BASE64_CHAR(63, '/');
 TEST_BASE64_STR("deadbeef", "3q2+7w");
 TEST_BASE64_STR("49276d206b696c6c696e6720796f757220627261696e206c696b65206120706f69736f6e6f7573206d757368726f6f6d",
 "SSdtIGtpbGxpbmcgeW91ciBicmFpbiBsaWtlIGEgcG9pc29ub3VzIG11c2hyb29t");
 printf("Tests completed with %u failures.\n", failures);
 return failures > 0;
}

Temporary array

The code works in two phases, reflected in the two functions, with a helper array allocated to hold the binary representation of the hex string. But that helper is unnecessary - we should be able to work directly from the hex chars, converting each group of three into two base-64 chars.

/*
 Function to convert a hexadecimal-encoded string to a base64-encoded string.
 params : hexadecimal-encoded string
 returns : string - or a null pointer if allocation failed
The return value must be deallocated using free()
*/
char *convert_to_base64(const char *hex_string)
{
 size_t input_len = strlen(hex_string);
 size_t output_len = (input_len * 2 + 2) / 3 + 1;
 char *const out_buf = malloc(output_len);
 if (!out_buf) {
 return NULL;
 }
 char *out = out_buf;
 while (hex_string[0] && hex_string[1] && hex_string[2]) {
 /* convert three hex digits to two base-64 chars */
 int digit[3] = {
 from_hex(hex_string[0]),
 from_hex(hex_string[1]),
 from_hex(hex_string[2])
 };
 *out++ = to_base64((digit[0] << 2) + (digit[1] >> 2));
 *out++ = to_base64(((digit[1] & 3) << 4) + (digit[2]));
 hex_string += 3;
 }
 /* Now deal with leftover chars */
 if (hex_string[0] && hex_string[1]) {
 /* convert three hex digits to two base-64 chars */
 int digit[2] = {
 from_hex(hex_string[0]),
 from_hex(hex_string[1])
 };
 *out++ = to_base64((digit[0] << 2) + (digit[1] >> 2));
 *out++ = to_base64(((digit[1] & 3) << 4));
 } else if (hex_string[0]) {
 /* convert three hex digits to two base-64 chars */
 int digit = from_hex(hex_string[0]);
 *out++ = to_base64(digit << 2);
 }
 *out = '0円';
 return out_buf;
}

Output format ambiguity

There's no indication of how many blank characters were encoded, unlike standard base-64 formats, which use one or two trailing = to indicate this. That will break round-trip of inputs such as "1d", which is indistinguishable from "1d0" after encoding (both produce `"HQ" as output).

Modified code

Header file hex_to_base64.h:

#ifndef HEX_TO_BASE64_H
#define HEX_TO_BASE64_H
/*
 Function to convert a hexadecimal-encoded string to a base64-encoded string.
 params : hexadecimal-encoded string
 returns : string - or a null pointer if allocation failed
 The return value must be deallocated using free().
*/
char *hex_to_base64(const char *hex_string);
#endif

Implementation:

#include "hex_to_base64.h"
#include <stdlib.h>
#include <string.h>
/* a in "0123456789abcdef" */
static int from_hex(char c)
{
 switch (c) {
 default: return c - '0'; /* digits are guaranteed consecutive */
 case 'a': case 'A': return 10;
 case 'b': case 'B': return 11;
 case 'c': case 'C': return 12;
 case 'd': case 'D': return 13;
 case 'e': case 'E': return 14;
 case 'f': case 'F': return 15;
 }
}
char *hex_to_base64(const char *hex_string)
{
 static const char base64[] =
 "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
 size_t input_len = strlen(hex_string);
 size_t output_len = (input_len * 2 + 2) / 3 /* base64 chars */
 + 2 - (input_len - 1) % 3 /* padding '=' or '==' */
 + 1; /* string terminator */
 char *const out_buf = malloc(output_len);
 if (!out_buf) {
 return NULL;
 }
 char *out = out_buf;
 while (hex_string[0] && hex_string[1] && hex_string[2]) {
 /* convert three hex digits to two base-64 chars */
 int digit[3] = {
 from_hex(hex_string[0]),
 from_hex(hex_string[1]),
 from_hex(hex_string[2])
 };
 *out++ = base64[(digit[0] << 2) + (digit[1] >> 2)];
 *out++ = base64[((digit[1] & 3) << 4) + (digit[2])];
 hex_string += 3;
 }
 /* Now deal with leftover chars */
 if (hex_string[0] && hex_string[1]) {
 /* convert two hex digits to two base-64 chars */
 int digit[2] = {
 from_hex(hex_string[0]),
 from_hex(hex_string[1])
 };
 *out++ = base64[(digit[0] << 2) + (digit[1] >> 2)];
 *out++ = base64[((digit[1] & 3) << 4)];
 *out++ = '=';
 } else if (hex_string[0]) {
 /* convert one hex digit to one base-64 char */
 int digit = from_hex(hex_string[0]);
 *out++ = base64[digit << 2];
 *out++ = '=';
 *out++ = '=';
 }
 *out = '0円';
 return out_buf;
}

Tests:


#include <stdio.h>
static unsigned test_base64_str(const char *file, int line,
 const char *input, const char *expected)
{
 char *actual = hex_to_base64(input);
 if (!actual) {
 fprintf(stderr, "%s:%d: to_base64(\"%s\") returned null; expected \"%s\"\n",
 file, line, input, expected);
 return 1;
 }
 if (!strcmp(actual, expected)) {
 free(actual);
 return 0;
 }
 fprintf(stderr, "%s:%d: to_base64(\"%s\") returned \"%s\"; expected \"%s\"\n",
 file, line, input, actual, expected);
 free(actual);
 return 1;
}
#define TEST_BASE64_STR(n, expected) \
 failures += test_base64_str(__FILE__, __LINE__, n, expected)
int main(void)
{
 unsigned failures = 0;
 TEST_BASE64_STR("1d", "HQ=");
 TEST_BASE64_STR("1d0", "HQ");
 TEST_BASE64_STR("deadbeef", "3q2+7w=");
 TEST_BASE64_STR("49276d206b696c6c696e6720796f757220627261696e206c696b65206120706f69736f6e6f7573206d757368726f6f6d",
 "SSdtIGtpbGxpbmcgeW91ciBicmFpbiBsaWtlIGEgcG9pc29ub3VzIG11c2hyb29t");
 printf("Tests completed with %u failures.\n", failures);
 return failures > 0;
}

Use standard library for hex → integer conversion

We could use standard-library sscanf() with a format string of %3x to read the groups of three hex digits instead of coding our own from_hex(). It can also tell us (using %n) how many characters were converted. That gives a simpler implementation:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char *hex_to_base64(const char *hex_string)
{
 static const char base64[] =
 "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
 size_t input_len = strlen(hex_string);
 size_t output_len = (input_len * 2 + 2) / 3 /* base64 chars */
 + 2 - (input_len - 1) % 3 /* padding '=' or '==' */
 + 1; /* string terminator */
 char *const out_buf = malloc(output_len);
 if (!out_buf) {
 return out_buf;
 }
 unsigned int digits;
 int d_len;
 char *out = out_buf;
 while (*hex_string) {
 if (sscanf(hex_string, "%3x%n", &digits, &d_len) != 1) {
 /* parse error */
 free(out_buf);
 return NULL;
 }
 switch (d_len) {
 case 3:
 *out++ = base64[digits >> 6];
 *out++ = base64[digits & 0x3f];
 break;
 case 2:
 digits <<= 4;
 *out++ = base64[digits >> 6];
 *out++ = base64[digits & 0x3f];
 *out++ = '=';
 break;
 case 1:
 *out++ = base64[digits];
 *out++ = '=';
 *out++ = '=';
 }
 hex_string += d_len;
 }
 *out++ = '0円';
 assert(out == out_buf + output_len);
 return out_buf;
}

Thanks a lot! Your pointers are super helpful, and I am getting to learn some really good practices. Thanks again!
I was going to review this, but you've covered exactly everything I'd have said. Good review!
@Cole: Digits 0-9 are guaranteed consecutive: paragraph 5.2.1.3 says, "In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous."
@Toby My bad. EBCDIC has the letters be non sequential, not numbers. That’s what I get for commenting when I’m tired...

Sven B 111 bronze badge · Answer 2 · 2022-01-23 05:50:10Z

Syntax words used in functions

I myself don't like the use of a syntax word in a function name like return_base64(). This is due to the fact that some older code highlighting editors may change color of the part return. Despite the fact that it is in a function definition, many older editors are unable to distinguish it.

When declaring function names avoid using any syntax words. In a function declaration, there is just one syntax word which could be used, and that is switch. Maybe also do_****.

When the programmer is really tired after working long hours he/she will only see the return and thinks it is a return statement.

Standard names in headers used in functions

And don't use function names that are too close to function names in most of the standard header files. Especially the functions that begins with str___.

It is very easy to look at the code and think that a user defined str___ function as part of <string.h>. It will be tough for the future developer to implement functions as a result of this.

Efficient expressions

Besides that, the line:

uint8_t *hex_binary_array = malloc(sizeof(char) * size);

Should be written like this:

uint8_t *hex_binary_array = malloc(size * sizeof(char));

This is due to the fact that if there is a more extensive expression inside the parenthesis, older compilers might process it more effectively.

Allocations are actually best written with the sizeof expression first, to avoid integer overflow. For example, given unsigned rows, unsigned cols;, then int *array = malloc(rows * cols * sizeof *array); is computed with int rows * cols before promoting to size_t. Whereas int *array = malloc(sizeof *array * rows * cols); all the multiplications are size_t, reducing overflow risk.
I agree that using return in a function name is poor - my view is that it tells us something we already know, so just acts as clutter. And that's with a capable editor that doesn't screw up syntax highlighting. A lot of authors seem to like a get_ prefix, which is similarly useless.
Technically, the standard only reserves names that begin with str followed by a letter, but I agree that str_ is confusingly close, particularly for those (a majority, probably) that haven't memorised the exact rules!
@Sven: More efficient how? Do you have an example of inefficient asm that an old compiler makes for some ISA? As Toby points out, there's a potential correctness difference; is your potential speedup only from assuming an int * int won't overflow an int result? (i.e. letting an x86-64 compiler use imul ecx, edx to produce a 32-bit result zero-extended to 64, instead of having to potentially sign-extend int function args to 64-bit with movsxd, and then imul rdx, rax or something? Some older CPUs have slower 64-bit imul, and it may cost extra movsxd instructions. But can overflow
So TL:DR: are you recommending trading correctness with large sizes for a tiny bit of code-size and speed? If so, say so. If not, please explain this surprising claim.

Stack Exchange Network

C code to convert hexadecimal string to base64

2 Answers 2

Headers

Macros

Don't use fixed-size types unnecessarily

Character set assumptions

Don't assume allocation succeeds

Control flow

Test program

Temporary array

Output format ambiguity

Modified code

Use standard library for hex → integer conversion

Syntax words used in functions

Standard names in headers used in functions

Efficient expressions

You must log in to answer this question.

Hot Network Questions

C code to convert hexadecimal string to base64

2 Answers 2

Headers

Macros

Don't use fixed-size types unnecessarily

Character set assumptions

Don't assume allocation succeeds

Control flow

Test program

Temporary array

Output format ambiguity

Modified code

Use standard library for hex → integer conversion

Syntax words used in functions

Standard names in headers used in functions

Efficient expressions

You must log in to answer this question.

Related

Hot Network Questions