I've written a program to convert a hex-encoded string to base64, and here is my code. My main concerns are:
- Optimizations - Is my code sufficiently optimized and if any more optimization is possible.
- Memory leaks - Is there any scope for memory leaks.
- Readability
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <stdbool.h>
#include <string.h>
#include <math.h>
/*
Macros for :
1. Returning integer value of the equivalent hex digit.
2. Returning base64 digit for a decimal value.
3. Checks if two numbers are equal or not.
*/
#define return_hex(a) ((a >= 97) ? (a - 97 + 10) : (a - 48))
#define return_base64(a) (((a < 26) ? 1 : 0) == 1 ? (a + 65) : (((a < 52) ? 1 : 0) == 1) ? (a + 97 - 26) : (((a < 62) ? 1 : 0) == 1) ? (a + 48 - 52) : (((a == 62) ? 1 : 0) == 1) ? 43 : 47)
#define check_not_equal(a, b) ((a == b) ? false : true)
/*
Helper function to convert an 8 bit integer array to a base64 string.
params : unsigned 8 bit integer array, array size
returns : string
*/
char *convert_helper(uint8_t *hex_array, int n) {
uint8_t mask1 = 252, mask2 = 240, mask3 = 192, c;
char *base64 = malloc(sizeof(char) * n);
int size = 0, temp = 0;
for(int i = 0; i < n; i += 3, temp++) {
if(check_not_equal(size++, n)) {
c = (mask1 & hex_array[i]) >> 2;
base64[temp * 4] = return_base64(c);
}
else
break;
if(check_not_equal(size++, n)) {
c = (((255 - mask1) & hex_array[i]) << 4) | ((mask2 & hex_array[i + 1]) >> 4);
base64[temp * 4 + 1] = return_base64(c);
}
else
break;
if(check_not_equal(size++, n)) {
c = (((255 - mask2) & hex_array[i + 1]) << 2) | ((mask3 & hex_array[i + 2]) >> 6);
base64[temp * 4 + 2] = return_base64(c);
}
else
break;
if(check_not_equal(size++, n)) {
c = (255 - mask3) & hex_array[i + 2];
base64[temp * 4 + 3] = return_base64(c);
}
else
break;
}
return base64;
}
/*
Function to convert a hexadecimal-encoded string to a base64-encoded string.
params : hexadecimal-encoded string, string size
returns : string
*/
char *convert_to_base64(char *hex_string, int n) {
int size = ceil((n * 4) / 6.0);
uint8_t *hex_binary_array = malloc(sizeof(char) * size);
for(int i = 0; i < n; i += 2) {
if(i == n - 1) {
hex_binary_array[i / 2] = return_hex(hex_string[i]) << 4;
}
else {
uint8_t first = return_hex(hex_string[i]), second = return_hex(hex_string[i + 1]);
hex_binary_array[i / 2] = first << 4 | second;
}
}
char *base64_string = convert_helper(hex_binary_array, size);
free(hex_binary_array);
return base64_string;
}
int main()
{
/*
//Test for returning base64 values.
printf("%c", return_base64(4)); //returns E
printf("%c", return_base64(19)); //returns T
printf("%c", return_base64(37)); //returns l
printf("%c", return_base64(49)); //returns x
printf("%c", return_base64(54)); //returns 2
printf("%c", return_base64(59)); //returns 7
printf("%c", return_base64(62)); //returns +
printf("%c", return_base64(63)); //returns /
*/
char *string;
/*
//Test 1
string = convert_to_base64("deadbeef", 8);
printf("%s\n", string);
printf("%d", strcmp("3q2+7w", string));
free(string);
//Test 2
string = convert_to_base64("49276d206b696c6c696e6720796f757220627261696e206c696b65206120706f69736f6e6f7573206d757368726f6f6d", 96);
printf("%s\n", string);
printf("%d", strcmp("SSdtIGtpbGxpbmcgeW91ciBicmFpbiBsaWtlIGEgcG9pc29ub3VzIG11c2hyb29t", string));
free(string);
*/
string = convert_to_base64("fd347a67d1e", 11);
printf("%s\n", string);
return 0;
}
2 Answers 2
Headers
I don't see any need for <string.h>
. We can drop <stdbool.h>
if we rewrite check_not_equal
:
#define check_not_equal(a, b) ((a) != (b))
Personally, I wouldn't use a macro for that - it just obfuscates the code.
<stdio.h>
is used only by the tests. We may compile them separately if we want to use the function in other programs.
And we can eliminate the need for <math.h>
using a well-known technique for integer ceiling divide:
int ceil_div(int n, int m) {
/* assumes n+m-1 doesn't overflow */
/* A bulletproof version is only slightly more complex. */
return (n+m-1) / m;
}
That's likely more efficient than using floating-point, too.
Macros
Don't use macros where functions would work equally well or better. In particular, we have lots of macros here that expand their arguments multiple times, which can cause surprise when they have side-effects.
Always use all-caps names for macros, as a warning to the reader that they are not functions.
Using functions is the right thing here - any decent compiler will inline for you when that produces better code.
And what's the use of the ternary expression like (((a < 52) ? 1 : 0) == 1)
? That's simply (a < 52)
.
It's simpler, clearer and more efficient just to index a lookup table for each of those:
/* a in "0123456789abcdef" */
static int from_hex(char c)
{
switch (c) {
default: return c - '0'; /* digits are guaranteed consecutive */*/
case 'a': case 'A': return 10;
case 'b': case 'B': return 11;
case 'c': case 'C': return 12;
case 'd': case 'D': return 13;
case 'e': case 'E': return 14;
case 'f': case 'F': return 15;
}
}
/* 0 <= a <= 63 */
static char to_base64(int a)
{
return "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789/+"[a];
}
Or just declare the strings, and let the code do the indexing.
Don't use fixed-size types unnecessarily
uint8_t
isn't available everywhere (notably, whenever CHAR_BIT
is greater than 8). But we don't exactly 8 bits - we can use uint8_fast_t
or simply unsigned char
for our purposes.
Character set assumptions
The code seems to assume that the output characters are ASCII-encoded (which includes Latin-1 and UTF-8, but not EBCDIC). Worse than that, it uses decimal code-point numbers (e.g. 65
) where character constants ('a'
) would be easier to understand.
Don't assume allocation succeeds
malloc()
can return a null pointer. Dereferencing a null pointer is undefined behaviour, so just report that and exit early; the usual way is to return a null pointer to mean "could not allocate":
char *base64 = malloc(n);
if (!base64) { return base64; }
Note that multiplying by sizeof (char)
is pointless, since that's 1 by definition (sizeof
yields the number of char
needed to represent a type).
Control flow
This structure can be simplified:
if (size++ != n) { c = (mask1 & hex_array[i]) >> 2; base64[temp * 4] = return_base64(c); } else break;
It's good practice to use braces {
...}
even for single-statement blocks, and especially to be consistent both sides of else
. But we don't need else
here if we invert the test:
if (size++ == n) {
break;
}
c = (mask1 & hex_array[i]) >> 2;
base64[temp * 4] = return_base64(c);
Test program
It's good that we have tests. We can make it better by having self-checking tests instead of needing to inspect program output.
We could write that with a couple of helper functions:
#include <stdio.h>
static unsigned test_base64_char(const char *file, int line,
int n, char expected)
{
char actual = return_base64(n);
if (actual == expected) {
return 0;
}
fprintf(stderr, "%s:%d: return_base64(%d) returned '%c'; expected '%c'\n",
file, line, n, actual, expected);
return 1;
}
static unsigned test_base64_str(const char *file, int line,
const char *input, const char *expected)
{
char *actual = convert_to_base64(input);
if (!actual) {
fprintf(stderr, "%s:%d: return_base64(\"%s\") returned null; expected \"%s\"\n",
file, line, input, expected);
return 1;
}
if (!strcmp(actual, expected)) {
free(actual);
return 0;
}
fprintf(stderr, "%s:%d: return_base64(\"%s\") returned \"%s\"; expected \"%s\"\n",
file, line, input, actual, expected);
free(actual);
return 1;
}
Some macros to add the source location of the tests:
#define TEST_BASE64_CHAR(n, expected) \
failures += test_base64_char(__FILE__, __LINE__, n, expected)
#define TEST_BASE64_STR(n, expected) \
failures += test_base64_str(__FILE__, __LINE__, n, expected)
And a main()
function to run them:
int main(void)
{
unsigned failures = 0;
TEST_BASE64_CHAR(4, 'E');
TEST_BASE64_CHAR(19, 'T');
TEST_BASE64_CHAR(37, 'l');
TEST_BASE64_CHAR(49, 'x');
TEST_BASE64_CHAR(54, '2');
TEST_BASE64_CHAR(59, '7');
TEST_BASE64_CHAR(62, '+');
TEST_BASE64_CHAR(63, '/');
TEST_BASE64_STR("deadbeef", "3q2+7w");
TEST_BASE64_STR("49276d206b696c6c696e6720796f757220627261696e206c696b65206120706f69736f6e6f7573206d757368726f6f6d",
"SSdtIGtpbGxpbmcgeW91ciBicmFpbiBsaWtlIGEgcG9pc29ub3VzIG11c2hyb29t");
printf("Tests completed with %u failures.\n", failures);
return failures > 0;
}
Temporary array
The code works in two phases, reflected in the two functions, with a helper array allocated to hold the binary representation of the hex string. But that helper is unnecessary - we should be able to work directly from the hex chars, converting each group of three into two base-64 chars.
/*
Function to convert a hexadecimal-encoded string to a base64-encoded string.
params : hexadecimal-encoded string
returns : string - or a null pointer if allocation failed
The return value must be deallocated using free()
*/
char *convert_to_base64(const char *hex_string)
{
size_t input_len = strlen(hex_string);
size_t output_len = (input_len * 2 + 2) / 3 + 1;
char *const out_buf = malloc(output_len);
if (!out_buf) {
return NULL;
}
char *out = out_buf;
while (hex_string[0] && hex_string[1] && hex_string[2]) {
/* convert three hex digits to two base-64 chars */
int digit[3] = {
from_hex(hex_string[0]),
from_hex(hex_string[1]),
from_hex(hex_string[2])
};
*out++ = to_base64((digit[0] << 2) + (digit[1] >> 2));
*out++ = to_base64(((digit[1] & 3) << 4) + (digit[2]));
hex_string += 3;
}
/* Now deal with leftover chars */
if (hex_string[0] && hex_string[1]) {
/* convert three hex digits to two base-64 chars */
int digit[2] = {
from_hex(hex_string[0]),
from_hex(hex_string[1])
};
*out++ = to_base64((digit[0] << 2) + (digit[1] >> 2));
*out++ = to_base64(((digit[1] & 3) << 4));
} else if (hex_string[0]) {
/* convert three hex digits to two base-64 chars */
int digit = from_hex(hex_string[0]);
*out++ = to_base64(digit << 2);
}
*out = '0円';
return out_buf;
}
Output format ambiguity
There's no indication of how many blank characters were encoded, unlike standard base-64 formats, which use one or two trailing =
to indicate this. That will break round-trip of inputs such as "1d"
, which is indistinguishable from "1d0"
after encoding (both produce `"HQ" as output).
Modified code
Header file hex_to_base64.h
:
#ifndef HEX_TO_BASE64_H
#define HEX_TO_BASE64_H
/*
Function to convert a hexadecimal-encoded string to a base64-encoded string.
params : hexadecimal-encoded string
returns : string - or a null pointer if allocation failed
The return value must be deallocated using free().
*/
char *hex_to_base64(const char *hex_string);
#endif
Implementation:
#include "hex_to_base64.h"
#include <stdlib.h>
#include <string.h>
/* a in "0123456789abcdef" */
static int from_hex(char c)
{
switch (c) {
default: return c - '0'; /* digits are guaranteed consecutive */
case 'a': case 'A': return 10;
case 'b': case 'B': return 11;
case 'c': case 'C': return 12;
case 'd': case 'D': return 13;
case 'e': case 'E': return 14;
case 'f': case 'F': return 15;
}
}
char *hex_to_base64(const char *hex_string)
{
static const char base64[] =
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
size_t input_len = strlen(hex_string);
size_t output_len = (input_len * 2 + 2) / 3 /* base64 chars */
+ 2 - (input_len - 1) % 3 /* padding '=' or '==' */
+ 1; /* string terminator */
char *const out_buf = malloc(output_len);
if (!out_buf) {
return NULL;
}
char *out = out_buf;
while (hex_string[0] && hex_string[1] && hex_string[2]) {
/* convert three hex digits to two base-64 chars */
int digit[3] = {
from_hex(hex_string[0]),
from_hex(hex_string[1]),
from_hex(hex_string[2])
};
*out++ = base64[(digit[0] << 2) + (digit[1] >> 2)];
*out++ = base64[((digit[1] & 3) << 4) + (digit[2])];
hex_string += 3;
}
/* Now deal with leftover chars */
if (hex_string[0] && hex_string[1]) {
/* convert two hex digits to two base-64 chars */
int digit[2] = {
from_hex(hex_string[0]),
from_hex(hex_string[1])
};
*out++ = base64[(digit[0] << 2) + (digit[1] >> 2)];
*out++ = base64[((digit[1] & 3) << 4)];
*out++ = '=';
} else if (hex_string[0]) {
/* convert one hex digit to one base-64 char */
int digit = from_hex(hex_string[0]);
*out++ = base64[digit << 2];
*out++ = '=';
*out++ = '=';
}
*out = '0円';
return out_buf;
}
Tests:
#include <stdio.h>
static unsigned test_base64_str(const char *file, int line,
const char *input, const char *expected)
{
char *actual = hex_to_base64(input);
if (!actual) {
fprintf(stderr, "%s:%d: to_base64(\"%s\") returned null; expected \"%s\"\n",
file, line, input, expected);
return 1;
}
if (!strcmp(actual, expected)) {
free(actual);
return 0;
}
fprintf(stderr, "%s:%d: to_base64(\"%s\") returned \"%s\"; expected \"%s\"\n",
file, line, input, actual, expected);
free(actual);
return 1;
}
#define TEST_BASE64_STR(n, expected) \
failures += test_base64_str(__FILE__, __LINE__, n, expected)
int main(void)
{
unsigned failures = 0;
TEST_BASE64_STR("1d", "HQ=");
TEST_BASE64_STR("1d0", "HQ");
TEST_BASE64_STR("deadbeef", "3q2+7w=");
TEST_BASE64_STR("49276d206b696c6c696e6720796f757220627261696e206c696b65206120706f69736f6e6f7573206d757368726f6f6d",
"SSdtIGtpbGxpbmcgeW91ciBicmFpbiBsaWtlIGEgcG9pc29ub3VzIG11c2hyb29t");
printf("Tests completed with %u failures.\n", failures);
return failures > 0;
}
Use standard library for hex → integer conversion
We could use standard-library sscanf()
with a format string of %3x
to read the groups of three hex digits instead of coding our own from_hex()
. It can also tell us (using %n
) how many characters were converted. That gives a simpler implementation:
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char *hex_to_base64(const char *hex_string)
{
static const char base64[] =
"ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
size_t input_len = strlen(hex_string);
size_t output_len = (input_len * 2 + 2) / 3 /* base64 chars */
+ 2 - (input_len - 1) % 3 /* padding '=' or '==' */
+ 1; /* string terminator */
char *const out_buf = malloc(output_len);
if (!out_buf) {
return out_buf;
}
unsigned int digits;
int d_len;
char *out = out_buf;
while (*hex_string) {
if (sscanf(hex_string, "%3x%n", &digits, &d_len) != 1) {
/* parse error */
free(out_buf);
return NULL;
}
switch (d_len) {
case 3:
*out++ = base64[digits >> 6];
*out++ = base64[digits & 0x3f];
break;
case 2:
digits <<= 4;
*out++ = base64[digits >> 6];
*out++ = base64[digits & 0x3f];
*out++ = '=';
break;
case 1:
*out++ = base64[digits];
*out++ = '=';
*out++ = '=';
}
hex_string += d_len;
}
*out++ = '0円';
assert(out == out_buf + output_len);
return out_buf;
}
-
\$\begingroup\$ Thanks a lot! Your pointers are super helpful, and I am getting to learn some really good practices. Thanks again! \$\endgroup\$Vedant Jadhav– Vedant Jadhav2022年01月22日 13:50:17 +00:00Commented Jan 22, 2022 at 13:50
-
1\$\begingroup\$ I was going to review this, but you've covered exactly everything I'd have said. Good review! \$\endgroup\$Edward– Edward2022年01月22日 14:54:49 +00:00Commented Jan 22, 2022 at 14:54
-
\$\begingroup\$ @Cole: Digits 0-9 are guaranteed consecutive: paragraph 5.2.1.3 says, "In both the source and execution basic character sets, the value of each character after 0 in the above list of decimal digits shall be one greater than the value of the previous." \$\endgroup\$Toby Speight– Toby Speight2022年01月26日 08:39:42 +00:00Commented Jan 26, 2022 at 8:39
-
\$\begingroup\$ @Toby My bad. EBCDIC has the letters be non sequential, not numbers. That’s what I get for commenting when I’m tired... \$\endgroup\$Cole Tobin– Cole Tobin2022年01月26日 12:13:01 +00:00Commented Jan 26, 2022 at 12:13
Syntax words used in functions
I myself don't like the use of a syntax word in a function name like
return_base64()
. This is due to the fact that some older code highlighting editors may change color of the part return
.
Despite the fact that it is in a function definition, many older editors are unable to distinguish it.
When declaring function names avoid using any syntax words.
In a function declaration, there is just one syntax word which could be used, and that is switch
. Maybe also do_****
.
When the programmer is really tired after working long hours he/she will only see the return and thinks it is a return statement.
Standard names in headers used in functions
And don't use function names that are too close to function names in most of the standard header files. Especially the functions that begins with str___
.
It is very easy to look at the code and think that a user defined str___
function as part of <string.h>
. It will be tough for the future developer to implement functions as a result of this.
Efficient expressions
Besides that, the line:
uint8_t *hex_binary_array = malloc(sizeof(char) * size);
Should be written like this:
uint8_t *hex_binary_array = malloc(size * sizeof(char));
This is due to the fact that if there is a more extensive expression inside the parenthesis, older compilers might process it more effectively.
-
2\$\begingroup\$ Allocations are actually best written with the
sizeof
expression first, to avoid integer overflow. For example, givenunsigned rows, unsigned cols;
, thenint *array = malloc(rows * cols * sizeof *array);
is computed withint rows * cols
before promoting tosize_t
. Whereasint *array = malloc(sizeof *array * rows * cols);
all the multiplications aresize_t
, reducing overflow risk. \$\endgroup\$Toby Speight– Toby Speight2022年01月23日 10:21:16 +00:00Commented Jan 23, 2022 at 10:21 -
1\$\begingroup\$ I agree that using
return
in a function name is poor - my view is that it tells us something we already know, so just acts as clutter. And that's with a capable editor that doesn't screw up syntax highlighting. A lot of authors seem to like aget_
prefix, which is similarly useless. \$\endgroup\$Toby Speight– Toby Speight2022年01月23日 10:23:37 +00:00Commented Jan 23, 2022 at 10:23 -
1\$\begingroup\$ Technically, the standard only reserves names that begin with
str
followed by a letter, but I agree thatstr_
is confusingly close, particularly for those (a majority, probably) that haven't memorised the exact rules! \$\endgroup\$Toby Speight– Toby Speight2022年01月23日 10:28:42 +00:00Commented Jan 23, 2022 at 10:28 -
1\$\begingroup\$ @Sven: More efficient how? Do you have an example of inefficient asm that an old compiler makes for some ISA? As Toby points out, there's a potential correctness difference; is your potential speedup only from assuming an
int * int
won't overflow anint
result? (i.e. letting an x86-64 compiler useimul ecx, edx
to produce a 32-bit result zero-extended to 64, instead of having to potentially sign-extendint
function args to 64-bit withmovsxd
, and thenimul rdx, rax
or something? Some older CPUs have slower 64-bit imul, and it may cost extra movsxd instructions. But can overflow \$\endgroup\$Peter Cordes– Peter Cordes2022年01月23日 20:28:43 +00:00Commented Jan 23, 2022 at 20:28 -
2\$\begingroup\$ So TL:DR: are you recommending trading correctness with large sizes for a tiny bit of code-size and speed? If so, say so. If not, please explain this surprising claim. \$\endgroup\$Peter Cordes– Peter Cordes2022年01月23日 20:29:59 +00:00Commented Jan 23, 2022 at 20:29
'a'
not 97. Perhaps email me to[email protected]
for more \$\endgroup\$