Implementing a DNA codon table in C

Question 1

I decided to redo the first program I wrote (after the hello world thing). So, what it does is to convert a DNA sequence string to an array of strings that represent the amino acids that are made in a cell by ribosomes when that DNA sequence gets used (DNA codon table).

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char* codon_table[4][4][4] =
{
 {
 {"Phe", "Phe", "Len", "Len"},
 {"Ser", "Ser", "Ser", "Ser"},
 {"Tyr", "Tyr", "Stop", "Stop"},
 {"Cys", "Cys", "Stop", "Trp"}
 },
 {
 {"Len", "Len", "Len", "Len"},
 {"Pro", "Pro", "Pro", "Pro"},
 {"His", "His", "Gln", "Gln"},
 {"Arg", "Arg", "Arg", "Arg"}
 },
 {
 {"Ile", "Ile", "Ile", "Met"},
 {"Thr", "Thr", "Thr", "Thr"},
 {"Asn", "Asn", "Lys", "Lys"},
 {"Ser", "Ser", "Arg", "Arg"}
 },
 {
 {"Val", "Val", "Val", "Val"},
 {"Ala", "Ala", "Ala", "Ala"},
 {"Asp", "Asp", "Glu", "Glu"},
 {"Gly", "Gly", "Gly", "Gly"}
 }
};
int check_nucleobase_string(char* nucleobase_string)
{
 if(nucleobase_string==NULL)
 {
 return 1;
 }
 for(size_t i = 0; nucleobase_string[i] != '0円'; i+=1)
 {
 if(
 nucleobase_string[i] != 't' &&
 nucleobase_string[i] != 'c' &&
 nucleobase_string[i] != 'a' &&
 nucleobase_string[i] != 'g'
 )
 {
 return 0;
 }
 }
 return 1;
}
size_t nucleobase_to_aminoacid(char* nucleobase_string, char*** aminoacid_string)
{
 if(!check_nucleobase_string(nucleobase_string))
 {
 printf("Erroneous input: %s\n", nucleobase_string);
 return 0;
 }
 size_t nucleobase_count ;
 for(nucleobase_count = 0; nucleobase_string[nucleobase_count] != '0円'; nucleobase_count+=1);
 size_t aminoacid_count = nucleobase_count/3;
 (*aminoacid_string) = malloc(aminoacid_count);
 for(size_t i = 0, slow_i = 0; slow_i < aminoacid_count; i+=3, slow_i+=1)
 {
 (*aminoacid_string)[slow_i] =
 codon_table[
 nucleobase_string[i] == 't' ? 0 :
 nucleobase_string[i] == 'c' ? 1 :
 nucleobase_string[i] == 'a' ? 2 :
 nucleobase_string[i] == 'g' ? 3 : -1
 ][
 nucleobase_string[i+1] == 't' ? 0 :
 nucleobase_string[i+1] == 'c' ? 1 :
 nucleobase_string[i+1] == 'a' ? 2 :
 nucleobase_string[i+1] == 'g' ? 3 : -1
 ][
 nucleobase_string[i+2] == 't' ? 0 :
 nucleobase_string[i+2] == 'c' ? 1 :
 nucleobase_string[i+2] == 'a' ? 2 :
 nucleobase_string[i+2] == 'g' ? 3 : -1
 ];
 }
 return aminoacid_count;
}
int main()
{
 printf("Enter DNA sequence in 5' → 3' direction.\n");
 char* input_string = NULL;
 size_t n = 0;
 ssize_t line_length = getline(&input_string, &n, stdin);
 if(line_length == -1)
 {
 printf("Failed to read a line.\n");
 return 0;
 }
 char* nucleobase_string = malloc(line_length);
 memcpy(nucleobase_string, input_string, line_length-1);
 nucleobase_string[line_length-1] = '0円';
 char** aminoacid_string = NULL;
 size_t aminoacid_count = nucleobase_to_aminoacid(nucleobase_string, &aminoacid_string);
 for(size_t i = 0; i < aminoacid_count; i+=1)
 {
 if(i!=0)
 {
 printf(" + ");
 }
 printf("%s", aminoacid_string[i]);
 }
 printf("\n");
 return 0;
};

Question 2

An array of char* is not a great data structure for storing AA or codon sequence data. A pointer takes 4 or 8 bytes per codon, but there are only 64 possible codons so you could easily store one codon per byte as a uint8_t. And with only about 22 amino acids, that's close to being able to fit 2 AAs per byte, but it doesn't quite work and it's definitely easier to still use a uint8_t.

Then when you want to print, use that integer as an index into a table of strings.

Or if all you want to do is print out a string, then don't construct an array of strings at all, and just printf as you convert. Or memcpy 4-byte strings like "Phe " into a buffer to construct a string yourself instead of calling printf separately for each AA name.

(Compilers are good at 4-byte copies, because that's the size of an integer register. Especially from constant data, because they can inline the data as an immediate if that's optimal. So make your codon table static const)

slow_i is an odd variable name. Use aa_idx or something else meaningful, or maybe j. Or use one i variable and use i*3 + 0, i*3 + 1, and i*3 + 2 to let the compiler figure out what it wants to do.

This very long line: for(nucleobase_count = 0; nucleobase_string[nucleobase_count] != '0円'; nucleobase_count+=1); should be nucleobase_count = strlen(nucleobase_string);. Use library functions instead of re-inventing the wheel in a hard-to-read long line. The libc string functions usually have hand-optimized asm implementations that will be fast especially for long strings.

Or even better, have your nucleotide checking function return a count if it doesn't find problem! You're already looping over the whole string; you don't need to do it again to count the length.

Or even better than that, have your caller pass you the length so you can allocate memory right away and combine checking with converting. You could have the caller pass you an already-allocated output buffer as well as the length, so it has the option of using a C99 variable-length-array, reusing a buffer it already has allocated, or whatever, instead of baking malloc into your conversion routine.

 codon_table[ // your code
 nucleobase_string[i] == 't' ? 0 :
 nucleobase_string[i] == 'c' ? 1 :
 nucleobase_string[i] == 'a' ? 2 :
 nucleobase_string[i] == 'g' ? 3 : -1
 ][...][...]

If a bad nucleotide did slip through the cracks and make it to this code, you'd have undefined behaviour from reading codon_table[-1][0][0]; out of bounds access. You could remove the == g part and just have 3 be the last step in the ternary operator, with a comment to indicate that it's g if it wasn't one of the other 3. Also, that block of ternary operators should go in a tiny helper function like int base_to_idx(char base); instead of being repeated 3 times.

Then you can re-implement it with a table lookup or a perfect hash function. e.g.

// probably not faster than a table lookup, unless you make a SIMD version
/* 
 'a' 0b1100001 -> 0b00
 'c' 0b1100011 -> 0b10
 'g' 0b1100111 -> 0b01
 't' 0b1110100 -> 0b11
*/ 
unsigned nuc_to_idx(unsigned base) {
 if (base == 'a' || base == 'c' || base == 'g' || base == 't')
 return (base*3 & 0x3<<2)>>2;
 return -128U;
}

or a table lookup version that does a whole codon and returns an integer code. I used C designated-initializer syntax, plus a GNU extension to for ranges to make the table static const. You could also init it at runtime with memset(table, -1, UCHAR_MAX); and then assign the 4 valid values.

#include <stdint.h>
#include <limits.h>
// int8_t sign extends to the width of int / unsigned int
// so all the high bits get set before treating as unsigned (assuming 2's complement), for BAD_NUC inputs.
// But we don't depend on that: we still get an unsigned value > 64
// on sign/magnitude or one's complement machines if any of the 3 chars is bogus.
#define BAD_NUC -128
static const int8_t nuc_table[UCHAR_MAX+1] = {
 [0 ... 255] = BAD_NUC, // ranges are a GNU extension
 // last init takes precedence https://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html
 ['a'] = 0,
 ['c'] = 1,
 ['g'] = 2,
 ['t'] = 3,
};
// Fun fact: doesn't depend on ASCII encoding because of the designated-initializers.
// unsigned char* so high-ASCII -> 128..255, not negative,
// and works as an index into a 256 entry table
unsigned codon_to_idx_LUT(unsigned char *p) {
 unsigned idx = nuc_table[p[0]];
 idx = idx*4 + nuc_table[p[1]];
 idx = idx*4 + nuc_table[p[2]];
 return idx;
 // 0..63 for in-range values, otherwise has higher bits set
 // so you only have to range-check once per codon, speeding up the common case of no error.
 // to map to AA codes, maybe use a lookup table on this 0..63 value
}

The table version compiles really efficiently on x86 with gcc -O3 (on the Godbolt compiler explorer), or hopefully with any sane compiler. (The if condition in the perfect-hash version does compile to an immediate bitmap, which is neat, but it still has to test/branch multiple times for every character instead of just doing 2 loads.)

codon_to_idx_LUT: # gcc7.3 -O3 for x86-64 System V calling convention: pointer arg in RDI
 movzx eax, BYTE PTR [rdi]
 movsx edx, BYTE PTR nuc_table[rax]
 movzx eax, BYTE PTR [rdi+1]
 movsx eax, BYTE PTR nuc_table[rax]
 lea edx, [rax+rdx*4] # this is why I wrote it as idx*4 + nuc_table[...] instead of some other way of multiplying and adding the results.
 movzx eax, BYTE PTR [rdi+2]
 movsx eax, BYTE PTR nuc_table[rax]
 lea eax, [rax+rdx*4]
 ret

This would run even faster if the compiler did one wider load and unpacked it with ALU instructions instead of loading each string byte separately. Running this in a loop is going to bottleneck on 2 loads per clock on mainstream Intel/AMD CPUs, so about 1 nucleotide per clock cycle, which is not bad. The overhead of checking the index for in-range (to detect bogus nucleotide characters) should be negligible compared to the string loads + table lookups. Especially on CPUs that can only do one load per clock.

(Related this Q&A for more about tweaking your C source to hand-hold the compiler into making better asm. You could do that here, but only at the cost of readability, I think. You'd probably have to do a dodgy pointer-cast or a safe memcpy and read 4 bytes as a uint32_t and unpack that.)

You can use this 0..63 codon as an index into a table of strings for printing. (Just flatten your current 3D array, although I didn't make any effort to be consistent with which numbers go with which DNA base in the hash or table vs. your version.)

Question 3

I was going to write a very similar answer. Thanks for saving me time.

Question 4

A clear example of using BAD_NUC is not shown (set, but not read). Saving it in a signed int8_t and unsigned unsigned could pose concerns. IMO, a positive value like #define BAD_NUC 127 is less problematic. The idea is good, but not sufficient fleshed out.

Question 5

@chux: Yeah, I didn't spend as much time as I could have on that part of the answer. I was personally more interested in something that compiled well with gcc on x86, where I know how an int8_t will sign-extend. It's very convoluted, but I think it's safe on any C implementation (sign/magnitude or one's complement), but I might be depending on an assumption about where the sign bits are when casting to unsigned, especially if it's not a type-pun. I kinda wanted something with the high bit set in the final result on out-of-range because that's slightly cheaper for a compiler to check :P

Question 6

@chux: TL:DR that part isn't great well-thought-out code-review advice, it's more like leftovers from what I was playing around with.

Question 7

@chux: heh, good point about int8_t requiring 2's complement, I forgot I was getting that guarantee. I used it instead of signed char because I wanted it to fail to compile instead of creating a gigantic initialized array, on systems like DSPs where char is a 32-bit type. (So there's probably no int8_t.)

Question 8

The line

 (*aminoacid_string) = malloc(aminoacid_count);

allocates aminoacid_count of bytes. The code needs that many pointers. The correct way is to

 *aminoacid_string = malloc(aminoacid_count * sizeof(**aminoacid_string));

Testing for the correct input (i.e. calling check_nucleobase_string) belongs to main.
I recommend to translate the nucleobase string to a numeric array (mapping t to 0 etc), and avoid ugly cascade of ternary operators.

The loop

 for(nucleobase_count = 0; nucleobase_string[nucleobase_count] != '0円'; nucleobase_count+=1);

is a very long (and inefficient) way to say

 nucleobase_count = strlen(nucleobase_string);

I don't see the reason to
```
 memcpy(nucleobase_string, input_string, line_length-1);
```
You may work directly with input_string.

Question 9

@1201ProgramAlarm It is null terminated. The ` = '0円';` assignment kills the newline, and it can be done directly on input_string. No copy is necessary.

Question 10

I did not want to work directly with input_string because it also contains the delimiter from getline (\n).

Question 11

@DariusDuesentrieb: but you don't need to preserve that. You can just modify input_string and discard the input line endings or lack thereof. If you ever want to print it out again, you can use printf("%s\n", input_string); or whatever to normalize the output to have every line end with a newline.

Question 12

Is malloc(foo * sizeof(bar)) really the correct way? Not calloc(foo, sizeof(bar))?

Question 13

@PeterTaylor Depends on the intent. If the next thing you do is to fill it up with data, there is no reason to spend time initializing it with zeroes.

Question 14

You should try to refactor the code to have less hard-coded constants. E.g. nucleobase_to_aminoacid has both tcag and codon_table hard-coded. That is in general something that hinders re-use.

You should preferably have a separate function to print the codon_table - in a standard format. This would have detected that you used "Len" instead of "Leu" as abbreviation of "Leucine".
If the decoding of tcag was a separate function - or an input - you could re-use the same function for mRNA.
If codon_table was input to nucleobase_to_aminoacid you could re-use the same code for the odd cases with different codon_table.

Question 15

In addition to current (and future) answers:

You use malloc(), but from what I can see, you do not free() it later on. In my case valgrind complains that 125 bytes in 2 blocks are "definitely lost" - meaning you've got memory leaks.
It is a common notation to use ++i instead of i += 1, especially in for loops.
Use the const qualifier where applicable. While the compiler may do such optimization for you (although not as often as you would think), it is useful to indicate that something will not be changed.

For example, char *codon_table can be char const *const codon_table. If you use the -Wconversion flag when compiling with GCC, you can catch all sorts of errors to do with type conversions as well.

Question 16

Agree with the value of the first const in char const *const codon_table. The 2nd const is unnecessary as char* codon_table[4][4][4] is an array and arrays cannot be assigned.

Question 17

Not much left to review given the other good answers. Some additional ideas:

Lopping off a `'\n'`

Certainly OP meant to lop off the usual trailing '\n'. The below lops off the last character from a line of input, regardless if it is a '\n' or not.

nucleobase_string[line_length-1] = '0円'; // May lop off something else.

Consider the end of piped user input may not end with a '\n', so although a trailing '\n' is very common, it is not a cerrtainty. Recommend to test for '\n'.

if (nucleobase_string[line_length-1] == '\n') {
 nucleobase_string[line_length-1] = '0円';
}
// or the one liner (slower)
nucleobase_string[strcspn(nucleobase_string, "\n")] = '0円';

House keeping

Good coding practice would free input_string in the end. Of course, such allocations are implicitly free'd when the program ends. Yet with an explicit free(), someone re-using this portion of code would benefit.

free(input_string);
return 0;

Question 18

Use of `stderr`

In several parts of your program you use printf(...) to:

ask user to provide input
output error messages
output result

printf function writes to stdout. You might consider using fprintf(stderr, ...) for asking for user input and error log.

Exit status

Also it is common to return nonzero status on program error calling exit(EXIT_FAILURE); or return EXIT_FAILURE; from main().

This way one may use your program like:

#!/bin/sh
result="$(echo 'whatever' | ./your_program 2>/dev/null)"
if [ $? -ne 0 ]; then
 # failed
fi

Question 19

@chux, wow, I was editing my answer because of this, when you replied (: Thanks anyway

Peter Cordes Peter Cordes 3,75118 silver badges28 bronze badges · Accepted Answer · 2018-04-11 10:10:51Z