I decided to redo the first program I wrote (after the hello world thing). So, what it does is to convert a DNA sequence string to an array of strings that represent the amino acids that are made in a cell by ribosomes when that DNA sequence gets used (DNA codon table).
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
char* codon_table[4][4][4] =
{
{
{"Phe", "Phe", "Len", "Len"},
{"Ser", "Ser", "Ser", "Ser"},
{"Tyr", "Tyr", "Stop", "Stop"},
{"Cys", "Cys", "Stop", "Trp"}
},
{
{"Len", "Len", "Len", "Len"},
{"Pro", "Pro", "Pro", "Pro"},
{"His", "His", "Gln", "Gln"},
{"Arg", "Arg", "Arg", "Arg"}
},
{
{"Ile", "Ile", "Ile", "Met"},
{"Thr", "Thr", "Thr", "Thr"},
{"Asn", "Asn", "Lys", "Lys"},
{"Ser", "Ser", "Arg", "Arg"}
},
{
{"Val", "Val", "Val", "Val"},
{"Ala", "Ala", "Ala", "Ala"},
{"Asp", "Asp", "Glu", "Glu"},
{"Gly", "Gly", "Gly", "Gly"}
}
};
int check_nucleobase_string(char* nucleobase_string)
{
if(nucleobase_string==NULL)
{
return 1;
}
for(size_t i = 0; nucleobase_string[i] != '0円'; i+=1)
{
if(
nucleobase_string[i] != 't' &&
nucleobase_string[i] != 'c' &&
nucleobase_string[i] != 'a' &&
nucleobase_string[i] != 'g'
)
{
return 0;
}
}
return 1;
}
size_t nucleobase_to_aminoacid(char* nucleobase_string, char*** aminoacid_string)
{
if(!check_nucleobase_string(nucleobase_string))
{
printf("Erroneous input: %s\n", nucleobase_string);
return 0;
}
size_t nucleobase_count ;
for(nucleobase_count = 0; nucleobase_string[nucleobase_count] != '0円'; nucleobase_count+=1);
size_t aminoacid_count = nucleobase_count/3;
(*aminoacid_string) = malloc(aminoacid_count);
for(size_t i = 0, slow_i = 0; slow_i < aminoacid_count; i+=3, slow_i+=1)
{
(*aminoacid_string)[slow_i] =
codon_table[
nucleobase_string[i] == 't' ? 0 :
nucleobase_string[i] == 'c' ? 1 :
nucleobase_string[i] == 'a' ? 2 :
nucleobase_string[i] == 'g' ? 3 : -1
][
nucleobase_string[i+1] == 't' ? 0 :
nucleobase_string[i+1] == 'c' ? 1 :
nucleobase_string[i+1] == 'a' ? 2 :
nucleobase_string[i+1] == 'g' ? 3 : -1
][
nucleobase_string[i+2] == 't' ? 0 :
nucleobase_string[i+2] == 'c' ? 1 :
nucleobase_string[i+2] == 'a' ? 2 :
nucleobase_string[i+2] == 'g' ? 3 : -1
];
}
return aminoacid_count;
}
int main()
{
printf("Enter DNA sequence in 5' → 3' direction.\n");
char* input_string = NULL;
size_t n = 0;
ssize_t line_length = getline(&input_string, &n, stdin);
if(line_length == -1)
{
printf("Failed to read a line.\n");
return 0;
}
char* nucleobase_string = malloc(line_length);
memcpy(nucleobase_string, input_string, line_length-1);
nucleobase_string[line_length-1] = '0円';
char** aminoacid_string = NULL;
size_t aminoacid_count = nucleobase_to_aminoacid(nucleobase_string, &aminoacid_string);
for(size_t i = 0; i < aminoacid_count; i+=1)
{
if(i!=0)
{
printf(" + ");
}
printf("%s", aminoacid_string[i]);
}
printf("\n");
return 0;
};
6 Answers 6
An array of char*
is not a great data structure for storing AA or codon sequence data. A pointer takes 4 or 8 bytes per codon, but there are only 64 possible codons so you could easily store one codon per byte as a uint8_t
. And with only about 22 amino acids, that's close to being able to fit 2 AAs per byte, but it doesn't quite work and it's definitely easier to still use a uint8_t
.
Then when you want to print, use that integer as an index into a table of strings.
Or if all you want to do is print out a string, then don't construct an array of strings at all, and just printf as you convert. Or memcpy 4-byte strings like "Phe "
into a buffer to construct a string yourself instead of calling printf
separately for each AA name.
(Compilers are good at 4-byte copies, because that's the size of an integer register. Especially from constant data, because they can inline the data as an immediate if that's optimal. So make your codon table static const
)
slow_i
is an odd variable name. Use aa_idx
or something else meaningful, or maybe j
. Or use one i
variable and use i*3 + 0
, i*3 + 1
, and i*3 + 2
to let the compiler figure out what it wants to do.
This very long line: for(nucleobase_count = 0; nucleobase_string[nucleobase_count] != '0円'; nucleobase_count+=1);
should be nucleobase_count = strlen(nucleobase_string);
. Use library functions instead of re-inventing the wheel in a hard-to-read long line. The libc string functions usually have hand-optimized asm implementations that will be fast especially for long strings.
Or even better, have your nucleotide checking function return a count if it doesn't find problem! You're already looping over the whole string; you don't need to do it again to count the length.
Or even better than that, have your caller pass you the length so you can allocate memory right away and combine checking with converting. You could have the caller pass you an already-allocated output buffer as well as the length, so it has the option of using a C99 variable-length-array, reusing a buffer it already has allocated, or whatever, instead of baking malloc
into your conversion routine.
codon_table[ // your code
nucleobase_string[i] == 't' ? 0 :
nucleobase_string[i] == 'c' ? 1 :
nucleobase_string[i] == 'a' ? 2 :
nucleobase_string[i] == 'g' ? 3 : -1
][...][...]
If a bad nucleotide did slip through the cracks and make it to this code, you'd have undefined behaviour from reading codon_table[-1][0][0]
; out of bounds access. You could remove the == g
part and just have 3
be the last step in the ternary operator, with a comment to indicate that it's g
if it wasn't one of the other 3. Also, that block of ternary operators should go in a tiny helper function like int base_to_idx(char base);
instead of being repeated 3 times.
Then you can re-implement it with a table lookup or a perfect hash function. e.g.
// probably not faster than a table lookup, unless you make a SIMD version
/*
'a' 0b1100001 -> 0b00
'c' 0b1100011 -> 0b10
'g' 0b1100111 -> 0b01
't' 0b1110100 -> 0b11
*/
unsigned nuc_to_idx(unsigned base) {
if (base == 'a' || base == 'c' || base == 'g' || base == 't')
return (base*3 & 0x3<<2)>>2;
return -128U;
}
or a table lookup version that does a whole codon and returns an integer code. I used C designated-initializer syntax, plus a GNU extension to for ranges to make the table static const
. You could also init it at runtime with memset(table, -1, UCHAR_MAX);
and then assign the 4 valid values.
#include <stdint.h>
#include <limits.h>
// int8_t sign extends to the width of int / unsigned int
// so all the high bits get set before treating as unsigned (assuming 2's complement), for BAD_NUC inputs.
// But we don't depend on that: we still get an unsigned value > 64
// on sign/magnitude or one's complement machines if any of the 3 chars is bogus.
#define BAD_NUC -128
static const int8_t nuc_table[UCHAR_MAX+1] = {
[0 ... 255] = BAD_NUC, // ranges are a GNU extension
// last init takes precedence https://gcc.gnu.org/onlinedocs/gcc/Designated-Inits.html
['a'] = 0,
['c'] = 1,
['g'] = 2,
['t'] = 3,
};
// Fun fact: doesn't depend on ASCII encoding because of the designated-initializers.
// unsigned char* so high-ASCII -> 128..255, not negative,
// and works as an index into a 256 entry table
unsigned codon_to_idx_LUT(unsigned char *p) {
unsigned idx = nuc_table[p[0]];
idx = idx*4 + nuc_table[p[1]];
idx = idx*4 + nuc_table[p[2]];
return idx;
// 0..63 for in-range values, otherwise has higher bits set
// so you only have to range-check once per codon, speeding up the common case of no error.
// to map to AA codes, maybe use a lookup table on this 0..63 value
}
The table version compiles really efficiently on x86 with gcc -O3 (on the Godbolt compiler explorer), or hopefully with any sane compiler. (The if condition in the perfect-hash version does compile to an immediate bitmap, which is neat, but it still has to test/branch multiple times for every character instead of just doing 2 loads.)
codon_to_idx_LUT: # gcc7.3 -O3 for x86-64 System V calling convention: pointer arg in RDI
movzx eax, BYTE PTR [rdi]
movsx edx, BYTE PTR nuc_table[rax]
movzx eax, BYTE PTR [rdi+1]
movsx eax, BYTE PTR nuc_table[rax]
lea edx, [rax+rdx*4] # this is why I wrote it as idx*4 + nuc_table[...] instead of some other way of multiplying and adding the results.
movzx eax, BYTE PTR [rdi+2]
movsx eax, BYTE PTR nuc_table[rax]
lea eax, [rax+rdx*4]
ret
This would run even faster if the compiler did one wider load and unpacked it with ALU instructions instead of loading each string byte separately. Running this in a loop is going to bottleneck on 2 loads per clock on mainstream Intel/AMD CPUs, so about 1 nucleotide per clock cycle, which is not bad. The overhead of checking the index for in-range (to detect bogus nucleotide characters) should be negligible compared to the string loads + table lookups. Especially on CPUs that can only do one load per clock.
(Related this Q&A for more about tweaking your C source to hand-hold the compiler into making better asm. You could do that here, but only at the cost of readability, I think. You'd probably have to do a dodgy pointer-cast or a safe memcpy
and read 4 bytes as a uint32_t
and unpack that.)
You can use this 0..63 codon as an index into a table of strings for printing. (Just flatten your current 3D array, although I didn't make any effort to be consistent with which numbers go with which DNA base in the hash or table vs. your version.)
-
\$\begingroup\$ I was going to write a very similar answer. Thanks for saving me time. \$\endgroup\$Konrad Rudolph– Konrad Rudolph2018年04月11日 10:28:39 +00:00Commented Apr 11, 2018 at 10:28
-
1\$\begingroup\$ A clear example of using
BAD_NUC
is not shown (set, but not read). Saving it in a signedint8_t
and unsignedunsigned
could pose concerns. IMO, a positive value like#define BAD_NUC 127
is less problematic. The idea is good, but not sufficient fleshed out. \$\endgroup\$chux– chux2018年04月12日 14:30:36 +00:00Commented Apr 12, 2018 at 14:30 -
\$\begingroup\$ @chux: Yeah, I didn't spend as much time as I could have on that part of the answer. I was personally more interested in something that compiled well with gcc on x86, where I know how an
int8_t
will sign-extend. It's very convoluted, but I think it's safe on any C implementation (sign/magnitude or one's complement), but I might be depending on an assumption about where the sign bits are when casting to unsigned, especially if it's not a type-pun. I kinda wanted something with the high bit set in the final result on out-of-range because that's slightly cheaper for a compiler to check :P \$\endgroup\$Peter Cordes– Peter Cordes2018年04月12日 15:17:06 +00:00Commented Apr 12, 2018 at 15:17 -
\$\begingroup\$ @chux: TL:DR that part isn't great well-thought-out code-review advice, it's more like leftovers from what I was playing around with. \$\endgroup\$Peter Cordes– Peter Cordes2018年04月12日 15:17:45 +00:00Commented Apr 12, 2018 at 15:17
-
1\$\begingroup\$ @chux: heh, good point about
int8_t
requiring 2's complement, I forgot I was getting that guarantee. I used it instead ofsigned char
because I wanted it to fail to compile instead of creating a gigantic initialized array, on systems like DSPs wherechar
is a 32-bit type. (So there's probably noint8_t
.) \$\endgroup\$Peter Cordes– Peter Cordes2018年04月12日 15:28:44 +00:00Commented Apr 12, 2018 at 15:28
The line
(*aminoacid_string) = malloc(aminoacid_count);
allocates
aminoacid_count
of bytes. The code needs that many pointers. The correct way is to*aminoacid_string = malloc(aminoacid_count * sizeof(**aminoacid_string));
Testing for the correct input (i.e. calling
check_nucleobase_string
) belongs to main.I recommend to translate the nucleobase string to a numeric array (mapping
t
to0
etc), and avoid ugly cascade of ternary operators.The loop
for(nucleobase_count = 0; nucleobase_string[nucleobase_count] != '0円'; nucleobase_count+=1);
is a very long (and inefficient) way to say
nucleobase_count = strlen(nucleobase_string);
I don't see the reason to
memcpy(nucleobase_string, input_string, line_length-1);
You may work directly with
input_string
.
-
\$\begingroup\$ @1201ProgramAlarm It is null terminated. The ` = '0円';` assignment kills the newline, and it can be done directly on input_string. No copy is necessary. \$\endgroup\$vnp– vnp2018年04月11日 01:49:33 +00:00Commented Apr 11, 2018 at 1:49
-
\$\begingroup\$ I did not want to work directly with
input_string
because it also contains the delimiter fromgetline
(\n). \$\endgroup\$Darius Duesentrieb– Darius Duesentrieb2018年04月11日 02:32:23 +00:00Commented Apr 11, 2018 at 2:32 -
\$\begingroup\$ @DariusDuesentrieb: but you don't need to preserve that. You can just modify
input_string
and discard the input line endings or lack thereof. If you ever want to print it out again, you can useprintf("%s\n", input_string);
or whatever to normalize the output to have every line end with a newline. \$\endgroup\$Peter Cordes– Peter Cordes2018年04月11日 03:03:09 +00:00Commented Apr 11, 2018 at 3:03 -
\$\begingroup\$ Is
malloc(foo * sizeof(bar))
really the correct way? Notcalloc(foo, sizeof(bar))
? \$\endgroup\$Peter Taylor– Peter Taylor2018年04月11日 09:58:20 +00:00Commented Apr 11, 2018 at 9:58 -
\$\begingroup\$ @PeterTaylor Depends on the intent. If the next thing you do is to fill it up with data, there is no reason to spend time initializing it with zeroes. \$\endgroup\$vnp– vnp2018年04月11日 16:11:14 +00:00Commented Apr 11, 2018 at 16:11
You should try to refactor the code to have less hard-coded constants. E.g. nucleobase_to_aminoacid has both tcag and codon_table hard-coded. That is in general something that hinders re-use.
- You should preferably have a separate function to print the codon_table - in a standard format. This would have detected that you used "Len" instead of "Leu" as abbreviation of "Leucine".
- If the decoding of tcag was a separate function - or an input - you could re-use the same function for mRNA.
- If codon_table was input to nucleobase_to_aminoacid you could re-use the same code for the odd cases with different codon_table.
In addition to current (and future) answers:
You use
malloc()
, but from what I can see, you do notfree()
it later on. In my case valgrind complains that 125 bytes in 2 blocks are "definitely lost" - meaning you've got memory leaks.It is a common notation to use
++i
instead ofi += 1
, especially in for loops.Use the
const
qualifier where applicable. While the compiler may do such optimization for you (although not as often as you would think), it is useful to indicate that something will not be changed.For example,
char *codon_table
can bechar const *const codon_table
. If you use the-Wconversion
flag when compiling with GCC, you can catch all sorts of errors to do with type conversions as well.
-
\$\begingroup\$ Agree with the value of the first
const
inchar const *const codon_table
. The 2ndconst
is unnecessary aschar* codon_table[4][4][4]
is an array and arrays cannot be assigned. \$\endgroup\$chux– chux2018年04月12日 14:21:32 +00:00Commented Apr 12, 2018 at 14:21
Not much left to review given the other good answers. Some additional ideas:
Lopping off a '\n'
Certainly OP meant to lop off the usual trailing '\n'
. The below lops off the last character from a line of input, regardless if it is a '\n'
or not.
nucleobase_string[line_length-1] = '0円'; // May lop off something else.
Consider the end of piped user input may not end with a '\n'
, so although a trailing '\n'
is very common, it is not a cerrtainty. Recommend to test for '\n'
.
if (nucleobase_string[line_length-1] == '\n') {
nucleobase_string[line_length-1] = '0円';
}
// or the one liner (slower)
nucleobase_string[strcspn(nucleobase_string, "\n")] = '0円';
House keeping
Good coding practice would free input_string
in the end. Of course, such allocations are implicitly free'd when the program ends. Yet with an explicit free()
, someone re-using this portion of code would benefit.
free(input_string);
return 0;
Use of stderr
In several parts of your program you use printf(...)
to:
- ask user to provide input
- output error messages
- output result
printf
function writes to stdout. You might consider using fprintf(stderr, ...)
for asking for user input and error log.
Exit status
Also it is common to return nonzero status on program error calling exit(EXIT_FAILURE);
or return EXIT_FAILURE;
from main()
.
This way one may use your program like:
#!/bin/sh
result="$(echo 'whatever' | ./your_program 2>/dev/null)"
if [ $? -ne 0 ]; then
# failed
fi
-
1\$\begingroup\$ @chux, wow, I was editing my answer because of this, when you replied (: Thanks anyway \$\endgroup\$sineemore– sineemore2018年04月12日 17:17:22 +00:00Commented Apr 12, 2018 at 17:17