Matlab implementation of Needleman-Wunsch algorithm

Question 1

This code (an implementation of the path finding Needleman-Wunsch algorithm), given two sequences, correctly aligns and calculates the similarity of them. For example, given two sequences:

AAA,CCC

or

AA,CA

it correctly aligns them as

---AAA
CCC---

and

-AA
CA-

respectively, while returning the similarity.

% Matlab implementation of NW algorithm
% Written by Joseph Farah
% Completed 7/14/16
% Last updated 7/14/16
function similarity = needleman(s1, s2)
% defining variables
% sequence related variables
s1 = strcat('-',s1);
s2 = strcat('-',s2);
seq1 = s1;
seq2 = s2;
%seq1 = '-ACTTAGATTACTTG';
%seq2 = '-CCCACTAGATTATTAG';
fseq1 = '';
fseq2 = '';
% weights
w_s = 2;
w_indel = 1;
% temporary costs for the matrix fill
h_cost = 0;
v_cost = 0;
d_cost = 0;
% the matrices
cost_matrix = zeros(length(seq1),length(seq2));
direction_cell_matrix = cell(length(seq1),length(seq2));
%% Initiliazation and Scoring
for x = 1:length(seq1)
 for y = 1:length(seq2)
 point = [seq1(x), seq2(y)];
 if point(1) == '-' && point(2) == '-'
 cost_matrix(x,y) = 0;
 direction_cell_matrix{x,y} = 'n';
 continue
 end
 % apparently, to get it to go in the right order, moving
 % horizontally is defined by y-1 and vertically by x-1
 if point(1) == '-' % this means it is on the lip -- 
 cost_matrix(x,y) = cost_matrix(x,y-1) + w_indel;
 % the only way it can go is left!
 direction_cell_matrix{x,y} = 'h';
 continue
 end
 if point(2) == '-' % is it vertical?
 cost_matrix(x,y) = cost_matrix(x-1,y) + w_indel;
 % the only way it can go is up!
 direction_cell_matrix{x,y} = 'v';
 continue
 end
 % if none of the above conditions were met, the point must be in
 % the body of the array. First thing we have to do is calculate the
 % cost of the current point. 
 h_cost = cost_matrix(x,y-1) + w_indel;
 v_cost = cost_matrix(x-1,y) + w_indel;
 % the diagonal cost will be determined by this next bit--do the
 % characters match?
 if point(1) == point(2)
 % if they match, there is no cost!
 d_cost = cost_matrix(x-1,y-1) + 0;
 elseif point(1) ~= point(2)
 % if they are a mismatch, the cost is the subsitution weight
 d_cost = cost_matrix(x-1,y-1) + w_s;
 end
 % now lets compare them! time to put them in a list...
 min_list_cost = [h_cost, v_cost, d_cost];
 minimum = min(min_list_cost);
 % now we have a cost, and we put it in the matrix!
 cost_matrix(x,y) = minimum; 
 % one last step-- we need to calculate the optimal direction
 % to do this, we have to find the minimum of the surround paths.
 % Back to the cost variables!
 h_cost = cost_matrix(x,y-1);
 v_cost = cost_matrix(x-1,y);
 d_cost = cost_matrix(x-1,y-1);
 % its important to keep track of the order, because the index of
 % the minimum value of the list will give us the correct direction 
 % for the point!
 min_list_direction = [h_cost, v_cost, d_cost];
 % now we find the minumum AND the index. Without the index, we
 % dont know which way to go!
 % minimum = min(min_list_direction);
 % now that we have a minumum, which direction is it in?
 % we went h, v, d
 index = find(min_list_cost == minimum);
 if index(1) == 1
 direction_cell_matrix{x,y} = 'h';
 elseif index(1) == 2
 direction_cell_matrix{x,y} = 'v';
 elseif index(1) == 3
 direction_cell_matrix{x,y} = 'd';
 end
 end
end
%%Traceback
% defining variables for how much we are deviating from the end point
% the final sequence at its worst will be as long as the first sequence
% plus the second sequence. The loop cannot go on for longer than that.
x = length(seq2);
y = length(seq1);
% the coordinate variables above will be updated so that moving through the
% direction matrix will match moving through the sequences
for ii = 1:(length(seq1)+length(seq2))
 if direction_cell_matrix{y,x} == 'n'
 break
 end 
 if direction_cell_matrix{y,x} == 'h'
 fseq2 = strcat(seq2(x),fseq2);
 fseq1 = strcat('-',fseq1);
 y = y;
 x = x - 1;
 elseif direction_cell_matrix{y,x} == 'v'
 fseq2 = strcat('-',fseq2);
 fseq1 = strcat(seq1(y),fseq1);
 y = y - 1;
 x = x;
 elseif direction_cell_matrix{y,x} == 'd'
 fseq2 = strcat(seq2(x),fseq2);
 fseq1 = strcat(seq1(y),fseq1);
 y = y - 1;
 x = x - 1;
 end
end
sim_score = 0;
for ii=1:length(fseq1)
 if fseq1(ii) == fseq2(ii)
 sim_score = sim_score + 1;
 else
 continue
 end
end
similarity = sim_score/length(fseq1);

What can I do to improve:

Efficiency?
Readability?
the Matlab-iness of the code? (i.e where can I take advantage of tools Matlab has to offer?)

Question 2

I'll take this from the bottom up:

The last part of the code that calculates the similarity can be vectorized by comparing all elements of the two vectors and summing the ones that are equal:

sim_score = 0;
for ii=1:length(fseq1)
 if fseq1(ii) == fseq2(ii)
 sim_score = sim_score + 1;
 else
 continue
 end
end
similarity = sim_score/length(fseq1);

Can instead be written like this:

similarity = sum(fseq1 == fseq2) / numel(fseq1);

or if you want the sim_score too:

sim_score = sum(fseq1 == fseq);
similarity = sim_score / numel(fseq1);

Note that numel is faster and more robust than length.

Timing of loop approach versus vectorized approach:

enter image description here

You are using strcat a lot. strcat is good as it's easy to read and understand that you're concatenating strings (vector of characters). However, regular concatenating with brackets [str1, str2] is faster:

Instead of fseq2 = strcat('-',fseq2); you can do fseq2 = ['-', fseq2].

Timing of strcat vs brackets:

enter image description here

You're using only the first elements of the index variable after calling find. find has a second optional parameter that specifies the number of indices you want to store, so you can do:

index = find(min_list_cost == minimum, 1);
if index == 1
 ...
if index == 2
 ...

Specifying that you only want the first index improve the runtime of up to 30%.

If you can safely assume that the equal elements appear in the start of the vectors, you're better of defining your own find_first function. This assumption is safe if you have a lot of repetitive elements, such as AACCTTATTCTTA...

A very simple function that has runtime O(1) if the first equal value appear in the beginning of the vector.

function idx = find_first(x, val)
idx = 0;
for ii = 1:numel(x)
 if x(ii) == val
 idx = ii;
 break
 end
end
end

Stewie Griffin Stewie Griffin 2,0771 gold badge18 silver badges34 bronze badges · Accepted Answer · 2016-07-23 08:54:29Z

I'll take this from the bottom up:

The last part of the code that calculates the similarity can be vectorized by comparing all elements of the two vectors and summing the ones that are equal:

sim_score = 0;
for ii=1:length(fseq1)
 if fseq1(ii) == fseq2(ii)
 sim_score = sim_score + 1;
 else
 continue
 end
end
similarity = sim_score/length(fseq1);

Can instead be written like this:

similarity = sum(fseq1 == fseq2) / numel(fseq1);

or if you want the sim_score too:

sim_score = sum(fseq1 == fseq);
similarity = sim_score / numel(fseq1);

Note that numel is faster and more robust than length.

Timing of loop approach versus vectorized approach:

enter image description here

You are using strcat a lot. strcat is good as it's easy to read and understand that you're concatenating strings (vector of characters). However, regular concatenating with brackets [str1, str2] is faster:

Instead of fseq2 = strcat('-',fseq2); you can do fseq2 = ['-', fseq2].

Timing of strcat vs brackets:

enter image description here

You're using only the first elements of the index variable after calling find. find has a second optional parameter that specifies the number of indices you want to store, so you can do:

index = find(min_list_cost == minimum, 1);
if index == 1
 ...
if index == 2
 ...

Specifying that you only want the first index improve the runtime of up to 30%.

If you can safely assume that the equal elements appear in the start of the vectors, you're better of defining your own find_first function. This assumption is safe if you have a lot of repetitive elements, such as AACCTTATTCTTA...

A very simple function that has runtime O(1) if the first equal value appear in the beginning of the vector.

function idx = find_first(x, val)
idx = 0;
for ii = 1:numel(x)
 if x(ii) == val
 idx = ii;
 break
 end
end
end

Stack Exchange Network

Matlab implementation of Needleman-Wunsch algorithm

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Matlab implementation of Needleman-Wunsch algorithm

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions