3
\$\begingroup\$

This code (an implementation of the path finding Needleman-Wunsch algorithm), given two sequences, correctly aligns and calculates the similarity of them. For example, given two sequences:

AAA,CCC

or

AA,CA

it correctly aligns them as

---AAA
CCC---

and

-AA
CA-

respectively, while returning the similarity.

% Matlab implementation of NW algorithm
% Written by Joseph Farah
% Completed 7/14/16
% Last updated 7/14/16
function similarity = needleman(s1, s2)
% defining variables
% sequence related variables
s1 = strcat('-',s1);
s2 = strcat('-',s2);
seq1 = s1;
seq2 = s2;
%seq1 = '-ACTTAGATTACTTG';
%seq2 = '-CCCACTAGATTATTAG';
fseq1 = '';
fseq2 = '';
% weights
w_s = 2;
w_indel = 1;
% temporary costs for the matrix fill
h_cost = 0;
v_cost = 0;
d_cost = 0;
% the matrices
cost_matrix = zeros(length(seq1),length(seq2));
direction_cell_matrix = cell(length(seq1),length(seq2));
%% Initiliazation and Scoring
for x = 1:length(seq1)
 for y = 1:length(seq2)
 point = [seq1(x), seq2(y)];
 if point(1) == '-' && point(2) == '-'
 cost_matrix(x,y) = 0;
 direction_cell_matrix{x,y} = 'n';
 continue
 end
 % apparently, to get it to go in the right order, moving
 % horizontally is defined by y-1 and vertically by x-1
 if point(1) == '-' % this means it is on the lip -- 
 cost_matrix(x,y) = cost_matrix(x,y-1) + w_indel;
 % the only way it can go is left!
 direction_cell_matrix{x,y} = 'h';
 continue
 end
 if point(2) == '-' % is it vertical?
 cost_matrix(x,y) = cost_matrix(x-1,y) + w_indel;
 % the only way it can go is up!
 direction_cell_matrix{x,y} = 'v';
 continue
 end
 % if none of the above conditions were met, the point must be in
 % the body of the array. First thing we have to do is calculate the
 % cost of the current point. 
 h_cost = cost_matrix(x,y-1) + w_indel;
 v_cost = cost_matrix(x-1,y) + w_indel;
 % the diagonal cost will be determined by this next bit--do the
 % characters match?
 if point(1) == point(2)
 % if they match, there is no cost!
 d_cost = cost_matrix(x-1,y-1) + 0;
 elseif point(1) ~= point(2)
 % if they are a mismatch, the cost is the subsitution weight
 d_cost = cost_matrix(x-1,y-1) + w_s;
 end
 % now lets compare them! time to put them in a list...
 min_list_cost = [h_cost, v_cost, d_cost];
 minimum = min(min_list_cost);
 % now we have a cost, and we put it in the matrix!
 cost_matrix(x,y) = minimum; 
 % one last step-- we need to calculate the optimal direction
 % to do this, we have to find the minimum of the surround paths.
 % Back to the cost variables!
 h_cost = cost_matrix(x,y-1);
 v_cost = cost_matrix(x-1,y);
 d_cost = cost_matrix(x-1,y-1);
 % its important to keep track of the order, because the index of
 % the minimum value of the list will give us the correct direction 
 % for the point!
 min_list_direction = [h_cost, v_cost, d_cost];
 % now we find the minumum AND the index. Without the index, we
 % dont know which way to go!
 % minimum = min(min_list_direction);
 % now that we have a minumum, which direction is it in?
 % we went h, v, d
 index = find(min_list_cost == minimum);
 if index(1) == 1
 direction_cell_matrix{x,y} = 'h';
 elseif index(1) == 2
 direction_cell_matrix{x,y} = 'v';
 elseif index(1) == 3
 direction_cell_matrix{x,y} = 'd';
 end
 end
end
%%Traceback
% defining variables for how much we are deviating from the end point
% the final sequence at its worst will be as long as the first sequence
% plus the second sequence. The loop cannot go on for longer than that.
x = length(seq2);
y = length(seq1);
% the coordinate variables above will be updated so that moving through the
% direction matrix will match moving through the sequences
for ii = 1:(length(seq1)+length(seq2))
 if direction_cell_matrix{y,x} == 'n'
 break
 end 
 if direction_cell_matrix{y,x} == 'h'
 fseq2 = strcat(seq2(x),fseq2);
 fseq1 = strcat('-',fseq1);
 y = y;
 x = x - 1;
 elseif direction_cell_matrix{y,x} == 'v'
 fseq2 = strcat('-',fseq2);
 fseq1 = strcat(seq1(y),fseq1);
 y = y - 1;
 x = x;
 elseif direction_cell_matrix{y,x} == 'd'
 fseq2 = strcat(seq2(x),fseq2);
 fseq1 = strcat(seq1(y),fseq1);
 y = y - 1;
 x = x - 1;
 end
end
sim_score = 0;
for ii=1:length(fseq1)
 if fseq1(ii) == fseq2(ii)
 sim_score = sim_score + 1;
 else
 continue
 end
end
similarity = sim_score/length(fseq1);

What can I do to improve:

  • Efficiency?
  • Readability?
  • the Matlab-iness of the code? (i.e where can I take advantage of tools Matlab has to offer?)
200_success
145k22 gold badges190 silver badges478 bronze badges
asked Jul 18, 2016 at 18:33
\$\endgroup\$

1 Answer 1

2
\$\begingroup\$

I'll take this from the bottom up:

The last part of the code that calculates the similarity can be vectorized by comparing all elements of the two vectors and summing the ones that are equal:

sim_score = 0;
for ii=1:length(fseq1)
 if fseq1(ii) == fseq2(ii)
 sim_score = sim_score + 1;
 else
 continue
 end
end
similarity = sim_score/length(fseq1);

Can instead be written like this:

similarity = sum(fseq1 == fseq2) / numel(fseq1);

or if you want the sim_score too:

sim_score = sum(fseq1 == fseq);
similarity = sim_score / numel(fseq1);

Note that numel is faster and more robust than length.

Timing of loop approach versus vectorized approach:

enter image description here


You are using strcat a lot. strcat is good as it's easy to read and understand that you're concatenating strings (vector of characters). However, regular concatenating with brackets [str1, str2] is faster:

Instead of fseq2 = strcat('-',fseq2); you can do fseq2 = ['-', fseq2].

Timing of strcat vs brackets:

enter image description here


You're using only the first elements of the index variable after calling find. find has a second optional parameter that specifies the number of indices you want to store, so you can do:

index = find(min_list_cost == minimum, 1);
if index == 1
 ...
if index == 2
 ...

Specifying that you only want the first index improve the runtime of up to 30%.

If you can safely assume that the equal elements appear in the start of the vectors, you're better of defining your own find_first function. This assumption is safe if you have a lot of repetitive elements, such as AACCTTATTCTTA...

A very simple function that has runtime O(1) if the first equal value appear in the beginning of the vector.

function idx = find_first(x, val)
idx = 0;
for ii = 1:numel(x)
 if x(ii) == val
 idx = ii;
 break
 end
end
end
answered Jul 23, 2016 at 8:54
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.