6
\$\begingroup\$

The Needleman-Wunsch algorithm is a way to align sequences in a way that optimizes "similarity". Usually, a grid is generated and then you follow a path down the grid (based off the largest value) to compute the optimal alignment between two sequences. I have created a Python program, that given two strings, will create the resulting matrix for the Needleman-Wunsch algorithm. Currently, the program when ran will generate two random sequences of DNA and then print out the resulting Needleman-Wunsch matrix.

needlemanwunsch.py

import random
from tabulate import tabulate
class NeedlemanWunsch:
 """
 Class used for generating Needleman-Wunsch matrices.
 """
 def _compute_block(self, result, i, j):
 """
 Given a block (corresponding to a 2 x 2 matrix), calculate the value o-
 f the bottom right corner.
 (Based on the equation:
 M_{i,j} = max(M_{i-1,j-1} + S_{i,j},
 M_{i,j-1} + W,
 M_{i-1,j} + W)
 Found here: https://vlab.amrita.edu/?sub=3&brch=274&sim=1431&cnt=1)
 Args:
 result : The current matrix that is being computed.
 i : The right most part of the block being computed.
 j : The bottom most part of the block being computed.
 Returns:
 The value for the right bottom corner of a particular block.
 """
 return max(result[i-1][j-1] +
 self._calc_weight(self._second_seq[i-1],
 self._first_seq[j-1]),
 result[i-1][j] + self.gap,
 result[i][j-1] + self.gap)
 def _calc_weight(self, first_char, second_char):
 """
 Helper function, given two characters determines (based on the sc-
 oring scheme) what the score for the particular characters can be.
 Args:
 first_char : A character to compare.
 second_char : A character to compare.
 Returns:
 Either self.match or self.mismatch.
 """
 if first_char == second_char:
 return self.match
 else:
 return self.mismatch
 def generate(self, first_seq, second_seq):
 """
 Generates a matrix corresponding to the scores to the Needleman-Wu-
 nsch algorithm.
 Args:
 first_seq : One of the sequences to be compared for similarity.
 second_seq : One of the sequences to be compared for
 similarity.
 Returns:
 A 2D list corresponding to the resulting matrix of the Needlem-
 an-Wunsch algorithm.
 """
 # Internally requies that the first sequence is longer.
 if len(second_seq) > len(first_seq):
 first_seq, second_seq = second_seq, first_seq
 self._first_seq = first_seq
 self._second_seq = second_seq
 # Adjust sequence with "intial space"
 # Initialize the resulting matrix with the initial row.
 result = [list(range(0, -len(first_seq) - 1, -1))]
 # Create initial columns.
 for i in range(-1, -len(second_seq) - 1, -1):
 row = [i]
 row.extend([0]*len(first_seq))
 result.append(row)
 # Sweep through and compute each new cell row-wise.
 for i in range(1, len(result)):
 for j in range(1, len(result[0])):
 result[i][j] = self._compute_block(result, i, j)
 # Format for prettier printing.
 for index, letter in enumerate(second_seq):
 result[index + 1].insert(0, letter)
 result[0].insert(0, ' ')
 result.insert(0, list(" " + first_seq))
 return result
 def __init__(self, match=1, mismatch=-1, gap=-1):
 """
 Initialize the Needleman-Wunsch class so that it provides weights for
 match (default 1), mismatch (default -1), and gap (default -1).
 """
 self.match = match
 self.mismatch = mismatch
 self.gap = gap
 self._first_seq = ""
 self._second_seq = ""
def deletion(seq, pos):
 """
 Deletes a random base pair from a sequence at a specified position.
 Args:
 seq : Sequence to perform deletion on.
 pos : Location of deletion.
 Returns:
 seq with character removed at pos.
 """
 return seq[:pos] + seq[pos:]
def base_change(seq, pos):
 """
 Changes a random base pair to another base pair at a specified position.
 Args:
 seq : Sequence to perform base change on.
 pos : Locaion of base change.
 Returns:
 seq with character changed at pos.
 """
 new_base = random.choice("ACTG".replace(seq[pos], ""))
 return seq[:pos] + new_base + seq[pos:]
def mutate(seq, rounds=3):
 """
 Mutates a piece of DNA by randomly applying a deletion or base change
 Args:
 seq : The sequence to be mutated.
 rounds : Defaults to 3, the number of mutations to be made.
 Returns:
 A mutated sequence.
 """
 mutations = (deletion, base_change)
 for _ in range(rounds):
 pos = random.randrange(len(seq))
 seq = random.choice(mutations)(seq, pos)
 return seq
def main():
 """
 Creates a random couple of strings and creates the corresponding Needleman
 -Wunsch matrix associated with them.
 """
 needleman_wunsch = NeedlemanWunsch()
 first_seq = ''.join(random.choices("ACTG", k=5))
 second_seq = mutate(first_seq)
 data = needleman_wunsch.generate(first_seq, second_seq)
 print(tabulate(data, headers="firstrow"))
if __name__ == '__main__':
 main()

I ended up using a NeedlemanWunsch class, because using only function resulted in a lot DRY for the parameters match, mismatch, and gap.

I am not particularly fond of the code. I haven't used numpy or any related libraries because I couldn't see any way that it would significantly shorten the code, however, I would be willing to use numpy if there is a significantly shorter way of expressing the matrix generation. However, the code seems frail, and very prone to off by one errors.

asked Dec 21, 2018 at 5:59
\$\endgroup\$

1 Answer 1

1
\$\begingroup\$

Code

  • base_change is not good name for function. It's suggest change. At wikipedia they use Indel (INsertion or DELetion) names.
  • first_seq and second_seq strings could be lists. In this case mutate/deletion/base_change function can do its stuff in-place
  • _first_seq and _second_seq changes after every call generate method. No need to cache these variables, because of they not use in future by public method of class
  • numpy simplify code

Design

  • you have two main public functionalities: mutate and generate methods. generate mess presentation layer and logic one. Generally it's not good idea. Imho better design generate (needleman_wunsch) to calc only logic (without first row and column with first_seq, second_seq). Additional method print_needleman_wunsch_matrix could add these lines if needed.

Example code (without design warning, additionally i exchange tabulate for pandas but this no needed)

import numpy as np
import pandas as pd
from random import choice, choices, randrange
def needleman_wunsch(first, second, match=1, mismatch=-1, gap=-1):
 tab = np.full((len(second) + 2, len(first) + 2), ' ', dtype=object)
 tab[0, 2:] = first
 tab[1, 1:] = list(range(0, -len(first) - 1, -1))
 tab[2:, 0] = second
 tab[1:, 1] = list(range(0, -len(second) - 1, -1))
 is_equal = {True: match, False: mismatch}
 for f in range(2, len(first) + 2):
 for s in range(2, len(second) + 2):
 tab[s, f] = max(tab[s - 1][f - 1] + is_equal[first[f - 2] == second[s - 2]],
 tab[s - 1][f] + gap,
 tab[s][f - 1] + gap)
 return tab
def mutate(seq, rounds=3):
 mutate_seq = seq.copy()
 for change in choices((deletion, insertion), k=rounds):
 pos = randrange(len(mutate_seq))
 change(mutate_seq, pos)
 return mutate_seq
def deletion(seq, idx):
 seq.pop(idx)
def insertion(seq, idx):
 seq.insert(idx, choice("ACTG".replace(seq[idx], "")))
def main():
 first_seq = choices("ACTG", k=5)
 second_seq = mutate(first_seq)
 data = needleman_wunsch(first_seq, second_seq)
 print(pd.DataFrame(data))
if __name__ == '__main__':
 main()
answered Dec 22, 2018 at 19:01
\$\endgroup\$

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.