Needleman-Wunsch Grid Generation in Python

Question 1

The Needleman-Wunsch algorithm is a way to align sequences in a way that optimizes "similarity". Usually, a grid is generated and then you follow a path down the grid (based off the largest value) to compute the optimal alignment between two sequences. I have created a Python program, that given two strings, will create the resulting matrix for the Needleman-Wunsch algorithm. Currently, the program when ran will generate two random sequences of DNA and then print out the resulting Needleman-Wunsch matrix.

needlemanwunsch.py

import random
from tabulate import tabulate
class NeedlemanWunsch:
 """
 Class used for generating Needleman-Wunsch matrices.
 """
 def _compute_block(self, result, i, j):
 """
 Given a block (corresponding to a 2 x 2 matrix), calculate the value o-
 f the bottom right corner.
 (Based on the equation:
 M_{i,j} = max(M_{i-1,j-1} + S_{i,j},
 M_{i,j-1} + W,
 M_{i-1,j} + W)
 Found here: https://vlab.amrita.edu/?sub=3&brch=274&sim=1431&cnt=1)
 Args:
 result : The current matrix that is being computed.
 i : The right most part of the block being computed.
 j : The bottom most part of the block being computed.
 Returns:
 The value for the right bottom corner of a particular block.
 """
 return max(result[i-1][j-1] +
 self._calc_weight(self._second_seq[i-1],
 self._first_seq[j-1]),
 result[i-1][j] + self.gap,
 result[i][j-1] + self.gap)
 def _calc_weight(self, first_char, second_char):
 """
 Helper function, given two characters determines (based on the sc-
 oring scheme) what the score for the particular characters can be.
 Args:
 first_char : A character to compare.
 second_char : A character to compare.
 Returns:
 Either self.match or self.mismatch.
 """
 if first_char == second_char:
 return self.match
 else:
 return self.mismatch
 def generate(self, first_seq, second_seq):
 """
 Generates a matrix corresponding to the scores to the Needleman-Wu-
 nsch algorithm.
 Args:
 first_seq : One of the sequences to be compared for similarity.
 second_seq : One of the sequences to be compared for
 similarity.
 Returns:
 A 2D list corresponding to the resulting matrix of the Needlem-
 an-Wunsch algorithm.
 """
 # Internally requies that the first sequence is longer.
 if len(second_seq) > len(first_seq):
 first_seq, second_seq = second_seq, first_seq
 self._first_seq = first_seq
 self._second_seq = second_seq
 # Adjust sequence with "intial space"
 # Initialize the resulting matrix with the initial row.
 result = [list(range(0, -len(first_seq) - 1, -1))]
 # Create initial columns.
 for i in range(-1, -len(second_seq) - 1, -1):
 row = [i]
 row.extend([0]*len(first_seq))
 result.append(row)
 # Sweep through and compute each new cell row-wise.
 for i in range(1, len(result)):
 for j in range(1, len(result[0])):
 result[i][j] = self._compute_block(result, i, j)
 # Format for prettier printing.
 for index, letter in enumerate(second_seq):
 result[index + 1].insert(0, letter)
 result[0].insert(0, ' ')
 result.insert(0, list(" " + first_seq))
 return result
 def __init__(self, match=1, mismatch=-1, gap=-1):
 """
 Initialize the Needleman-Wunsch class so that it provides weights for
 match (default 1), mismatch (default -1), and gap (default -1).
 """
 self.match = match
 self.mismatch = mismatch
 self.gap = gap
 self._first_seq = ""
 self._second_seq = ""
def deletion(seq, pos):
 """
 Deletes a random base pair from a sequence at a specified position.
 Args:
 seq : Sequence to perform deletion on.
 pos : Location of deletion.
 Returns:
 seq with character removed at pos.
 """
 return seq[:pos] + seq[pos:]
def base_change(seq, pos):
 """
 Changes a random base pair to another base pair at a specified position.
 Args:
 seq : Sequence to perform base change on.
 pos : Locaion of base change.
 Returns:
 seq with character changed at pos.
 """
 new_base = random.choice("ACTG".replace(seq[pos], ""))
 return seq[:pos] + new_base + seq[pos:]
def mutate(seq, rounds=3):
 """
 Mutates a piece of DNA by randomly applying a deletion or base change
 Args:
 seq : The sequence to be mutated.
 rounds : Defaults to 3, the number of mutations to be made.
 Returns:
 A mutated sequence.
 """
 mutations = (deletion, base_change)
 for _ in range(rounds):
 pos = random.randrange(len(seq))
 seq = random.choice(mutations)(seq, pos)
 return seq
def main():
 """
 Creates a random couple of strings and creates the corresponding Needleman
 -Wunsch matrix associated with them.
 """
 needleman_wunsch = NeedlemanWunsch()
 first_seq = ''.join(random.choices("ACTG", k=5))
 second_seq = mutate(first_seq)
 data = needleman_wunsch.generate(first_seq, second_seq)
 print(tabulate(data, headers="firstrow"))
if __name__ == '__main__':
 main()

I ended up using a NeedlemanWunsch class, because using only function resulted in a lot DRY for the parameters match, mismatch, and gap.

I am not particularly fond of the code. I haven't used numpy or any related libraries because I couldn't see any way that it would significantly shorten the code, however, I would be willing to use numpy if there is a significantly shorter way of expressing the matrix generation. However, the code seems frail, and very prone to off by one errors.

Question 2

Code

base_change is not good name for function. It's suggest change. At wikipedia they use Indel (INsertion or DELetion) names.
first_seq and second_seq strings could be lists. In this case mutate/deletion/base_change function can do its stuff in-place
_first_seq and _second_seq changes after every call generate method. No need to cache these variables, because of they not use in future by public method of class
numpy simplify code

Design

you have two main public functionalities: mutate and generate methods. generate mess presentation layer and logic one. Generally it's not good idea. Imho better design generate (needleman_wunsch) to calc only logic (without first row and column with first_seq, second_seq). Additional method print_needleman_wunsch_matrix could add these lines if needed.

Example code (without design warning, additionally i exchange tabulate for pandas but this no needed)

import numpy as np
import pandas as pd
from random import choice, choices, randrange
def needleman_wunsch(first, second, match=1, mismatch=-1, gap=-1):
 tab = np.full((len(second) + 2, len(first) + 2), ' ', dtype=object)
 tab[0, 2:] = first
 tab[1, 1:] = list(range(0, -len(first) - 1, -1))
 tab[2:, 0] = second
 tab[1:, 1] = list(range(0, -len(second) - 1, -1))
 is_equal = {True: match, False: mismatch}
 for f in range(2, len(first) + 2):
 for s in range(2, len(second) + 2):
 tab[s, f] = max(tab[s - 1][f - 1] + is_equal[first[f - 2] == second[s - 2]],
 tab[s - 1][f] + gap,
 tab[s][f - 1] + gap)
 return tab
def mutate(seq, rounds=3):
 mutate_seq = seq.copy()
 for change in choices((deletion, insertion), k=rounds):
 pos = randrange(len(mutate_seq))
 change(mutate_seq, pos)
 return mutate_seq
def deletion(seq, idx):
 seq.pop(idx)
def insertion(seq, idx):
 seq.insert(idx, choice("ACTG".replace(seq[idx], "")))
def main():
 first_seq = choices("ACTG", k=5)
 second_seq = mutate(first_seq)
 data = needleman_wunsch(first_seq, second_seq)
 print(pd.DataFrame(data))
if __name__ == '__main__':
 main()

vaeta vaeta 8865 silver badges8 bronze badges · Answer 1 · 2018-12-22 19:01:45Z

Code

base_change is not good name for function. It's suggest change. At wikipedia they use Indel (INsertion or DELetion) names.
first_seq and second_seq strings could be lists. In this case mutate/deletion/base_change function can do its stuff in-place
_first_seq and _second_seq changes after every call generate method. No need to cache these variables, because of they not use in future by public method of class
numpy simplify code

Design

you have two main public functionalities: mutate and generate methods. generate mess presentation layer and logic one. Generally it's not good idea. Imho better design generate (needleman_wunsch) to calc only logic (without first row and column with first_seq, second_seq). Additional method print_needleman_wunsch_matrix could add these lines if needed.

Example code (without design warning, additionally i exchange tabulate for pandas but this no needed)

import numpy as np
import pandas as pd
from random import choice, choices, randrange
def needleman_wunsch(first, second, match=1, mismatch=-1, gap=-1):
 tab = np.full((len(second) + 2, len(first) + 2), ' ', dtype=object)
 tab[0, 2:] = first
 tab[1, 1:] = list(range(0, -len(first) - 1, -1))
 tab[2:, 0] = second
 tab[1:, 1] = list(range(0, -len(second) - 1, -1))
 is_equal = {True: match, False: mismatch}
 for f in range(2, len(first) + 2):
 for s in range(2, len(second) + 2):
 tab[s, f] = max(tab[s - 1][f - 1] + is_equal[first[f - 2] == second[s - 2]],
 tab[s - 1][f] + gap,
 tab[s][f - 1] + gap)
 return tab
def mutate(seq, rounds=3):
 mutate_seq = seq.copy()
 for change in choices((deletion, insertion), k=rounds):
 pos = randrange(len(mutate_seq))
 change(mutate_seq, pos)
 return mutate_seq
def deletion(seq, idx):
 seq.pop(idx)
def insertion(seq, idx):
 seq.insert(idx, choice("ACTG".replace(seq[idx], "")))
def main():
 first_seq = choices("ACTG", k=5)
 second_seq = mutate(first_seq)
 data = needleman_wunsch(first_seq, second_seq)
 print(pd.DataFrame(data))
if __name__ == '__main__':
 main()

Stack Exchange Network

Needleman-Wunsch Grid Generation in Python

needlemanwunsch.py

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Needleman-Wunsch Grid Generation in Python

needlemanwunsch.py

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions