Merging Overlapping DNA segments into 1 complete segment

Question 1

Did this challenge for an employer (not under an NDA or anything) for a junior python-related position and wasn't selected. Unfortunately they couldn't offer feedback, and as this was my first stab at doing something in python, I really have no idea if maybe I missed something and it's not totally correct, or maybe it doesn't follow best practices in concern to Python. My overall runtime did end up being \$\mathcal{O}(n^2)\$ but couldn't think of a way to improve it. Anyways, I would appreciate any sort of feedback.

Challenge:

The input is at most 50 DNA sequences (i.e, the character set is limited to T/C/G/A) whose length does not exceed 1000 characters.

The sequences are given in FASTA format. These sequences are all different fragments of one chromosome.

The specific set of sequences you will get satisfy a very unique property: there exists a unique way to reconstruct the entire chromosome from these reads by gluing together pairs of reads that overlap by more than half their length. An example set of input strings is attached.

The output of your program should be this unique sequence that contains each of the given input strings as a substring.

My overall approach (assuming no redundant DNA segments) is to start off with a random sequence A and find other sequences containing A[0-X] that also overlap with sequence A. As I overlap, A[0-X] will keep changing. Once A[0-X] stops changing, I switch over to the other side of A and do the same thing.

Pseudocode of algorithm:

Start off with a random sequence A and find other sequences containing A[0-X]. 
Go through the list of sequences one by one 
 If a sequence B contains substring A[0-X]
 Check for an overlap on the left end of sequence B
 if there's an overlap combine the sequences
 Re-assign new A[0-X]
 A[0-X] hasn't changed? switch to other end of the sequence (looking at A[length-X, length])

At this point (built off of two ends as far as I can go) I know I have the entire piece, so leftover segments would just be enveloped completely.

Main function:

X = 10 # the length we're testing
def main(): 
 # prompt user to open file 
 filename = input("Enter filename with same format as example.txt\n")
 # try to open file and read it's contents
 sequences = readFile(filename) 
 # if we succeeded
 if sequences != None:
 # build off of one sequence
 checkSequence = sequences[0]
 del sequences[0] # delete it from list 
 # build off of left end of checkSequence
 checkSequence, sequences = buildAlgorithm(checkSequence, sequences, 0) # print(len(sequences))
 # build off of right end of checkSequence
 checkSequence, sequences = buildAlgorithm(checkSequence, sequences, 1) # print(len(sequences))
 # write final sequence to file 
 fileCheckSequence = open('output.txt', 'w')
 fileCheckSequence.write(checkSequence)
 fileCheckSequence.close()
 # write leftovers to file as well
 fileCheckList = open('leftovers.txt', 'w')
 for val in sequences:
 fileCheckList.write(val)
 fileCheckList.write("\n")
 fileCheckSequence.close()
 sys.exit(0)

Main work:

"""
Facilitates the search and adding on of sequences
Input:
 checkSequence: Total sequence so far
 sequences: list of sequences not used yet
 flag: 0-> build off right end 1->build off left end
Output: 
 checkSequence, sequences 
"""
def buildAlgorithm(checkSequence, sequences, flag):
 if flag == 0:
 keySequence = checkSequence[:X] # grab first 10 characters
 else:
 keySequence = checkSequence[-X:] # grab last 10 characters
 # reverse for convenience, puts search value at 0 
 reverseKey = keySequence[::-1] # extended slice O(len(slice))
 idx = 0 # counter for list of sequences
 i = X-1 # counter for each individual sequence
 # Go through and check for a matching sequence
 while idx < len(sequences):
 val = sequences[idx]
 while i < len(val): # Not Boyer Moore, but has the whole word shift , grows O(n)
 # start comparing at the Xth letter of sequence
 cs = val[i] 
 ks = reverseKey[0] # end value of keySequence
 # compare two letters
 if cs == ks: 
 # if they match, compare entire chunks
 if compareTwoSequences(val, reverseKey, i): # grows O(n)
 beforeSequence = keySequence
 """
 double check they can fit together, then compare
 right or left chunk depending on if we're building 
 off right or left ends
 """
 if flag == 0: # compare right part of sequence
 if compareOverlap(checkSequence, val, i, 0): # grows O(n)
 checkSequence = overlapSequences(checkSequence, val, i, 0)
 keySequence = checkSequence[:X] # new keySequence
 del sequences[idx] # delete that sequence
 idx = 0 
 # it means we didn't accomplish anything, we're done
 if beforeSequence == keySequence:
 return checkSequence, sequences
 else:
 reverseKey = keySequence[::-1]
 else: # compare left part of sequence
 if compareOverlap(checkSequence, val, i, 1): # grows O(n)
 checkSequence = overlapSequences(checkSequence, val, i, 1)
 keySequence = checkSequence[-X:] # new keySequence
 del sequences[idx] # delete that sequence
 idx = 0 # start idx back to 0
 # don't need to do it this time around
 reverseKey = keySequence[::-1] 
 else:
 # if not, check if cs is in subsequence at all 
 if not checkCs(reverseKey, cs):
 i+=10 # skip over 10 spaces
 i+=1
 i = X-1 # reset position of i
 idx+=1 # evaluate next sequence 
 return checkSequence, sequences

The rest of the functions:

"""
Checks whether a character exists within a given sequence
Input: 
 keySequence: given sequence
 cs: character to search for 
Output: 
 True (character exists within sequence), False (character DNE within sequence)
"""
def checkCs(keySequence, cs):
 idx = 0
 while idx < len(keySequence):
 if cs == keySequence[idx]:
 return True
 idx+=1
 #if we got here, cs isn't in keySequence
 return False
"""
Checks whether a given sequence exists in a longer sequence 
Input: 
 checkSequence: longer sequence
 reverseKey: short sequence who's length dictates X
 i: index from where to start on longer sequence
Output: 
 True (they match), False (they do not match)
"""
def compareTwoSequences(checkSequence, reverseKey, i):
 cs = checkSequence[i]
 ks = reverseKey[0]
 idx = 0
 # compare starting from i and 0
 while idx < len(reverseKey):
 cs = checkSequence[i]
 ks = reverseKey[idx]
 if cs != ks:
 return False
 else:
 idx+=1
 i-=1
 # if we made it here, they match
 return True
"""
Checks if a beginning or end chunk of a shorter sequence exists in a longer sequence
Input: 
 checkSequence: longer sequence
 valSequence: shorter sequence
 i: index of starting position for shorter sequence
 flag: 0 (check beginning), 1 (check end)
Output:
 True (they math), False (they do not match)
"""
def compareOverlap(checkSequence, valSequence, i, flag):
 idx = i
 # compare forward
 if flag == 0:
 idx1 = X-1
 while idx < len(valSequence):
 if valSequence[idx] != checkSequence[idx1]:
 return False
 idx+=1
 idx1+=1 
 # compare backward
 else:
 idx1 = len(checkSequence) - 1
 while idx > 0:
 if valSequence[idx] != checkSequence[idx1]:
 return False
 idx-=1
 idx1-=1
 return True
"""
Combined two sequences where they overlap
Input: 
 checkSequence: original sequence 
 val: smaller sequence that will be added to checkSequence
 i: index of val, where overlap starts
 flag: 0 (overlap forwards), 1 (overlap backwards)
Output: 
 new compiled sequence
"""
def overlapSequences(checkSequence, val, i, flag):
 sequenceChunk = ""
 # overlap forward
 if flag == 0:
 idx = 0
 while idx < i-9:
 sequenceChunk = sequenceChunk + val[idx]
 idx+=1
 checkSequence = sequenceChunk + checkSequence
 # overlap backwards
 else:
 idx = i+1
 while idx < len(val):
 sequenceChunk = sequenceChunk + val[idx]
 idx+=1
 checkSequence = checkSequence + sequenceChunk
 return checkSequence
"""
Reads in file to extract sequences and input them into a list
Input: filename
Output: list of sequences or error
"""
def readFile(filename):
 try:
 in_file = open(filename + ".txt", "r")
 print ("File opened successfully\n")
 line = in_file.readline() #read in first line of file
 sequenceBlock = "" # container for sequence
 sequences = [] # list for sequence blocks, 50 of them this time around
 while line:
 # if we hit >Rosalind_#### line
 if line[0] == ">":
 # make sure there's something to store
 if sequenceBlock:
 # print("sequence block: " + sequenceBlock + "\n")
 sequences.append(sequenceBlock) 
 sequenceBlock = "" # reset block
 # else just add it to the sequence block
 else:
 sequenceBlock = sequenceBlock + line
 # get the next line
 line = in_file.readline().rstrip() # get rid of any trailing whitespaces
 # add in the last sequence
 sequences.append(sequenceBlock)
 # when we're done, return this list
 return sequences
 in_file.close()
 except IOError:
 print("Cannot open file! See example.txt\n")
 return None

Question 2

I think primarily refactoring and readability are the issues here. It's quite hard to read what's going on here. In part, that's because you're doing a complex algorithm, but there's a lot more you could be doing to help people follow what your code does.

You have excessively long functions. buildAlgorithm is the entire algorithm in one function. Ideally, each function would have one specific job to do. Even if each function you write is only used once, it's worthwhile. Having a collection of smaller functions allows you to test easier, make changes easier and extend functionality.

However you definitely have repeat code. Your sequence comparisons should be a function. This isn't the right syntax, but you should refactor to something like this:

def compare():
 checkSequence = overlapSequences(checkSequence, val, i, 0)
 keySequence = checkSequence[:X] # new keySequence
 del sequences[idx] # delete that sequence
 idx = 0 
 # it means we didn't accomplish anything, we're done
 if beforeSequence == keySequence:
 return checkSequence, sequences
 else:
 reverseKey = keySequence[::-1]

Which could then be called like this:

if flag == 0: # compare right part of sequence
 if compareOverlap(checkSequence, val, i, 0): # grows O(n)
 compare()
else:
 if compareOverlap(checkSequence, val, i, 1): # grows O(n)
 compare(reverse=True)

This makes code much more readable too. You tend to use a lot of comments, but good function names and refactoring would help readability much more. As would good variable names. Part of why I didn't try to make your function valid is I would've needed to try discern what val, i, X and idx were. All names that mean practically the same thing, and tell me nothing about their use.

X = 10 # the length we're testing

Should be

TEST_LENGTH = 10

Now you don't need a comment, and every time this value is used you'll have a clearer understanding of what it's meant to be.

score 2 · Answer 1 · 2016-08-10 10:12:40Z

I think primarily refactoring and readability are the issues here. It's quite hard to read what's going on here. In part, that's because you're doing a complex algorithm, but there's a lot more you could be doing to help people follow what your code does.

You have excessively long functions. buildAlgorithm is the entire algorithm in one function. Ideally, each function would have one specific job to do. Even if each function you write is only used once, it's worthwhile. Having a collection of smaller functions allows you to test easier, make changes easier and extend functionality.

However you definitely have repeat code. Your sequence comparisons should be a function. This isn't the right syntax, but you should refactor to something like this:

def compare():
 checkSequence = overlapSequences(checkSequence, val, i, 0)
 keySequence = checkSequence[:X] # new keySequence
 del sequences[idx] # delete that sequence
 idx = 0 
 # it means we didn't accomplish anything, we're done
 if beforeSequence == keySequence:
 return checkSequence, sequences
 else:
 reverseKey = keySequence[::-1]

Which could then be called like this:

if flag == 0: # compare right part of sequence
 if compareOverlap(checkSequence, val, i, 0): # grows O(n)
 compare()
else:
 if compareOverlap(checkSequence, val, i, 1): # grows O(n)
 compare(reverse=True)

This makes code much more readable too. You tend to use a lot of comments, but good function names and refactoring would help readability much more. As would good variable names. Part of why I didn't try to make your function valid is I would've needed to try discern what val, i, X and idx were. All names that mean practically the same thing, and tell me nothing about their use.

X = 10 # the length we're testing

Should be

TEST_LENGTH = 10

Now you don't need a comment, and every time this value is used you'll have a clearer understanding of what it's meant to be.

Stack Exchange Network

Merging Overlapping DNA segments into 1 complete segment

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Merging Overlapping DNA segments into 1 complete segment

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions