Analyzing pair by pair DNA sequences with for loops in Python

Question 1

Here is a code I would need to make far faster: it is made to analyse more than 3000 DNA sequences of more than 100 000 characters each.

The matter is I have to compare it by pair, so that would be around 3000*2999/2 = 4498500 calculation and each takes 1 to 2 seconds...

Could you think of another way of doing it ? I've seen that there are algorithms that are used to go faster (like Boyer-Moore), but could we apply it here ? Or replace the for loops with something faster ? Any idea would be more than welcome.

import numpy as np
import time
import random
def calculate_distance(genom1, genom2): 
 # in : two strings (sequences)
 # out : score between those 2 strings
 # example : 'AATTGCAT' compared to 'AAAACGTC'
 # => 'AA' and 'AA' = 2
 # => 'TT' and 'AA' = 0
 # => 'GC' and 'CG' = 2
 # => 'AT' and 'TC' = 1
 score = 0 
 for i in range(0, len(genom1), 2):
 if (genom1[i] in genom2[i:i+2]) or (genom1[i+1] in genom2[i:i+2]):
 if sorted(genom1[i:i+2]) == sorted(genom2[i:i+2]):
 score += 2
 else :
 score += 1
 return score
def build_matrix(sequences, N):
 # in : list of lists 
 # out : matrix of scores between each pair of sequences 
 matrix = np.zeros((N, N))
 for i in range(N):
 for j in range(i, N):
 matrix[i][j] = calculate_distance(sequences[i], sequences[j]) 
 return matrix
def test(len_seq, N):
 sequences = []
 for i in range(N):
 sequences.append(''.join(random.choice(['0','A', 'T', 'C', 'G']) for x in range(len_seq)))
 start = time.clock()
 matrix = build_matrix(sequences, N)
 elapsed = time.clock() - start
 print('Run time for ' + str(N) + ' sequences of ' + str(len_seq) + ' characters : computed in ' + str(elapsed) + ' seconds')
 return matrix
test(10**6, 2)

Returns :

Run time for 2 sequences of 1000000 characters : computed in 2.742817 seconds

Question 2

How/where do you keep the data. How are you loading it?

Question 3

Maybe you should give it a try? en.wikipedia.org/wiki/Trie

Question 4

@enedil: "Maybe you should give it a trie?" – FTFY!

Question 5

I notice something small, well you know the corner cases. You know that a DNA string will yield a score of that is the same as length. Obviously ^^. So don't check those cases.

You make 4 iterations in your code. While only one is interesting.

A big improvement in runtime could be to change:

def build_matrix(sequences, N):
 # in : list of lists 
 # out : matrix of scores between each pair of sequences 
 matrix = np.zeros((N, N))
 for i in range(N):
 for j in range(i, N):
 matrix[i][j] = calculate_distance(sequences[i], sequences[j]) 
 return matrix

to this:

def build_matrix(sequences, N):
 matrix = np.zeros((N, N))
 for i in range(N):
 for j in range(i+1, N):
 matrix[i][j] = calculate_distance(sequences[i], sequences[j])
 return matrix

If you could load your data as:

def get_test_arr():
 parts = ['0', 'A', 'T', 'C', 'G']
 return ["".join(
 sorted([random.choice(parts) for _ in range(2)]))
 for _ in range(5*10**5)]

you could make them np arrays as:

a = np.array(get_test_arr(), dtype=object)
b = np.array(get_test_arr(), dtype=object)

and have your distance calculation as:

def calculate_distance1(genom1, genom2):
 score = 0
 for g1, g2 in zip(genom1, genom2):
 if (g1[0] in g2) or (g1[1] in g2):
 score += 2 if g1 == g2 else 1
 return score

In the problem you've provided each induvidual pairs order does not seem to matter, so load them as such. If you do that, you don't have to sort the in the calculation function.

If you don't, you'll resort them later. Again and again. Them more samples you use, there more time is spent doing this.

Times:

Your calculation function

0.575332 s for one loop in 100 iterations.

The my solution

0.10583 s for one loop in 100 iterations.

Question 6

could this be reduced to a list comprehension? Something like [[calculate_distance(sequences[i], sequences[j]) for i in range(N)] for j in range(i+1, N)]? That's probably not correct, but I feel like it should be possible.

Question 7

Yes indeed, my personal choice would be to not do that. But, I would go for a dict comprehension in a list comprehension.. I don't know, it's not obscure to me. While my eyes confuses brackets for list in lists comprehensions.

Question 8

Obviously this is an improvement, but doesn't it only save 1/N of the iterations, which is likely to be vanishingly small?

Question 9

No, now I realize what you meant. It save one per iteration per sample, and it saves the last one.

Question 10

But good that you pointed out that I wasn't so clear about that in my answer.

Simon Simon 1,0526 silver badges15 bronze badges · Accepted Answer · 2016-12-06 15:57:17Z

I notice something small, well you know the corner cases. You know that a DNA string will yield a score of that is the same as length. Obviously ^^. So don't check those cases.

You make 4 iterations in your code. While only one is interesting.

A big improvement in runtime could be to change:

def build_matrix(sequences, N):
 # in : list of lists 
 # out : matrix of scores between each pair of sequences 
 matrix = np.zeros((N, N))
 for i in range(N):
 for j in range(i, N):
 matrix[i][j] = calculate_distance(sequences[i], sequences[j]) 
 return matrix

to this:

def build_matrix(sequences, N):
 matrix = np.zeros((N, N))
 for i in range(N):
 for j in range(i+1, N):
 matrix[i][j] = calculate_distance(sequences[i], sequences[j])
 return matrix

If you could load your data as:

def get_test_arr():
 parts = ['0', 'A', 'T', 'C', 'G']
 return ["".join(
 sorted([random.choice(parts) for _ in range(2)]))
 for _ in range(5*10**5)]

you could make them np arrays as:

a = np.array(get_test_arr(), dtype=object)
b = np.array(get_test_arr(), dtype=object)

and have your distance calculation as:

def calculate_distance1(genom1, genom2):
 score = 0
 for g1, g2 in zip(genom1, genom2):
 if (g1[0] in g2) or (g1[1] in g2):
 score += 2 if g1 == g2 else 1
 return score

In the problem you've provided each induvidual pairs order does not seem to matter, so load them as such. If you do that, you don't have to sort the in the calculation function.

If you don't, you'll resort them later. Again and again. Them more samples you use, there more time is spent doing this.

Times:

Your calculation function

0.575332 s for one loop in 100 iterations.

The my solution

0.10583 s for one loop in 100 iterations.

could this be reduced to a list comprehension? Something like [[calculate_distance(sequences[i], sequences[j]) for i in range(N)] for j in range(i+1, N)]? That's probably not correct, but I feel like it should be possible.
Yes indeed, my personal choice would be to not do that. But, I would go for a dict comprehension in a list comprehension.. I don't know, it's not obscure to me. While my eyes confuses brackets for list in lists comprehensions.
Obviously this is an improvement, but doesn't it only save 1/N of the iterations, which is likely to be vanishingly small?
No, now I realize what you meant. It save one per iteration per sample, and it saves the last one.
But good that you pointed out that I wasn't so clear about that in my answer.

Stack Exchange Network

Analyzing pair by pair DNA sequences with for loops in Python

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

Analyzing pair by pair DNA sequences with for loops in Python

1 Answer 1

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related

Hot Network Questions