3
\$\begingroup\$

Here is a code I would need to make far faster: it is made to analyse more than 3000 DNA sequences of more than 100 000 characters each.

The matter is I have to compare it by pair, so that would be around 3000*2999/2 = 4498500 calculation and each takes 1 to 2 seconds...

Could you think of another way of doing it ? I've seen that there are algorithms that are used to go faster (like Boyer-Moore), but could we apply it here ? Or replace the for loops with something faster ? Any idea would be more than welcome.

import numpy as np
import time
import random
def calculate_distance(genom1, genom2): 
 # in : two strings (sequences)
 # out : score between those 2 strings
 # example : 'AATTGCAT' compared to 'AAAACGTC'
 # => 'AA' and 'AA' = 2
 # => 'TT' and 'AA' = 0
 # => 'GC' and 'CG' = 2
 # => 'AT' and 'TC' = 1
 score = 0 
 for i in range(0, len(genom1), 2):
 if (genom1[i] in genom2[i:i+2]) or (genom1[i+1] in genom2[i:i+2]):
 if sorted(genom1[i:i+2]) == sorted(genom2[i:i+2]):
 score += 2
 else :
 score += 1
 return score
def build_matrix(sequences, N):
 # in : list of lists 
 # out : matrix of scores between each pair of sequences 
 matrix = np.zeros((N, N))
 for i in range(N):
 for j in range(i, N):
 matrix[i][j] = calculate_distance(sequences[i], sequences[j]) 
 return matrix
def test(len_seq, N):
 sequences = []
 for i in range(N):
 sequences.append(''.join(random.choice(['0','A', 'T', 'C', 'G']) for x in range(len_seq)))
 start = time.clock()
 matrix = build_matrix(sequences, N)
 elapsed = time.clock() - start
 print('Run time for ' + str(N) + ' sequences of ' + str(len_seq) + ' characters : computed in ' + str(elapsed) + ' seconds')
 return matrix
test(10**6, 2)

Returns :

Run time for 2 sequences of 1000000 characters : computed in 2.742817 seconds
200_success
146k22 gold badges190 silver badges479 bronze badges
asked Dec 6, 2016 at 14:44
\$\endgroup\$
3
  • 1
    \$\begingroup\$ How/where do you keep the data. How are you loading it? \$\endgroup\$ Commented Dec 6, 2016 at 15:38
  • \$\begingroup\$ Maybe you should give it a try? en.wikipedia.org/wiki/Trie \$\endgroup\$ Commented Dec 6, 2016 at 16:54
  • \$\begingroup\$ @enedil: "Maybe you should give it a trie?" – FTFY! \$\endgroup\$ Commented Dec 11, 2016 at 14:28

1 Answer 1

1
\$\begingroup\$

I notice something small, well you know the corner cases. You know that a DNA string will yield a score of that is the same as length. Obviously ^^. So don't check those cases.

You make 4 iterations in your code. While only one is interesting.

A big improvement in runtime could be to change:

def build_matrix(sequences, N):
 # in : list of lists 
 # out : matrix of scores between each pair of sequences 
 matrix = np.zeros((N, N))
 for i in range(N):
 for j in range(i, N):
 matrix[i][j] = calculate_distance(sequences[i], sequences[j]) 
 return matrix

to this:

def build_matrix(sequences, N):
 matrix = np.zeros((N, N))
 for i in range(N):
 for j in range(i+1, N):
 matrix[i][j] = calculate_distance(sequences[i], sequences[j])
 return matrix

If you could load your data as:

def get_test_arr():
 parts = ['0', 'A', 'T', 'C', 'G']
 return ["".join(
 sorted([random.choice(parts) for _ in range(2)]))
 for _ in range(5*10**5)]

you could make them np arrays as:

a = np.array(get_test_arr(), dtype=object)
b = np.array(get_test_arr(), dtype=object)

and have your distance calculation as:

def calculate_distance1(genom1, genom2):
 score = 0
 for g1, g2 in zip(genom1, genom2):
 if (g1[0] in g2) or (g1[1] in g2):
 score += 2 if g1 == g2 else 1
 return score

In the problem you've provided each induvidual pairs order does not seem to matter, so load them as such. If you do that, you don't have to sort the in the calculation function.

If you don't, you'll resort them later. Again and again. Them more samples you use, there more time is spent doing this.

Times:


Your calculation function

0.575332 s for one loop in 100 iterations.

The my solution

0.10583 s for one loop in 100 iterations.

answered Dec 6, 2016 at 15:57
\$\endgroup\$
5
  • \$\begingroup\$ could this be reduced to a list comprehension? Something like [[calculate_distance(sequences[i], sequences[j]) for i in range(N)] for j in range(i+1, N)]? That's probably not correct, but I feel like it should be possible. \$\endgroup\$ Commented Dec 6, 2016 at 16:04
  • \$\begingroup\$ Yes indeed, my personal choice would be to not do that. But, I would go for a dict comprehension in a list comprehension.. I don't know, it's not obscure to me. While my eyes confuses brackets for list in lists comprehensions. \$\endgroup\$ Commented Dec 6, 2016 at 16:09
  • \$\begingroup\$ Obviously this is an improvement, but doesn't it only save 1/N of the iterations, which is likely to be vanishingly small? \$\endgroup\$ Commented Dec 6, 2016 at 16:22
  • \$\begingroup\$ No, now I realize what you meant. It save one per iteration per sample, and it saves the last one. \$\endgroup\$ Commented Dec 6, 2016 at 16:42
  • \$\begingroup\$ But good that you pointed out that I wasn't so clear about that in my answer. \$\endgroup\$ Commented Dec 6, 2016 at 16:47

Your Answer

Draft saved
Draft discarded

Sign up or log in

Sign up using Google
Sign up using Email and Password

Post as a guest

Required, but never shown

Post as a guest

Required, but never shown

By clicking "Post Your Answer", you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.