
Given a set of discrete locations (e.g. "sites") that are pairwise related in some categorical way (e.g. general proximity) and that each carry local-level data (e.g. population size), I want to efficiently compute the mean correlation coefficient between the local-level data of all pairs of locations that share the same relationship category.

For example, I assumed 100 sites and randomized their pairwise relations using category values 1 to 25, yielding the upper-triangular matrix relations:

import numpy as np

sites = 100
categ = 25
relations = np.random.randint(low=1, high=categ+1, size=(sites, sites))
relations = np.triu(relations)   # keep only the upper triangle, since relation_ij = relation_ji
np.fill_diagonal(relations, 0)   # ignore self-relations

I also have 5000 replicates of simulation results for each site:

sims = 5000
res = np.round(np.random.rand(sites, sims),1)

To compute the mean pairwise correlation for each relation category, I first calculate, for each category i, the correlation coefficient rho[j] between the simulation results res of each unique site pair j in that category, and then take the average over all such pairs:

rho_list = np.ones(categ) * 99
for i in range(1, categ+1):
    idr = np.transpose(np.where(relations == i))  # pairwise site indices of the same relation category
    comp = np.vstack([res[idr[:, 0]].ravel(), res[idr[:, 1]].ravel()])  # pairwise comparisons of simulation results from the same relation category
    comp_uniq = np.reshape(comp.T, (len(idr), res.shape[1], -1))  # reshape above into pairwise comparisons of simulation results between unique site pairs
    rho = np.ones(len(idr)) * 99  # correlation coefficients of all unique site pairs of current relation category
    for j in range(len(idr)):  # loop through unique site pairs
        comp_uniq_s = comp_uniq[j][np.all(comp_uniq != 0, axis=2)[j]].T  # shorten comparisons by removing pairs with a zero-valued result
        rho[j] = np.corrcoef(comp_uniq_s[0], comp_uniq_s[1])[0, 1]
    rho_list[i-1] = np.nanmean(rho)

Although this script works, once I increase sites to 400 the entire computation can take more than 6 hours to finish, which leads me to question my use of array functions. What is the reason for this poor performance? And how can I optimize the algorithm?

asked Jul 31, 2017 at 17:52

1 Answer


What is the reason for this poor performance? And how can I optimize the algorithm?

You mentioned that sites=100 is a "fast" run and sites=400 is a "slow" six-hour run, but you didn't post profiling data. I don't know what "fast" is, but if it's around 22 minutes then you're just looking at quadratic growth of your data: going from 100 to 400 sites multiplies the number of site pairs by roughly 16, and 16 × 22 minutes ≈ 6 hours. If "fast" was noticeably less than 22 minutes, then I'd speculate about your cache size: perhaps the larger problem blew out your L3 (or L2) last-level cache. That would suggest looking for a divide-and-conquer approach on overlapping neighborhoods, so that each sub-problem is small enough to fit within cache.
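
One way to tell which of those two situations you're in is to time a few smaller problem sizes and see how the runtime scales. The sketch below assumes a hypothetical wrapper, compute_rho_list(), that re-parameterizes the question's loop by problem size; it's meant only for the scaling check, not as a drop-in replacement.

import time
import numpy as np

def compute_rho_list(sites, categ=25, sims=5000):
    # Hypothetical wrapper: a simplified re-parameterization of the question's loop.
    relations = np.random.randint(1, categ + 1, size=(sites, sites))
    relations = np.triu(relations)
    np.fill_diagonal(relations, 0)
    res = np.round(np.random.rand(sites, sims), 1)
    rho_list = np.full(categ, np.nan)
    for i in range(1, categ + 1):
        idr = np.transpose(np.where(relations == i))
        rho = np.empty(len(idr))
        for j, (a, b) in enumerate(idr):
            mask = (res[a] != 0) & (res[b] != 0)  # drop replicates where either result is zero
            rho[j] = np.corrcoef(res[a][mask], res[b][mask])[0, 1]
        rho_list[i - 1] = np.nanmean(rho)
    return rho_list

for n in (50, 100, 200):
    start = time.perf_counter()
    compute_rho_list(n, sims=1000)    # fewer replicates just to keep the check quick
    elapsed = time.perf_counter() - start
    print(f"sites={n}: {elapsed:.1f} s")

If the times roughly quadruple each time you double sites, you're simply seeing the quadratic growth in the number of pairs; a sharper-than-quadratic jump at the larger sizes would point more toward cache effects.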

Other than that, perhaps you can "cheat" by changing the problem and approximating the results. Consider a fixed budget of 10^4 pairs for which you're willing to compute correlation coefficients. In the sites=400 case, could you rank-order the pairs so that the ones with negligible contribution sort to the bottom of the list, outside the budget? Then they would simply never be considered.
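
As a minimal sketch of what such a budget might look like, with a random subsample standing in for a real ranking criterion (which isn't specified here), and reusing relations, res, and categ from the question:

import numpy as np

budget = 10_000               # total number of pairs you're willing to correlate
per_categ = budget // categ   # naive even split across categories

rho_list = np.full(categ, np.nan)
for i in range(1, categ + 1):
    idr = np.transpose(np.where(relations == i))  # all site pairs in category i
    if len(idr) > per_categ:
        # Random subsample as a stand-in for ranking pairs by expected contribution.
        keep = np.random.choice(len(idr), size=per_categ, replace=False)
        idr = idr[keep]
    rho = np.empty(len(idr))
    for j, (a, b) in enumerate(idr):
        mask = (res[a] != 0) & (res[b] != 0)      # drop replicates with zero-valued results
        rho[j] = np.corrcoef(res[a][mask], res[b][mask])[0, 1]
    rho_list[i - 1] = np.nanmean(rho)

Whether the resulting averages are close enough to the exact ones depends on how many pairs each category actually has, so it's worth spot-checking against the full computation at a size you can still afford.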

answered Aug 5, 2017 at 4:44
