
Given a set of discrete locations (e.g. "sites") that are pairwise related in some categorical way (e.g. general proximity) and that each carry local-level data (e.g. population size), I want to efficiently compute the mean correlation coefficient between the local-level data of all pairs of locations that share the same relationship category.

For example, I assumed 100 sites and randomized their pairwise relations using category values 1 to 25, yielding the upper-triangular matrix relations:

import numpy as np

sites = 100
categ = 25
relations = np.random.randint(low=1, high=categ+1, size=(sites, sites))
relations = np.triu(relations)   # keep only the upper triangle, since relation_ij = relation_ji
np.fill_diagonal(relations, 0)   # ignore self-relations

I also have 5000 replicates of simulation results for each site:

sims = 5000
res = np.round(np.random.rand(sites, sims),1)

To compute the mean pairwise correlation for each relation category, I first calculate, for each category i, the correlation coefficient rho[j] between the simulation results res of each unique site pair j in that category, and then take the average over all such pairs:

rho_list = np.ones(categ) * 99
for i in range(1, categ+1):
    idr = np.transpose(np.where(relations == i))  # pairwise site indices of the same relation category
    comp = np.vstack([res[idr[:, 0]].ravel(), res[idr[:, 1]].ravel()])  # pairwise comparisons of simulation results from the same relation category
    comp_uniq = np.reshape(comp.T, (len(idr), res.shape[1], -1))  # reshape above into pairwise comparisons of simulation results between unique site pairs
    rho = np.ones(len(idr)) * 99  # correlation coefficients of all unique site pairs of current relation category
    for j in range(len(idr)):  # loop through unique site pairs
        comp_uniq_s = comp_uniq[j][np.all(comp_uniq != 0, axis=2)[j]].T  # shorten comparisons by removing pairs with a zero-valued result
        rho[j] = np.corrcoef(comp_uniq_s[0], comp_uniq_s[1])[0, 1]
    rho_list[i-1] = np.nanmean(rho)

Although this script works, once I increase sites to 400 the entire computation can take more than 6 hours to finish, which leads me to question my use of array functions. What is the reason for this poor performance? And how can I optimize the algorithm?

asked Jul 31, 2017 at 17:52

1 Answer


What is the reason for this poor performance? And how can I optimize the algorithm?

You mentioned that sites=100 is a "fast" run and sites=400 is a "slow" six-hour run, but you didn't post profiling data. I don't know what "fast" is, but if it's around 22 minutes then you're just looking at quadratic growth of your data: going from 100 to 400 sites multiplies the number of site pairs by roughly 16, and 16 × 22 minutes ≈ 6 hours. If "fast" was noticeably less than 22 minutes, then I'd speculate about your cache size: perhaps the larger problem blew out your L3 (or L2) last-level cache. That would suggest looking for a divide-and-conquer approach on overlapping neighborhoods, so that each sub-problem is small enough to fit within cache.
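
One way to tell which of those two situations you're in is to time a few smaller problem sizes and see how the runtime scales. The sketch below assumes a hypothetical wrapper, compute_rho_list(), that re-parameterizes the question's loop by problem size; it's meant only for the scaling check, not as a drop-in replacement.

import time
import numpy as np

def compute_rho_list(sites, categ=25, sims=5000):
    # Hypothetical wrapper: a simplified re-parameterization of the question's loop.
    relations = np.random.randint(1, categ + 1, size=(sites, sites))
    relations = np.triu(relations)
    np.fill_diagonal(relations, 0)
    res = np.round(np.random.rand(sites, sims), 1)
    rho_list = np.full(categ, np.nan)
    for i in range(1, categ + 1):
        idr = np.transpose(np.where(relations == i))
        rho = np.empty(len(idr))
        for j, (a, b) in enumerate(idr):
            mask = (res[a] != 0) & (res[b] != 0)  # drop replicates where either result is zero
            rho[j] = np.corrcoef(res[a][mask], res[b][mask])[0, 1]
        rho_list[i - 1] = np.nanmean(rho)
    return rho_list

for n in (50, 100, 200):
    start = time.perf_counter()
    compute_rho_list(n, sims=1000)    # fewer replicates just to keep the check quick
    elapsed = time.perf_counter() - start
    print(f"sites={n}: {elapsed:.1f} s")

If the times roughly quadruple each time you double sites, you're simply seeing the quadratic growth in the number of pairs; a sharper-than-quadratic jump at the larger sizes would point more toward cache effects.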

Other than that, perhaps you can "cheat" by changing the problem and approximating the results. Consider a fixed budget of 10^4 pairs for which you're willing to compute correlation coefficients. In the sites=400 case, could you rank-order the pairs so that the ones with negligible contribution sort to the bottom of the list, outside the budget? Then they would simply never be considered.
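
As a minimal sketch of what such a budget might look like, with a random subsample standing in for a real ranking criterion (which isn't specified here), and reusing relations, res, and categ from the question:

import numpy as np

budget = 10_000               # total number of pairs you're willing to correlate
per_categ = budget // categ   # naive even split across categories

rho_list = np.full(categ, np.nan)
for i in range(1, categ + 1):
    idr = np.transpose(np.where(relations == i))  # all site pairs in category i
    if len(idr) > per_categ:
        # Random subsample as a stand-in for ranking pairs by expected contribution.
        keep = np.random.choice(len(idr), size=per_categ, replace=False)
        idr = idr[keep]
    rho = np.empty(len(idr))
    for j, (a, b) in enumerate(idr):
        mask = (res[a] != 0) & (res[b] != 0)      # drop replicates with zero-valued results
        rho[j] = np.corrcoef(res[a][mask], res[b][mask])[0, 1]
    rho_list[i - 1] = np.nanmean(rho)

Whether the resulting averages are close enough to the exact ones depends on how many pairs each category actually has, so it's worth spot-checking against the full computation at a size you can still afford.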

answered Aug 5, 2017 at 4:44
