I want to efficiently calculate Spearman correlations between a NumPy array and every row of a pandas DataFrame:
import pandas as pd
import numpy as np
from scipy.stats import spearmanr
n_rows = 2500
cols = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
df = pd.DataFrame(np.random.random(size=(n_rows, len(cols))), columns=cols)
v = np.random.random(size=len(cols))
corr, _ = zip(*df.apply(lambda x: spearmanr(x,v), axis=1))
corr = pd.Series(corr)
For now, the calculation time of corr is:
%timeit corr, _ = zip(*df.apply(lambda x: spearmanr(x,v), axis=1))
>> 1.26 s ± 5.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I found another good approach but it calculates only Pearson correlations:
%timeit df.corrwith(pd.Series(v, index=df.columns), axis=1)
>> 466 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Is there a way to calculate Spearman correlations faster?
1 Answer
Since the Spearman correlation is the Pearson correlation coefficient of the ranked variables, it is possible to do the following:
- Replace the values in each row of df with their ranks using the pandas.DataFrame.rank() function.
- Convert v to a pandas.Series and use the pandas.Series.rank() function to get its ranks.
- Use the pandas.DataFrame.corrwith() function to calculate the Spearman correlation, i.e. the Pearson correlation on the ranked data.
import pandas as pd
import numpy as np
from scipy.stats import spearmanr

n_rows = 2500
cols = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
df = pd.DataFrame(np.random.random(size=(n_rows, len(cols))), columns=cols)
v = np.random.random(size=len(cols))

# original implementation
corr, _ = zip(*df.apply(lambda x: spearmanr(x,v), axis=1))
corr = pd.Series(corr)

# modified implementation
df1 = df.rank(axis=1)
v1 = pd.Series(v, index=df.columns).rank()
corr1 = df1.corrwith(v1, axis=1)
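The identity this approach relies on can be checked on a single pair of vectors. The following is a minimal sketch, assuming SciPy is available; the vectors x and y and the names rho_direct and rho_via_ranks are illustrative and not part of the original code.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Two small random vectors (illustrative data only).
rng = np.random.default_rng(0)
x = rng.random(7)
y = rng.random(7)

# Spearman correlation computed directly by SciPy.
rho_direct = spearmanr(x, y)[0]

# Pearson correlation of the ranks; rankdata assigns average ranks to ties.
rho_via_ranks = pearsonr(rankdata(x), rankdata(y))[0]

# The two values should agree up to floating-point error.
same = bool(np.isclose(rho_direct, rho_via_ranks))
print(rho_direct, rho_via_ranks, same)
```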
Calculation time of the modified version:
%%timeit
v1 = pd.Series(v, index=df.columns).rank()
df1 = df.rank(axis=1)
corr1 = df1.corrwith(v1,axis=1)
>> 495 ms ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
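If corrwith is still too slow, the same rank-then-Pearson idea can be written with plain NumPy array operations so that all rows are handled in one pass. This is only a sketch under stated assumptions: rowwise_spearman is a hypothetical helper name, it assumes no row of ranks is constant (otherwise the denominator is zero), and it has not been timed here.

```python
import numpy as np
import pandas as pd

def rowwise_spearman(df, v):
    """Spearman correlation of every row of df with v (hypothetical helper)."""
    # Rank each row and the vector; pandas uses average ranks for ties.
    r = df.rank(axis=1).to_numpy()
    rv = pd.Series(v).rank().to_numpy()
    # Centre the ranks so the Pearson formula reduces to a dot product.
    r = r - r.mean(axis=1, keepdims=True)
    rv = rv - rv.mean()
    # Pearson correlation of the centred ranks for all rows at once.
    num = r @ rv
    den = np.sqrt((r ** 2).sum(axis=1) * (rv ** 2).sum())
    return pd.Series(num / den, index=df.index)

# Usage on data shaped like the question's example.
rng = np.random.default_rng(0)
cols = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
df = pd.DataFrame(rng.random(size=(2500, len(cols))), columns=cols)
v = rng.random(size=len(cols))
corr2 = rowwise_spearman(df, v)
```

The result is a Series indexed like df, so it can be compared directly with corr1 from the pandas version.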
Comparing summary statistics of corr and corr1 indicates that the results are the same:
print(corr.var()-corr1.var(), corr.mean()-corr1.mean(), corr.median()-corr1.median())
>> (0.0, 0.0, 0.0)
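Matching variance, mean and median is a fairly weak test, so an element-wise comparison may be more convincing. Here is a minimal sketch, assuming a smaller frame of 200 rows to keep the slow version quick; the names max_abs_diff and all_close are illustrative.

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# Rebuild data shaped like the example above (fewer rows, fixed seed).
rng = np.random.default_rng(0)
cols = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
df = pd.DataFrame(rng.random(size=(200, len(cols))), columns=cols)
v = rng.random(size=len(cols))

# Row-by-row Spearman correlation with SciPy (the slow reference).
corr = pd.Series([spearmanr(row, v)[0] for row in df.to_numpy()], index=df.index)

# Rank-based version: Pearson correlation on ranked rows and ranked v.
corr1 = df.rank(axis=1).corrwith(pd.Series(v, index=df.columns).rank(), axis=1)

# Compare every element, allowing for floating-point error.
max_abs_diff = (corr - corr1).abs().max()
all_close = bool(np.allclose(corr.to_numpy(), corr1.to_numpy()))
print(max_abs_diff, all_close)
```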