Spearman correlations between Numpy array and every Pandas DataFrame row

Question 1

I want to efficiently calculate Spearman correlations between a NumPy array and every row of a pandas DataFrame:

import pandas as pd
import numpy as np
from scipy.stats import spearmanr
n_rows = 2500
cols = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
df = pd.DataFrame(np.random.random(size=(n_rows, len(cols))), columns=cols)
v = np.random.random(size=len(cols))
corr, _ = zip(*df.apply(lambda x: spearmanr(x,v), axis=1))
corr = pd.Series(corr)

Currently, computing corr takes:

%timeit corr, _ = zip(*df.apply(lambda x: spearmanr(x,v), axis=1))
>> 1.26 s ± 5.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I found another good approach but it calculates only Pearson correlations:

%timeit df.corrwith(pd.Series(v, index=df.columns), axis=1)
>> 466 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Is there a way to calculate Spearman correlations faster?

Roman Prilepskiy · Answer 1 · 2018-08-08 16:25:50Z

Since the Spearman correlation is the Pearson correlation coefficient of the rank-transformed variables, it is possible to do the following:

Replace the values in each df row with their ranks using the pandas.DataFrame.rank() function with axis=1.
Convert v to a pandas.Series and use the pandas.Series.rank() function to get its ranks.
Use the pandas.DataFrame.corrwith() function to calculate the Pearson correlation on the ranked data, which is the Spearman correlation.

import pandas as pd
import numpy as np
from scipy.stats import spearmanr
n_rows = 2500
cols = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
df = pd.DataFrame(np.random.random(size=(n_rows, len(cols))), columns=cols)
v = np.random.random(size=len(cols))
# original implementation
corr, _ = zip(*df.apply(lambda x: spearmanr(x,v), axis=1))
corr = pd.Series(corr)
# modified implementation
df1 = df.rank(axis=1)
v1 = pd.Series(v, index=df.columns).rank()
corr1 = df1.corrwith(v1, axis=1)

Calculation time of the modified version:

 %%timeit
 v1 = pd.Series(v, index=df.columns).rank()
 df1 = df.rank(axis=1)
 corr1 = df1.corrwith(v1,axis=1)
 >> 495 ms ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
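
If pandas overhead is still a concern, the same rank-then-Pearson idea can be written directly in NumPy. The following is only a sketch, not something benchmarked here: the helper name rowwise_spearman is made up for illustration, and it assumes a SciPy version whose rankdata accepts an axis argument. Like DataFrame.rank(), rankdata assigns average ranks to ties by default.

```python
import numpy as np
from scipy.stats import rankdata


def rowwise_spearman(a, v):
    """Spearman correlation of every row of the 2-D array `a` with the 1-D array `v`.

    Hypothetical helper: rank the data, then compute Pearson on the ranks.
    """
    ra = rankdata(a, axis=1)                  # ranks within each row (ties get average rank)
    rv = rankdata(v)                          # ranks of the reference vector
    ra = ra - ra.mean(axis=1, keepdims=True)  # centre each row of ranks
    rv = rv - rv.mean()                       # centre the reference ranks
    num = ra @ rv                             # row-wise covariance numerators
    den = np.sqrt((ra ** 2).sum(axis=1) * (rv ** 2).sum())
    return num / den                          # NaN for a constant row (zero variance)


rng = np.random.default_rng(0)                # fixed seed, illustration only
a = rng.random(size=(2500, 7))
v = rng.random(size=7)
corr2 = rowwise_spearman(a, v)
```

With the DataFrame from the question this would be called as rowwise_spearman(df.to_numpy(), v), and the result can be wrapped in pd.Series(..., index=df.index) if the row labels matter.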

Comparing the variance, mean, and median of corr and corr1 indicates that the results agree (this checks summary statistics, not each element):

 print(corr.var()-corr1.var(), corr.mean()-corr1.mean(), corr.median()-corr1.median())
 >> (0.0, 0.0, 0.0)
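
Matching summary statistics do not strictly guarantee that every row matches, so an element-wise comparison with a floating-point tolerance is a stronger check. A minimal sketch, assuming the same setup as above but with a seeded generator and fewer rows to keep it quick:

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

rng = np.random.default_rng(0)  # fixed seed so the check is reproducible
cols = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
df = pd.DataFrame(rng.random(size=(200, len(cols))), columns=cols)
v = rng.random(size=len(cols))

# reference: one spearmanr call per row, keeping only the coefficient
corr = df.apply(lambda row: spearmanr(row, v)[0], axis=1)

# rank-based version: Pearson correlation on ranked data
df1 = df.rank(axis=1)
v1 = pd.Series(v, index=df.columns).rank()
corr1 = df1.corrwith(v1, axis=1)

# element-wise comparison within a small tolerance
same = np.allclose(corr.to_numpy(), corr1.to_numpy())
print(same)
```

np.allclose is used instead of == because the two code paths may differ in the last few floating-point digits.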
