6
\$\begingroup\$

I want to efficiently calculate Spearman correlations between a NumPy array and every row of a pandas DataFrame:

    import pandas as pd
    import numpy as np
    from scipy.stats import spearmanr
    n_rows = 2500
    cols = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
    df = pd.DataFrame(np.random.random(size=(n_rows, len(cols))), columns=cols)
    v = np.random.random(size=len(cols))
    corr, _ = zip(*df.apply(lambda x: spearmanr(x,v), axis=1))
    corr = pd.Series(corr)

Currently, computing corr takes:

    %timeit corr, _ = zip(*df.apply(lambda x: spearmanr(x,v), axis=1))
    >> 1.26 s ± 5.19 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I found a faster approach, but it calculates only Pearson correlations:

    %timeit df.corrwith(pd.Series(v, index=df.columns), axis=1)
    >> 466 ms ± 1.12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Is there a way to calculate Spearman correlations faster?

Jamal
asked Aug 7, 2018 at 5:23
\$\endgroup\$

1 Answer

4
\$\begingroup\$

Since Spearman correlation is the Pearson correlation coefficient of the ranked version of the variables, it is possible to do the following:

  1. Replace the values in each row of df with their ranks using pandas.DataFrame.rank().
  2. Convert v to a pandas.Series and use pandas.Series.rank() to get its ranks.
  3. Use pandas.DataFrame.corrwith() to calculate the Spearman correlation, that is, the Pearson correlation on the ranked data.

    import pandas as pd
    import numpy as np
    from scipy.stats import spearmanr
    n_rows = 2500
    cols = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
    df = pd.DataFrame(np.random.random(size=(n_rows, len(cols))), columns=cols)
    v = np.random.random(size=len(cols))
    # original implementation
    corr, _ = zip(*df.apply(lambda x: spearmanr(x,v), axis=1))
    corr = pd.Series(corr)
    # modified implementation
    df1 = df.rank(axis=1)
    v1 = pd.Series(v, index=df.columns).rank()
    corr1 = df1.corrwith(v1, axis=1)
    

Calculation time of the modified version:

    %%timeit
    v1 = pd.Series(v, index=df.columns).rank()
    df1 = df.rank(axis=1)
    corr1 = df1.corrwith(v1,axis=1)
    >> 495 ms ± 13.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
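
If that is still too slow, the same rank-then-Pearson idea could be written with plain NumPy array operations so that no Python-level function is called per row. The sketch below is only an illustration, not something I have timed: the helper name `rowwise_spearman` is made up for this example, and it assumes no row is constant (a constant row would give a zero denominator and a NaN result).

```python
import numpy as np
import pandas as pd
from scipy.stats import rankdata


def rowwise_spearman(df, v):
    """Spearman correlation between each row of df and the vector v.

    Hypothetical helper for illustration; assumes no row of df is constant.
    """
    # Rank each row and the vector (average ranks for ties, as spearmanr does).
    r = df.rank(axis=1).to_numpy()
    rv = rankdata(v)
    # Centre the ranks so that Pearson becomes a normalised dot product.
    r = r - r.mean(axis=1, keepdims=True)
    rv = rv - rv.mean()
    num = r @ rv
    den = np.sqrt((r ** 2).sum(axis=1) * (rv ** 2).sum())
    return pd.Series(num / den, index=df.index)
```

With the data from the question, `corr2 = rowwise_spearman(df, v)` should give the same values as corr1 up to floating-point error.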

Comparing the variance, mean, and median of corr and corr1 shows no difference, which indicates that the results are the same:

    print(corr.var()-corr1.var(), corr.mean()-corr1.mean(), corr.median()-corr1.median())
    >> (0.0, 0.0, 0.0)
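
Matching summary statistics do not strictly rule out differences in individual elements, so a stricter check is to compare the two series element by element with a floating-point tolerance. A minimal sketch, assuming a smaller random data set (200 rows, fixed seed) than in the question so that it runs quickly:

```python
import numpy as np
import pandas as pd
from scipy.stats import spearmanr

# Smaller, seeded data set (an assumption for this example).
rng = np.random.default_rng(0)
cols = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
df = pd.DataFrame(rng.random((200, len(cols))), columns=cols)
v = rng.random(len(cols))

# Reference: one spearmanr call per row.
corr = pd.Series([spearmanr(row, v)[0] for row in df.to_numpy()], index=df.index)

# Rank-then-Pearson version.
corr1 = df.rank(axis=1).corrwith(pd.Series(v, index=df.columns).rank(), axis=1)

# Element-wise comparison with a floating-point tolerance.
all_close = np.allclose(corr, corr1)
max_abs_diff = (corr - corr1).abs().max()
print(all_close, max_abs_diff)
```

`np.allclose` is used rather than `==` because the two code paths may differ in the last few bits.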
answered Aug 8, 2018 at 16:25
\$\endgroup\$
