I'm attempting to optimize the performance of the permutation test implemented in scipy.stats. My dataset consists of 500,000 observations, each associated with 2,000 binary covariates. I've applied approximately 10,000 different filters to these observations, so I need to compute roughly 20,000,000 permutation tests (10,000 filters × 2,000 covariates) over the 500,000 observations.
My approach involves generating a 2D boolean array using CuPy (500,000 x 2,000). I then element-wise multiply this 2D 'covariate matrix' by the 1D vector of test statistics (float values) and sum along the appropriate axis to obtain a 'reference statistic' for each covariate. To perform the permutation test, I compare this reference statistic to the distribution of statistics obtained from n permutations of the 2D covariate matrix.
In an effort to improve computational efficiency, I implemented an alternative method for generating the permuted covariate matrix. Instead of using cp.random.permutation(covariates), I'm using cp.random.rand(*covariates.shape) < proportion_of_ones, which is 2-3 times faster. I hypothesized that this approach would be valid because, while the number of 'ones' in each permuted matrix might vary, the distribution should be symmetric around the original number of ones (assuming a sufficiently large n).
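To make the difference concrete, here is a stripped-down sketch of just the two mask-generation strategies, following the orientation used in my code below (covariates as rows). The shapes and variable names are toy values for illustration only; swap cp for np and it should also run on the CPU:
import cupy as cp

# Toy dimensions, for illustration only
n_covariates, n_obs = 4, 1000
state = cp.random.rand(n_covariates, n_obs) < 0.01        # boolean covariate matrix
proportion_of_ones = cp.sum(state, axis=1) / state.shape[1]

# Strategy A: true permutation. Rows of state.T (observations) are shuffled as
# units, so the number of ones in every covariate is preserved exactly.
mask_perm = cp.random.permutation(state.T).T

# Strategy B: Bernoulli redraw. Faster, but the number of ones per covariate
# varies from draw to draw; it only matches the original on average.
mask_bern = cp.random.rand(*state.shape) < proportion_of_ones[:, None]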
I validated this hypothesis empirically and found that in many cases, my assumption appeared to hold, with the estimated p-values often differing by less than 0.01 between the two methods. However, when I introduced a slight modification to the distribution of the test statistic, my assumption no longer held, and the results from the two methods diverged significantly. I'm unable to determine the cause of this discrepancy and find myself at an impasse.
Here is my code for reference (replace cp with np and it should all work too):
from scipy.stats import ttest_ind
import pandas as pd
import numpy as np
import cupy as cp
from tqdm import tqdm
def test_permu_1(
    score: cp.ndarray, state: cp.ndarray, permu: int = 1000000
) -> cp.ndarray:
    # Reference statistic per covariate: |sum of scores over the "on" observations|
    value = cp.abs(cp.sum(score * state, axis=1))
    fraction_affected = cp.sum(state, axis=1) / state.shape[1]
    stats = []
    for _ in tqdm(range(permu), total=permu, desc="Permuting 1", ncols=120):
        # Redraw the mask as independent Bernoulli trials with the observed proportion
        mask = cp.random.rand(*state.shape, dtype=cp.float32)
        mask = mask < fraction_affected[:, None]
        stats.append(cp.sum(score * mask, axis=1))
    stats = cp.array(stats)
    return cp.sum(value < cp.abs(stats), axis=0) / permu
def test_permu_2(
    score: cp.ndarray, state: cp.ndarray, permu: int = 1000000
) -> cp.ndarray:
    value = cp.abs(cp.sum(score * state, axis=1))
    # NOTE, cp.random.permutation doesn't support axis = 1
    state = state.transpose()
    score = score[:, None]
    stats = []
    for _ in tqdm(range(permu), total=permu, desc="Permuting 2", ncols=120):
        # Shuffle observations (rows of the transposed matrix); every covariate
        # keeps exactly its original number of ones
        mask = cp.random.permutation(state)
        stats.append((score * mask).sum(axis=0))
    stats = cp.array(stats)
    return cp.sum(value < cp.abs(stats), axis=0) / permu
def ttest_ind_0(score: cp.ndarray, state: cp.ndarray) -> cp.ndarray:
    # Welch's t-test per covariate on the CPU, as a reference p-value
    np_score = score.get()
    np_state = state.get()
    ttests = []
    for i in range(np_state.shape[0]):
        tp = np_score[np_state[i]]
        tn = np_score[~np_state[i]]
        pv = ttest_ind(tp, tn, equal_var=False)
        ttests.append(pv.pvalue)  # type: ignore
    return np.array(ttests)
def dumpy(p0: np.ndarray, p1: np.ndarray, p2: np.ndarray) -> pd.DataFrame:
    df = pd.DataFrame(
        {
            "p0": p0,
            "p1": p1,
            "p2": p2,
            "-": ["."] * len(p0),
            "d10": p0 - p1,
            "d12": p1 - p2,
        }
    ).T
    with pd.option_context("float_format", "{:.5f}".format):
        print(df)
    return df
xsize = 100000
ysize = 18
# Pull scores from a normal distribution
score = cp.random.randn(xsize) + 10
state = cp.random.rand(ysize, xsize, dtype=cp.float32) < 0.001
# Perform the tests
permutations = 10000
p0s = ttest_ind_0(score, state)
p1s = test_permu_1(score, state, permutations).get()
p2s = test_permu_2(score, state, permutations).get()
dumpy(p0s, p1s, p2s)
print("Done!")
Note the line that reads score = cp.random.randn(xsize) + 10 ... If I remove the + 10, it all 'works', as in my assumption seems to hold up. With that line, the two different approaches 'diverge'...
Here is an example run WITHOUT the + 10:
Permuting 1: 100%|███████████████████████████████████████████████████████████████| 10000/10000 [00:30<00:00, 329.59it/s]
Permuting 2: 100%|███████████████████████████████████████████████████████████████| 10000/10000 [01:08<00:00, 146.35it/s]
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
p0 0.58190 0.27232 0.34934 0.21264 0.85781 0.09202 0.20862 0.24606 0.01390 0.59029 0.28032 0.71163 0.96128 0.86587 0.26078 0.77700 0.70671 0.62766
p1 0.61460 0.22130 0.32700 0.20900 0.82670 0.05520 0.20360 0.25450 0.02340 0.59310 0.30230 0.71890 0.93790 0.89820 0.27080 0.79350 0.71250 0.60850
p2 0.61070 0.22810 0.32330 0.19780 0.83530 0.05580 0.20810 0.25300 0.02360 0.58910 0.29630 0.72530 0.93780 0.89760 0.26720 0.79580 0.72110 0.60660
- . . . . . . . . . . . . . . . . . .
d10 -0.03270 0.05102 0.02234 0.00364 0.03111 0.03682 0.00502 -0.00844 -0.00950 -0.00281 -0.02198 -0.00727 0.02338 -0.03233 -0.01002 -0.01650 -0.00579 0.01916
d12 0.00390 -0.00680 0.00370 0.01120 -0.00860 -0.00060 -0.00450 0.00150 -0.00020 0.00400 0.00600 -0.00640 0.00010 0.00060 0.00360 -0.00230 -0.00860 0.00190
and here are two runs with the + 10. Note, the two runs are replicates on the same dummy data generated at the start (I just ran the permutation steps twice in debug mode). I wanted to check whether the divergence was simply higher variability caused by adding 10 to values roughly in the range -1 to 1 (which shrinks the relative spread of the data), but the permutation results are consistent within each method, so it isn't run-to-run noise. I hate being this clueless!
Permuting 1: 100%|███████████████████████████████████████████████████████████████| 10000/10000 [00:30<00:00, 329.59it/s]
Permuting 2: 100%|███████████████████████████████████████████████████████████████| 10000/10000 [01:08<00:00, 146.27it/s]
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
p0 0.29352 0.25426 0.33900 0.47577 0.30116 0.97216 0.81252 0.67735 0.57792 0.98254 0.38642 0.01012 0.56308 0.60787 0.46944 0.10419 0.57390 0.65886
p1 0.44620 0.45500 0.52370 0.46790 0.53630 0.50050 0.48880 0.51110 0.47350 0.49980 0.45840 0.59830 0.52310 0.46390 0.52090 0.55280 0.51700 0.48080
p2 0.13040 0.15300 0.80300 0.26190 0.87330 0.51890 0.40320 0.68480 0.28030 0.49180 0.17170 0.99550 0.69950 0.31450 0.75990 0.93470 0.73290 0.31700
- . . . . . . . . . . . . . . . . . .
d10 -0.15268 -0.20074 -0.18470 0.00787 -0.23514 0.47166 0.32372 0.16625 0.10442 0.48274 -0.07198 -0.58818 0.03998 0.14397 -0.05146 -0.44861 0.05690 0.17806
d12 0.31580 0.30200 -0.27930 0.20600 -0.33700 -0.01840 0.08560 -0.17370 0.19320 0.00800 0.28670 -0.39720 -0.17640 0.14940 -0.23900 -0.38190 -0.21590 0.16380
Permuting 1: 100%|███████████████████████████████████████████████████████████████| 10000/10000 [00:30<00:00, 329.57it/s]
Permuting 2: 100%|███████████████████████████████████████████████████████████████| 10000/10000 [01:08<00:00, 145.83it/s]
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
p0 0.29352 0.25426 0.33900 0.47577 0.30116 0.97216 0.81252 0.67735 0.57792 0.98254 0.38642 0.01012 0.56308 0.60787 0.46944 0.10419 0.57390 0.65886
p1 0.45250 0.45430 0.53450 0.47370 0.53260 0.49320 0.48450 0.51640 0.47310 0.49860 0.45260 0.59260 0.51230 0.46290 0.51770 0.55190 0.51670 0.47340
p2 0.12270 0.14980 0.79630 0.25370 0.87600 0.51530 0.40710 0.67740 0.28180 0.48010 0.17010 0.99530 0.70230 0.31050 0.77380 0.93760 0.73010 0.31800
- . . . . . . . . . . . . . . . . . .
d10 -0.15898 -0.20004 -0.19550 0.00207 -0.23144 0.47896 0.32802 0.16095 0.10482 0.48394 -0.06618 -0.58248 0.05078 0.14497 -0.04826 -0.44771 0.05720 0.18546
d12 0.32980 0.30450 -0.26180 0.22000 -0.34340 -0.02210 0.07740 -0.16100 0.19130 0.01850 0.28250 -0.40270 -0.19000 0.15240 -0.25610 -0.38570 -0.21340 0.15540
I then increased n by a factor of 10 and re-ran (with just 3 conditions for speed):
Permuting 1: 100%|███████████████████████████████████████████████████████████████| 10000/10000 [00:54<00:00, 185.17it/s]
Permuting 2: 100%|████████████████████████████████████████████████████████████████| 10000/10000 [03:20<00:00, 49.85it/s]
0 1 2
p0 0.84096 0.06036 0.28752
p1 0.48310 0.43160 0.54530
p2 0.43590 0.03200 0.85530
- . . .
d10 0.35786 -0.37124 -0.25778
d12 0.04720 0.39960 -0.31000
1 Answer
Your other implementation of the test sounds like a Monte Carlo hypothesis test rather than a permutation test.
The null hypothesis of the permutation test is essentially that there is nothing special about the order of the observed sample, so the null distribution is generated by reordering the observations at random.
The null hypothesis of the Monte Carlo test is that your covariates are a sequence of observations from a Bernoulli distribution (with probability of success estimated from the observed data), so the null distribution is generated by drawing from this distribution.
Since you are testing different null hypotheses, you may get different results. (Incidentally, the arguments you might make about the implications are also different, since the permutation test follows a "randomization" model of inference whereas the Monte Carlo test follows a "population" model of inference - see 3.3 of "Why permutation tests are superior to t and F tests in biomedical research".)
Consider whether Pearson's test would suffice (rather than a permutation test) in your application. It should accept CuPy arrays by default as of SciPy 1.14, and produces p-values very similar to the permutation test.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
rng = np.random.default_rng(2348934349)
n = 1000
x = stats.bernoulli.rvs(p=0.5, size=n)
y = rng.random(size=n)
def statistic(x, y, axis=-1):
    return np.sum(x*y, axis=axis)

ref = stats.permutation_test((x,), lambda x, axis: statistic(x, y, axis),
                             permutation_type='pairings', alternative='greater')
res1 = stats.monte_carlo_test((x,), [stats.bernoulli(p=np.mean(x)).rvs],
                              lambda x, axis: statistic(x, y, axis),
                              alternative='greater')
res2 = stats.pearsonr(x, y, alternative='greater')
plt.hist(ref.null_distribution, density=True, alpha=0.5, label='Permutation test')
plt.hist(res1.null_distribution, density=True, alpha=0.5, label='Monte Carlo test')
plt.legend()
print(ref.pvalue, res1.pvalue, res2.pvalue)
# 0.7661 0.6487 0.7621940403555396
Assuming there is not a mistake in the code, why the two produce different p-values in your case (despite testing superficially similar null hypotheses) is even more of a statistics question than what I've answered here. Intuitively, the null distribution of the Monte Carlo test may be more spread out, since the number of ones in the covariate - not just their order - is variable. That seems to be confirmed by the code above. Consider describing the mathematics and posting that part at Cross Validated if you need more about it.
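If you want a quick numerical check of that intuition without the scipy machinery, a bare-bones NumPy sketch comparing the spread of the two null distributions might look like the following (made-up data and variable names, chosen just for this illustration):
import numpy as np

rng = np.random.default_rng(0)
n = 1000
y = rng.random(n)                 # fixed scores with a nonzero mean (like your score + 10)
x = rng.random(n) < 0.5           # observed binary covariate
p_hat = x.mean()

# Null by reordering: the count of ones is fixed at its observed value.
perm_null = np.array([np.sum(rng.permutation(x) * y) for _ in range(10000)])
# Null by redrawing: the count of ones varies around n * p_hat.
bern_null = np.array([np.sum((rng.random(n) < p_hat) * y) for _ in range(10000)])

print(perm_null.std(), bern_null.std())
# The Bernoulli null is clearly wider here; the extra spread comes from the
# varying count of ones multiplied by the mean of y, which lines up with the
# observation in the question that removing the +10 makes the two methods agree.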