I'm looking for a faster way to run odds ratio tests on a large dataset. I have about 1200 variables (see var_col) that I want to test against each other for mutual exclusivity / co-occurrence. The odds ratio is defined as (a * d) / (b * c), where a, b, c, d are the numbers of samples that are (a) altered in neither site x nor y, (b) altered in site x but not y, (c) altered in y but not x, and (d) altered in both. I'd also like to run Fisher's exact test to determine statistical significance. The SciPy function fisher_exact can calculate both of these (see below).
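As a quick sanity check of that definition, here's a minimal sketch with made-up counts (the table entries are purely illustrative; scipy's fisher_exact treats a 2x2 table [[a, b], [c, d]] exactly this way):

from scipy.stats import fisher_exact

# [[a, b],   a = altered in neither, b = altered in x only
#  [c, d]]   c = altered in y only,  d = altered in both
table = [[10, 3],
         [2, 5]]
oddsratio, pvalue = fisher_exact(table)
print(oddsratio)  # (10 * 5) / (3 * 2) = 8.33...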

# here's a sample of my original dataframe
sample_id_no  var_col
           0    258.0
           1    -24.0
           2   -150.0
           3    149.0
           4    108.0
           5   -126.0
           6    -83.0
           7      2.0
           8   -177.0
           9   -171.0
          10     -7.0
          11   -377.0
          12   -272.0
          13     66.0
          14    -13.0
          15     -7.0
          16      0.0
          17    189.0
          18      7.0
          13    -21.0
          19     80.0
          20    -14.0
          21    -76.0
           3     83.0
          22   -182.0
import itertools

import numpy as np
import pandas as pd
from scipy.stats import fisher_exact

def getOddsRatio(pairs, sample_table):
    """Return [odds ratio, p-value] for one pair of variable sites."""
    alpha_site, beta_site = pairs
    # 2x2 table of altered / not altered for the two sites; fisher_exact
    # needs both True and False to occur on each axis to get a 2x2 table.
    contingency = pd.crosstab(sample_table[alpha_site] > 0,
                              sample_table[beta_site] > 0)
    oddsratio, pvalue = fisher_exact(contingency)
    return [oddsratio, pvalue]

# create a dataframe with each possible pair of variables
var_pairs = pd.DataFrame(
    list(itertools.combinations(df.var_col.unique(), 2))
).rename(columns={0: 'alpha_site', 1: 'beta_site'})

# create a cross-tab marking which samples carry which variables
sample_table = pd.crosstab(df.sample_id_no, df.var_col)

# apply the test to every pair, one row at a time
odds_ratio_results = var_pairs.apply(getOddsRatio, axis=1, args=(sample_table,))

This code runs very slowly, especially on large datasets. In my actual dataset I have around 700k variable pairs, and since getOddsRatio() is applied to each pair individually, it is clearly the main source of the slowness. Is there a more efficient solution?

asked Feb 3, 2018 at 4:04

1 Answer


It's unsurprising that this is slow: you have combinatoric blow-up to 276 pairs from only 25 input rows, you're using apply(), and the inner operation is itself expensive. It's conceivable that this could be sped up, but I'm not sure it's even worth it; instead, I doubt the validity of the operation being run.

In your output, 99.3% of rows have an odds ratio of exactly 0 and a p-value of exactly 1; for the other 0.7%, the odds ratio is infinite. Are you absolutely sure this makes sense? If so, there is a shortcut: work backward from the very few non-trivial output rows, observing that they require a crosstab with nonzero diagonal entries, and narrow the computation to find only those cases. If it doesn't make sense, it's back to the drawing board.
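To make that "work backward" idea concrete, here's a vectorized sketch. It assumes sample_table is the 0/1 crosstab from the question; the names B, counts and d are mine, and the p-value shortcut leans on the observation above that on this data the degenerate tables all come out at p = 1:

import numpy as np
from scipy.stats import fisher_exact

B = (sample_table.values > 0).astype(int)  # samples x vars, 0/1
n = B.shape[0]                             # number of samples
counts = B.sum(axis=0)                     # altered-sample count per var

D = B.T @ B                                # "both altered" for every var pair
i, j = np.triu_indices(B.shape[1], k=1)    # pairs follow sample_table.columns
d = D[i, j]                                # altered in both
b = counts[i] - d                          # altered in x only
c = counts[j] - d                          # altered in y only
a = n - b - c - d                          # altered in neither

with np.errstate(divide='ignore', invalid='ignore'):
    odds_ratio = (a * d) / (b * c)         # all pairs in one shot

# Only the rare pairs with d > 0 need an exact test; per the observation
# above, the rest come out at p = 1 on this data, so set them directly.
pvalues = np.ones(len(d))
for k in np.flatnonzero(d > 0):
    _, pvalues[k] = fisher_exact([[a[k], b[k]], [c[k], d[k]]])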

answered Jan 28 at 23:59