I'm looking for a faster way to run odds ratio tests on a large dataset. I have about 1200 variables (see var_col) that I want to test against each other for mutual exclusivity / co-occurrence. The odds ratio is defined as (a * d) / (b * c), where a, b, c, d are the numbers of samples that are (a) altered in neither site x nor y, (b) altered in site x but not y, (c) altered in y but not x, and (d) altered in both. I'd also like to run Fisher's exact test to determine statistical significance. The SciPy function fisher_exact can calculate both of these (see below).
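As a quick sanity check of that definition, here's a minimal sketch with made-up counts (the table entries are purely illustrative; scipy's fisher_exact treats a 2x2 table [[a, b], [c, d]] exactly this way):

from scipy.stats import fisher_exact

# [[a, b],   a = altered in neither, b = altered in x only
#  [c, d]]   c = altered in y only,  d = altered in both
table = [[10, 3],
         [2, 5]]
oddsratio, pvalue = fisher_exact(table)
print(oddsratio)  # (10 * 5) / (3 * 2) = 8.33...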

# here's a sample of my original dataframe
sample_id_no  var_col
           0    258.0
           1    -24.0
           2   -150.0
           3    149.0
           4    108.0
           5   -126.0
           6    -83.0
           7      2.0
           8   -177.0
           9   -171.0
          10     -7.0
          11   -377.0
          12   -272.0
          13     66.0
          14    -13.0
          15     -7.0
          16      0.0
          17    189.0
          18      7.0
          13    -21.0
          19     80.0
          20    -14.0
          21    -76.0
           3     83.0
          22   -182.0
import itertools

import numpy as np
import pandas as pd
from scipy.stats import fisher_exact

def getOddsRatio(pairs, sample_table):
    """Return [odds ratio, p-value] for one pair of variable sites."""
    alpha_site, beta_site = pairs
    # 2x2 table of altered / not altered for the two sites; fisher_exact
    # needs both True and False to occur on each axis to get a 2x2 table.
    contingency = pd.crosstab(sample_table[alpha_site] > 0,
                              sample_table[beta_site] > 0)
    oddsratio, pvalue = fisher_exact(contingency)
    return [oddsratio, pvalue]

# create a dataframe with each possible pair of variables
var_pairs = pd.DataFrame(
    list(itertools.combinations(df.var_col.unique(), 2))
).rename(columns={0: 'alpha_site', 1: 'beta_site'})

# create a cross-tab marking which samples carry which variables
sample_table = pd.crosstab(df.sample_id_no, df.var_col)

# apply the test to every pair, one row at a time
odds_ratio_results = var_pairs.apply(getOddsRatio, axis=1, args=(sample_table,))

This code runs very slowly, especially on large datasets. In my actual dataset I have around 700k variable pairs, and since getOddsRatio() is applied to each pair individually, it is clearly the main source of the slowness. Is there a more efficient solution?

asked Feb 3, 2018 at 4:04

1 Answer


It's unsurprising that this is slow: you have combinatoric blow-up to 276 pairs from only 25 input rows, you're using apply(), and the inner operation is itself expensive. It's conceivable that this could be sped up, but I'm not sure it's even worth it; instead, I doubt the validity of the operation being run.

In your output, 99.3% of rows have an odds ratio of exactly 0 and a p-value of exactly 1; for the other 0.7%, the odds ratio is infinite. Are you absolutely sure this makes sense? If so, there is a shortcut: work backward from the very few non-trivial output rows, observing that they require a crosstab with nonzero diagonal entries, and narrow the computation to find only those cases. If it doesn't make sense, it's back to the drawing board.
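To make that "work backward" idea concrete, here's a vectorized sketch. It assumes sample_table is the 0/1 crosstab from the question; the names B, counts and d are mine, and the p-value shortcut leans on the observation above that on this data the degenerate tables all come out at p = 1:

import numpy as np
from scipy.stats import fisher_exact

B = (sample_table.values > 0).astype(int)  # samples x vars, 0/1
n = B.shape[0]                             # number of samples
counts = B.sum(axis=0)                     # altered-sample count per var

D = B.T @ B                                # "both altered" for every var pair
i, j = np.triu_indices(B.shape[1], k=1)    # pairs follow sample_table.columns
d = D[i, j]                                # altered in both
b = counts[i] - d                          # altered in x only
c = counts[j] - d                          # altered in y only
a = n - b - c - d                          # altered in neither

with np.errstate(divide='ignore', invalid='ignore'):
    odds_ratio = (a * d) / (b * c)         # all pairs in one shot

# Only the rare pairs with d > 0 need an exact test; per the observation
# above, the rest come out at p = 1 on this data, so set them directly.
pvalues = np.ones(len(d))
for k in np.flatnonzero(d > 0):
    _, pvalues[k] = fisher_exact([[a[k], b[k]], [c[k], d[k]]])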

answered Jan 28 at 23:59