Confidence intervals with scipy

Question 1

I have an array of shape (n, timesteps), where n is the number of trials and timesteps is the length of each trial. Each value of this array denotes a stochastic measurement.
I would like to implement a generic function that computes a confidence interval for a given statistic (mean, median, ...) assuming 1) the underlying distribution is normal, 2) is Student's.

Something like

def normal_ci(
 data: np.array,
 axis: int = 0,
 statistic: Callable = np.mean,
 confidence: float = 0.95
):

and a similar function student_ci().

My problem is that, by default, scipy functions compute intervals for the mean, am I right? Like in this answer.

Question 2

This is on Stack Overflow, so the computational answer is to use the bootstrap.

from typing import Callable
import numpy as np
from scipy import stats
rng = np.random.default_rng(84354894298246)
def normal_ci(
 data: np.array,
 axis: int = 0,
 statistic: Callable = np.mean,
 confidence: float = 0.95
):
 res = stats.bootstrap((data,), statistic, axis=axis,
 confidence_level=confidence)
 return tuple(res.confidence_interval)
# generate 1000 samples, each of length 100
data = rng.standard_normal(size=(1000, 100))
# compute the confidence interval for each
low, high = normal_ci(data, axis=-1)
# the confidence interval contains the population value
# of the statistic in 95% of cases
np.mean((low < 0) & (0 < high)) # 0.953

Since you know the families from which your data are drawn, you could look into the parametric bootstrap.

def normal_ci(
 data: np.array,
 axis: int = 0,
 statistic: Callable = np.mean,
 confidence: float = 0.95
):
 # fit a normal distribution to the data
 mean = np.mean(data, axis=axis)
 std = np.std(data, ddof=1, axis=axis)
 # resample data from the fitted normal distribution
 n_resamples = 999
 m = data.shape[axis] # assuming axis is an integer
 resample = rng.normal(loc=mean, scale=std, size=(m, n_resamples) + mean.shape)
 # compute the statistic for each of the resamples to estimate
 # the distribution
 statistic_distribution = statistic(resample, axis=0)
 # Generate the confidence interval
 # percentile bootstrap (very crude)
 alpha = 1 - confidence
 low, high = np.quantile(statistic_distribution, [alpha/2, 1-alpha/2], axis=0)
 return low, high
low, high = normal_ci(data, axis=-1)
np.mean((low < 0) & (0 < high)) # 0.954

SciPy's bootstrap function uses the BCa method by default, which is reasonably sophisticated; this parametric bootstrap is what I could write in a few minutes. So for parametric bootstrap, you really should research what other libraries are out there.

If you're not happy with the bootstrap, you'd need to do some research: for each of the statistics and population families you're interested in, you need to know the distribution of the statistic of a sample drawn from your population. For a sample from a normal distribution, the sample mean is t-distributed, the variance is chi-squared distributed, etc... and that is not a question for Stack Overflow.

Question 3

Thanks, I have already implemented bootstrapping but I didn't think about the parametric version. Thanks! I tried yours but when I pass statistic = np.median to your parametric implementation, I get an interval that is almost the same to the mean's. This does not happen if I do non-parametric bootstrapping.

Question 4

Both are approximate and stochastic. I explained that the second was a crude sketch of parametric bootstrapping. So yeah, they may give different results. Check whether the coverage seems to be about right over many trials; that is the true test. Thanks are appreciated, but please remember Stack Overflow's recommendations on how to say it: "Comments are not recommended for any of the following... Compliments which do not add new information ("+1, great answer!"); instead, upvote it and pay it forward;" (stackoverflow.com/help/privileges/comment).

Question 5

Sure, I always upvote and accept an answer if it's right :) Sometime I wait a bit because when I do and ask for something with a comment, the person who answered just ignores my comments. Thanks again for replying!

Matt Haberland 4,4781 gold badge12 silver badges34 bronze badges · Accepted Answer · 2024-11-10 04:33:54Z

This is on Stack Overflow, so the computational answer is to use the bootstrap.

from typing import Callable
import numpy as np
from scipy import stats
rng = np.random.default_rng(84354894298246)
def normal_ci(
 data: np.array,
 axis: int = 0,
 statistic: Callable = np.mean,
 confidence: float = 0.95
):
 res = stats.bootstrap((data,), statistic, axis=axis,
 confidence_level=confidence)
 return tuple(res.confidence_interval)
# generate 1000 samples, each of length 100
data = rng.standard_normal(size=(1000, 100))
# compute the confidence interval for each
low, high = normal_ci(data, axis=-1)
# the confidence interval contains the population value
# of the statistic in 95% of cases
np.mean((low < 0) & (0 < high)) # 0.953

Since you know the families from which your data are drawn, you could look into the parametric bootstrap.

def normal_ci(
 data: np.array,
 axis: int = 0,
 statistic: Callable = np.mean,
 confidence: float = 0.95
):
 # fit a normal distribution to the data
 mean = np.mean(data, axis=axis)
 std = np.std(data, ddof=1, axis=axis)
 # resample data from the fitted normal distribution
 n_resamples = 999
 m = data.shape[axis] # assuming axis is an integer
 resample = rng.normal(loc=mean, scale=std, size=(m, n_resamples) + mean.shape)
 # compute the statistic for each of the resamples to estimate
 # the distribution
 statistic_distribution = statistic(resample, axis=0)
 # Generate the confidence interval
 # percentile bootstrap (very crude)
 alpha = 1 - confidence
 low, high = np.quantile(statistic_distribution, [alpha/2, 1-alpha/2], axis=0)
 return low, high
low, high = normal_ci(data, axis=-1)
np.mean((low < 0) & (0 < high)) # 0.954

SciPy's bootstrap function uses the BCa method by default, which is reasonably sophisticated; this parametric bootstrap is what I could write in a few minutes. So for parametric bootstrap, you really should research what other libraries are out there.

If you're not happy with the bootstrap, you'd need to do some research: for each of the statistics and population families you're interested in, you need to know the distribution of the statistic of a sample drawn from your population. For a sample from a normal distribution, the sample mean is t-distributed, the variance is chi-squared distributed, etc... and that is not a question for Stack Overflow.

Thanks, I have already implemented bootstrapping but I didn't think about the parametric version. Thanks! I tried yours but when I pass statistic = np.median to your parametric implementation, I get an interval that is almost the same to the mean's. This does not happen if I do non-parametric bootstrapping.
Both are approximate and stochastic. I explained that the second was a crude sketch of parametric bootstrapping. So yeah, they may give different results. Check whether the coverage seems to be about right over many trials; that is the true test. Thanks are appreciated, but please remember Stack Overflow's recommendations on how to say it: "Comments are not recommended for any of the following... Compliments which do not add new information ("+1, great answer!"); instead, upvote it and pay it forward;" (stackoverflow.com/help/privileges/comment).
Sure, I always upvote and accept an answer if it's right :) Sometime I wait a bit because when I do and ask for something with a comment, the person who answered just ignores my comments. Thanks again for replying!

CollectivesTM on Stack Overflow

Confidence intervals with scipy

1 Answer 1

3 Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Hot Network Questions

CollectivesTM on Stack Overflow

1 Answer 1

3 Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Linked

Related