You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
A DuckDB extension that add comprehensive statistical distribution functions to DuckDB, enabling advanced statistical analysis, probability calculations, and random sampling directly within SQL queries.
The stochastic developed by Query.Farm extension adds comprehensive statistical distribution functions to DuckDB, enabling advanced statistical analysis, probability calculations, and random sampling directly within SQL queries.
INSTALL stochastic FROM community;
LOAD stochastic;
What are statistical distributions?
Statistical distributions are mathematical functions that describe the probability of different outcomes in a dataset. They are fundamental to statistics, data science, machine learning, and scientific computing. This extension provides functions to:
Calculate probability density and mass functions (PDF/PMF)
Compute cumulative distribution functions (CDF)
Generate quantiles (inverse CDF)
Sample random values from distributions
Access distribution properties (mean, variance, etc.)
Available Distributions
The extension supports a comprehensive set of probability distributions:
Below are the parameters for each supported distribution. Use these as arguments for sampling, PDF, CDF, and other functions.
Continuous Distributions
Beta
Parameter
Description
alpha
Shape parameter α (> 0)
beta
Shape parameter β (> 0)
Cauchy
Parameter
Description
location
Location parameter x0
scale
Scale parameter γ (> 0)
Chi-squared
Parameter
Description
df
Degrees of freedom (> 0)
Exponential
Parameter
Description
rate
Rate parameter λ (> 0)
Extreme Value
Parameter
Description
location
Location parameter
scale
Scale parameter (> 0)
Fisher F
Parameter
Description
df1
Numerator degrees of freedom (> 0)
df2
Denominator degrees of freedom (> 0)
Gamma
Parameter
Description
shape
Shape parameter k (> 0)
scale
Scale parameter θ (> 0)
Laplace
Parameter
Description
location
Location parameter μ
scale
Scale parameter b (> 0)
Log-normal
Parameter
Description
meanlog
Mean of log values
sdlog
Standard deviation of log values (> 0)
Logistic
Parameter
Description
location
Location parameter
scale
Scale parameter (> 0)
Normal (Gaussian)
Parameter
Description
mean
Mean μ
stddev
Standard deviation σ (> 0)
Pareto
Parameter
Description
scale
Scale parameter xm (> 0)
shape
Shape parameter α (> 0)
Rayleigh
Parameter
Description
scale
Scale parameter σ (> 0)
Student's t
Parameter
Description
df
Degrees of freedom (> 0)
Uniform (Real)
Parameter
Description
min
Lower bound
max
Upper bound (must be > min)
Weibull
Parameter
Description
shape
Shape parameter k (> 0)
scale
Scale parameter λ (> 0)
Discrete Distributions
Bernoulli
Parameter
Description
p
Probability of success (0 ≤ p ≤ 1)
Binomial
Parameter
Description
n
Number of trials (integer ≥ 0)
p
Probability of success (0 ≤ p ≤ 1)
Geometric
Parameter
Description
p
Probability of success (0 ≤ p ≤ 1)
Negative Binomial
Parameter
Description
r
Number of successes (integer > 0)
p
Probability of success (0 ≤ p ≤ 1)
Poisson
Parameter
Description
rate
Rate parameter λ (> 0)
Uniform (Integer)
Parameter
Description
min
Lower bound (integer)
max
Upper bound (integer, must be ≥ min)
Usage Examples
Normal Distribution
-- Generate random samples from N(0, 1)SELECT dist_normal_sample(0.0, 1.0) AS random_value;
-- Calculate PDF at x = 0.5 for N(0, 1)SELECT dist_normal_pdf(0.0, 1.0, 0.5) AS density;
-- Calculate CDF (probability that X ≤ 1.96)SELECT dist_normal_cdf(0.0, 1.0, 1.96) AS probability;
-- Find 95th percentileSELECT dist_normal_quantile(0.0, 1.0, 0.95) AS percentile_95;
-- Get distribution propertiesSELECT
dist_normal_mean(0.0, 1.0) AS mean,
dist_normal_variance(0.0, 1.0) AS variance,
dist_normal_skewness(0.0, 1.0) AS skewness;
Binomial Distribution
-- Probability mass function for 10 trials, p=0.3SELECT dist_binomial_pdf(10, 0.3, 7) AS prob_exactly_7;
-- Cumulative probability (≤ 5 successes)SELECT dist_binomial_cdf(10, 0.3, 5) AS prob_at_most_5;
-- Generate random binomial samplesSELECT dist_binomial_sample(10, 0.3) AS random_successes;
Working with Data Tables
-- Generate synthetic datasetCREATETABLEsynthetic_dataASSELECT
i,
dist_normal_sample(100, 15) AS height_cm,
dist_normal_sample(70, 10) AS weight_kg,
dist_binomial_sample(1, 0.5) AS gender -- 0 or 1FROM range(1000) t(i);
-- Calculate z-scoresSELECT
height_cm,
(height_cm - dist_normal_mean(100, 15)) / dist_normal_stddev(100, 15) AS height_zscore
FROM synthetic_data;
-- Probability calculationsSELECT
weight_kg,
dist_normal_cdf(70, 10, weight_kg) AS percentile
FROM synthetic_data
LIMIT10;
Real-World Applications
A/B Testing and Statistical Significance
Common Task: Determine if there's a statistically significant difference between conversion rates.
Relevant Functions: dist_normal_cdf, dist_normal_cdf_complement, dist_normal_pdf
Financial Risk Assessment and VaR Calculation
Common Task: Calculate Value at Risk (VaR) for portfolio management.
Relevant Functions: dist_normal_sample, dist_normal_quantile, dist_normal_cdf
Quality Control and Process Monitoring
Common Task: Monitor manufacturing processes and detect out-of-control conditions.
Relevant Functions: dist_normal_sample, dist_normal_cdf, dist_normal_pdf
Predictive Analytics and Confidence Intervals
Common Task: Build prediction intervals for forecasting models.
Relevant Functions: dist_normal_quantile, dist_normal_cdf, dist_normal_sample
Customer Analytics and CLV Modeling
Common Task: Model customer lifetime value with uncertainty quantification.
Relevant Functions: dist_normal_sample, dist_exponential_sample, dist_normal_quantile, dist_normal_cdf
Anomaly Detection and Outlier Analysis
Common Task: Detect anomalies in time series data using statistical methods.
Relevant Functions: dist_normal_pdf, dist_normal_cdf, dist_normal_cdf_complement
Monte Carlo Simulations
Common Task: Run Monte Carlo simulations for risk analysis, optimization, or modeling.
Relevant Functions: dist_normal_sample, dist_uniform_real_sample, dist_gamma_sample, dist_beta_sample
Common Task: Implement Bayesian statistical models and posterior analysis.
Relevant Functions: dist_beta_pdf, dist_gamma_pdf, dist_normal_pdf, dist_beta_sample
Survival Analysis
Common Task: Analyze time-to-event data in medical research or reliability engineering.
Relevant Functions: dist_exponential_pdf, dist_weibull_pdf, dist_gamma_pdf, dist_exponential_cdf
Why Use DuckDB + Stochastic vs Python/R?
✅ Advantages
No Data Movement: Analysis happens where your data lives
SQL Familiarity: Use existing SQL skills instead of learning specialized libraries
Performance: Columnar processing with vectorized statistical operations
Integration: Works seamlessly with existing BI tools and SQL workflows
Real-time: Analyze streaming data without export/import cycles
📊 Performance Benefits
Statistical operations are vectorized and optimized for DuckDB's columnar engine.
Parameter Validation
All distribution functions include comprehensive parameter validation:
-- This will throw an error: standard deviation must be > 0SELECT dist_normal_pdf(0.0, -1.0, 0.5);
-- Error: normal: Standard deviation must be > 0 was: -1.000000-- This will throw an error: probability must be between 0 and 1SELECT dist_binomial_pdf(10, 1.5, 5);
-- Error: binomial: Probability must be between 0 and 1 was: 1.500000
License
MIT Licensed
About
A DuckDB extension that add comprehensive statistical distribution functions to DuckDB, enabling advanced statistical analysis, probability calculations, and random sampling directly within SQL queries.