# Statistical Analysis

8 specialized tools for descriptive statistics, correlation analysis, and time-series operations in PostgreSQL.
| Tool | Purpose | Statistical Method |
|---|---|---|
| `stat_describe` | Descriptive statistics | Mean, median, stddev, percentiles |
| `stat_correlation` | Correlation analysis | Pearson correlation coefficient |
| `stat_regression` | Linear regression | Least squares regression |
| `stat_distribution` | Distribution analysis | Histogram, quartiles, outliers |
| `stat_time_series` | Time-series analysis | Moving averages, trends |
| `stat_aggregates` | Advanced aggregations | Mode, variance, skewness |
| `stat_percentile` | Percentile calculations | Custom percentile ranges |
| `stat_outliers` | Outlier detection | IQR method, z-score |
## stat_describe

Calculate comprehensive descriptive statistics for numeric columns.
Parameters:

- `table_name` (string, required): Source table
- `column_name` (string, required): Numeric column to analyze
- `group_by` (string, optional): Column to group by
Returns:

- `count`: Number of values
- `mean`: Average value
- `median`: Middle value
- `mode`: Most frequent value
- `stddev`: Standard deviation
- `variance`: Variance
- `min`: Minimum value
- `max`: Maximum value
- `quartiles`: Q1, Q2 (median), Q3
- `iqr`: Interquartile range
Example:

```python
result = stat_describe(
    table_name="sales",
    column_name="amount"
)
# Returns:
# {
#   "count": 10000,
#   "mean": 125.45,
#   "median": 98.50,
#   "stddev": 45.23,
#   "min": 5.00,
#   "max": 500.00,
#   "quartiles": {"q1": 65.00, "q2": 98.50, "q3": 145.00},
#   "iqr": 80.00
# }

# With grouping
result = stat_describe(
    table_name="sales",
    column_name="amount",
    group_by="region"
)
```
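To illustrate how these summary statistics are defined, here is a minimal sketch using Python's standard library on a hypothetical sample; the tool itself computes these inside PostgreSQL, so this is only a reference for the definitions, not its implementation:

```python
import statistics

values = [5.0, 65.0, 98.5, 145.0, 500.0]  # hypothetical sample

count = len(values)
mean = statistics.mean(values)
median = statistics.median(values)
stddev = statistics.stdev(values)               # sample standard deviation
q1, q2, q3 = statistics.quantiles(values, n=4)  # quartile cut points
iqr = q3 - q1                                   # interquartile range
```

Note that `statistics.quantiles` interpolates between data points by default, so quartiles on small samples can differ slightly from other conventions.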
Use Cases:
- Data exploration
- Quality assurance
- Performance baselines
- Report generation
## stat_correlation

Calculate correlation between two numeric columns.
Parameters:

- `table_name` (string, required): Source table
- `column1` (string, required): First numeric column
- `column2` (string, required): Second numeric column
Returns:

- `correlation`: Pearson correlation coefficient (-1 to 1)
- `p_value`: Statistical significance
- `sample_size`: Number of data points
- `interpretation`: Human-readable interpretation
Example:

```python
result = stat_correlation(
    table_name="products",
    column1="price",
    column2="sales_volume"
)
# Returns:
# {
#   "correlation": -0.78,
#   "p_value": 0.0001,
#   "sample_size": 500,
#   "interpretation": "Strong negative correlation"
# }
```
Interpretation:

- 1.0: Perfect positive correlation
- 0.7 to 0.9: Strong positive correlation
- 0.4 to 0.6: Moderate positive correlation
- 0.1 to 0.3: Weak positive correlation
- 0.0: No correlation
- -0.1 to -0.3: Weak negative correlation
- -0.4 to -0.6: Moderate negative correlation
- -0.7 to -0.9: Strong negative correlation
- -1.0: Perfect negative correlation
Use Cases:
- Price vs demand analysis
- Feature selection for ML
- A/B test validation
- Business metric relationships
## stat_regression

Perform linear regression analysis.
Parameters:

- `table_name` (string, required): Source table
- `x_column` (string, required): Independent variable
- `y_column` (string, required): Dependent variable
Returns:

- `slope`: Regression coefficient
- `intercept`: Y-intercept
- `r_squared`: Coefficient of determination
- `equation`: Regression equation
- `predictions`: Sample predictions
Example:

```python
result = stat_regression(
    table_name="marketing",
    x_column="ad_spend",
    y_column="revenue"
)
# Returns:
# {
#   "slope": 3.45,
#   "intercept": 1000.00,
#   "r_squared": 0.85,
#   "equation": "y = 3.45x + 1000.00",
#   "interpretation": "For every $1 in ad spend, revenue increases by $3.45"
# }
```
Use Cases:
- Revenue forecasting
- Cost prediction
- Trend analysis
- ROI calculation
## stat_distribution

Analyze data distribution with histograms and frequency analysis.
Parameters:

- `table_name` (string, required): Source table
- `column_name` (string, required): Column to analyze
- `num_bins` (integer, optional): Number of histogram bins (default: 10)
Returns:

- `histogram`: Frequency distribution
- `quartiles`: Q1, Q2, Q3, Q4
- `outliers`: Values outside normal range
- `skewness`: Distribution skew
- `kurtosis`: Distribution peakedness
Example:

```python
result = stat_distribution(
    table_name="orders",
    column_name="order_value",
    num_bins=10
)
# Returns:
# {
#   "histogram": [
#     {"bin": "0-50", "count": 120},
#     {"bin": "50-100", "count": 450},
#     {"bin": "100-150", "count": 380},
#     ...
#   ],
#   "quartiles": [25, 75, 125, 250],
#   "outliers": {"low": [], "high": [850, 920, 1050]},
#   "skewness": 0.45,
#   "kurtosis": 2.1
# }
```
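Equal-width binning is one plausible way such a histogram is produced; the sketch below illustrates the idea on hypothetical data (the tool's actual binning strategy may differ):

```python
def histogram(values, num_bins=10):
    # Equal-width bins spanning [min, max]; the top edge is inclusive,
    # so the maximum value lands in the last bin rather than overflowing
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return [
        {"bin": f"{lo + i * width:g}-{lo + (i + 1) * width:g}", "count": c}
        for i, c in enumerate(counts)
    ]

bins = histogram([5, 12, 18, 25, 33, 41, 49, 50], num_bins=5)
```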
Use Cases:
- Data quality checks
- Anomaly detection
- Pricing strategy
- Inventory optimization
## stat_time_series

Analyze time-series data with moving averages and trends.
Parameters:

- `table_name` (string, required): Source table
- `time_column` (string, required): Timestamp/date column
- `value_column` (string, required): Numeric value column
- `window_size` (integer, optional): Moving average window (default: 7)
Returns:

- `moving_average`: Rolling average values
- `trend`: Linear trend direction
- `seasonality`: Detected patterns
- `forecast`: Next period prediction
Example:

```python
result = stat_time_series(
    table_name="daily_sales",
    time_column="sale_date",
    value_column="total_amount",
    window_size=7
)
# Returns:
# {
#   "moving_average": [
#     {"date": "2025-10-01", "value": 1250.50, "ma_7": 1200.00},
#     {"date": "2025-10-02", "value": 1300.00, "ma_7": 1225.00},
#     ...
#   ],
#   "trend": "increasing",
#   "trend_slope": 15.5,
#   "forecast_next": 1450.00
# }
```
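A trailing moving average over a fixed window, as used here, can be sketched like this (an illustration on hypothetical values; positions without a full window are reported as `None`):

```python
def moving_average(values, window_size=7):
    # Trailing moving average: each point is the mean of the last
    # `window_size` values; None until a full window is available
    out = []
    for i in range(len(values)):
        if i + 1 < window_size:
            out.append(None)
        else:
            window = values[i + 1 - window_size : i + 1]
            out.append(sum(window) / window_size)
    return out

ma = moving_average([10, 20, 30, 40, 50], window_size=3)
```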
Use Cases:
- Sales forecasting
- Demand planning
- Performance monitoring
- Capacity planning
## stat_aggregates

Calculate advanced statistical aggregates.
Parameters:

- `table_name` (string, required): Source table
- `column_name` (string, required): Column to analyze
- `group_by` (string, optional): Grouping column
Returns:

- `mode`: Most frequent value
- `variance`: Variance
- `stddev_pop`: Population standard deviation
- `stddev_samp`: Sample standard deviation
- `coef_variation`: Coefficient of variation
Example:

```python
result = stat_aggregates(
    table_name="sensor_data",
    column_name="temperature",
    group_by="device_id"
)
# Returns:
# {
#   "groups": [
#     {
#       "device_id": "sensor_1",
#       "mode": 22.5,
#       "variance": 4.2,
#       "stddev_pop": 2.05,
#       "coef_variation": 0.091
#     },
#     ...
#   ]
# }
```
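The relationships among these aggregates, in particular coefficient of variation = population stddev / mean, can be illustrated with Python's `statistics` module on hypothetical sensor readings (a definitional sketch, not the tool's SQL):

```python
import statistics

readings = [22.5, 22.5, 21.0, 24.0, 23.5]  # hypothetical sensor readings

mode = statistics.mode(readings)            # most frequent value
variance = statistics.pvariance(readings)   # population variance
stddev_pop = statistics.pstdev(readings)    # population standard deviation
stddev_samp = statistics.stdev(readings)    # sample standard deviation (n - 1)
coef_variation = stddev_pop / statistics.mean(readings)
```

The coefficient of variation is unitless, which makes it useful for comparing spread across devices that operate at different baseline levels.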
Use Cases:
- Quality control
- Process monitoring
- Variance analysis
- Data validation
## stat_percentile

Calculate custom percentiles and quantiles.
Parameters:

- `table_name` (string, required): Source table
- `column_name` (string, required): Numeric column
- `percentiles` (list, required): List of percentiles (0-100)
Returns:
- Percentile values for requested percentiles
Example:

```python
result = stat_percentile(
    table_name="response_times",
    column_name="latency_ms",
    percentiles=[50, 90, 95, 99]
)
# Returns:
# {
#   "p50": 45.2,
#   "p90": 125.8,
#   "p95": 180.5,
#   "p99": 450.0
# }
```
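One common definition, the nearest-rank percentile, can be sketched as follows on hypothetical latencies. Note that PostgreSQL's `percentile_cont` interpolates between values instead, so the tool's results may differ slightly from this sketch:

```python
import math

def percentile(values, p):
    # Nearest-rank: the smallest value with at least p% of the data
    # at or below it
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

latencies = list(range(1, 101))  # hypothetical latencies, 1..100 ms
p50 = percentile(latencies, 50)
p99 = percentile(latencies, 99)
```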
Use Cases:
- SLA monitoring
- Performance analysis
- Capacity planning
- Quality metrics
## stat_outliers

Detect outliers using statistical methods.
Parameters:

- `table_name` (string, required): Source table
- `column_name` (string, required): Column to analyze
- `method` (string, optional): Detection method (`iqr` or `zscore`; default: `iqr`)
- `threshold` (number, optional): Detection threshold
Returns:

- `outliers`: List of outlier values
- `lower_bound`: Lower threshold
- `upper_bound`: Upper threshold
- `outlier_count`: Number of outliers
- `outlier_percentage`: Percentage of data
Example:

```python
# IQR method (default)
result = stat_outliers(
    table_name="transactions",
    column_name="amount",
    method="iqr"
)
# Returns:
# {
#   "method": "iqr",
#   "lower_bound": -50.00,
#   "upper_bound": 350.00,
#   "outliers": [450.00, 520.00, 890.00],
#   "outlier_count": 3,
#   "outlier_percentage": 0.03
# }

# Z-score method
result = stat_outliers(
    table_name="transactions",
    column_name="amount",
    method="zscore",
    threshold=3.0
)
```
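Both detection methods can be sketched directly. The IQR method uses Tukey's fences, Q1 − 1.5·IQR and Q3 + 1.5·IQR, while the z-score method flags values more than `threshold` standard deviations from the mean. The data is hypothetical and this is an illustration of the formulas, not the tool's implementation:

```python
import statistics

def iqr_bounds(values, k=1.5):
    # Tukey's fences: anything outside [Q1 - k*IQR, Q3 + k*IQR] is an outlier
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def zscore_outliers(values, threshold=3.0):
    # Values more than `threshold` standard deviations from the mean
    mean = statistics.mean(values)
    stddev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / stddev > threshold]

data = [10, 12, 11, 13, 12, 11, 10, 13, 12, 100]
low, high = iqr_bounds(data)
iqr_outliers = [v for v in data if v < low or v > high]
```

One caveat worth noting: an extreme outlier inflates the mean and standard deviation it is measured against, so the z-score method can miss outliers in small samples where the IQR method catches them.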
Use Cases:
- Fraud detection
- Data cleansing
- Anomaly detection
- Quality assurance
## Example Workflows

Data exploration:

```python
# 1. Get descriptive statistics
desc = stat_describe(
    table_name="sales",
    column_name="revenue"
)

# 2. Check distribution
dist = stat_distribution(
    table_name="sales",
    column_name="revenue",
    num_bins=20
)

# 3. Detect outliers
outliers = stat_outliers(
    table_name="sales",
    column_name="revenue",
    method="iqr"
)
```
Correlation and regression analysis:

```python
# 1. Check correlations
corr_price_sales = stat_correlation(
    table_name="products",
    column1="price",
    column2="units_sold"
)
corr_marketing_revenue = stat_correlation(
    table_name="campaigns",
    column1="ad_spend",
    column2="revenue"
)

# 2. Build regression model
regression = stat_regression(
    table_name="campaigns",
    x_column="ad_spend",
    y_column="revenue"
)
```
Time-series monitoring:

```python
# 1. Calculate moving averages
time_series = stat_time_series(
    table_name="daily_metrics",
    time_column="date",
    value_column="revenue",
    window_size=7
)

# 2. Get percentiles for SLA
percentiles = stat_percentile(
    table_name="daily_metrics",
    column_name="response_time",
    percentiles=[50, 90, 95, 99]
)

# 3. Detect anomalies
outliers = stat_outliers(
    table_name="daily_metrics",
    column_name="revenue",
    method="zscore",
    threshold=3.0
)
```
Data quality checks:

```python
# Always check for nulls and outliers first
desc = stat_describe(table_name="data", column_name="value")
if desc["count"] < total_rows:  # total_rows obtained separately
    print("Warning: Missing values detected")

outliers = stat_outliers(table_name="data", column_name="value")
if outliers["outlier_percentage"] > 0.05:
    print("Warning: >5% outliers detected")
```
Grouped analysis:

```python
# Regional analysis
stat_describe(
    table_name="sales",
    column_name="revenue",
    group_by="region"
)

# Time-based analysis
stat_aggregates(
    table_name="metrics",
    column_name="value",
    group_by="DATE_TRUNC('month', timestamp)"
)
```
Checking correlation significance:

```python
corr = stat_correlation(table_name="data", column1="x", column2="y")

# Check statistical significance
if corr["p_value"] < 0.05:
    print("Correlation is statistically significant")

# Check sample size
if corr["sample_size"] < 30:
    print("Warning: Small sample size")
```
- Core Database Tools - Basic SQL operations
- Performance Intelligence - Query optimization
- Monitoring & Alerting - Real-time metrics
See Home for more tool categories.