
Statistical Analysis

Temp edited this page Oct 3, 2025 · 1 revision

Statistical Analysis Tools

8 specialized tools for descriptive statistics, correlation analysis, and time-series operations in PostgreSQL.


📊 Overview

| Tool | Purpose | Statistical Method |
| --- | --- | --- |
| stat_describe | Descriptive statistics | Mean, median, stddev, percentiles |
| stat_correlation | Correlation analysis | Pearson correlation coefficient |
| stat_regression | Linear regression | Least squares regression |
| stat_distribution | Distribution analysis | Histogram, quartiles, outliers |
| stat_time_series | Time-series analysis | Moving averages, trends |
| stat_aggregates | Advanced aggregations | Mode, variance, skewness |
| stat_percentile | Percentile calculations | Custom percentile ranges |
| stat_outliers | Outlier detection | IQR method, z-score |

🔧 Tool Details

stat_describe

Calculate comprehensive descriptive statistics for numeric columns.

Parameters:

  • table_name (string, required): Source table
  • column_name (string, required): Numeric column to analyze
  • group_by (string, optional): Column to group by

Returns:

  • count: Number of values
  • mean: Average value
  • median: Middle value
  • mode: Most frequent value
  • stddev: Standard deviation
  • variance: Variance
  • min: Minimum value
  • max: Maximum value
  • quartiles: Q1, Q2 (median), Q3
  • iqr: Interquartile range

Example:

result = stat_describe(
    table_name="sales",
    column_name="amount"
)
# Returns: {
#   "count": 10000,
#   "mean": 125.45,
#   "median": 98.50,
#   "stddev": 45.23,
#   "min": 5.00,
#   "max": 500.00,
#   "quartiles": {"q1": 65.00, "q2": 98.50, "q3": 145.00},
#   "iqr": 80.00
# }

# With grouping
result = stat_describe(
    table_name="sales",
    column_name="amount",
    group_by="region"
)

Use Cases:

  • Data exploration
  • Quality assurance
  • Performance baselines
  • Report generation
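For reference, the same summary statistics can be computed from a plain list of values with Python's standard library. This is an illustrative sketch of the definitions above, not the tool's actual implementation; quartile interpolation details may differ from PostgreSQL's.

```python
import statistics

def describe(values):
    """Core fields of a stat_describe-style summary, from a list of numbers."""
    # Quartile cut points by linear interpolation (Python's default "exclusive"
    # method); PostgreSQL's percentile_cont may interpolate slightly differently.
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "stddev": statistics.stdev(values),  # sample standard deviation
        "variance": statistics.variance(values),
        "min": min(values),
        "max": max(values),
        "quartiles": {"q1": q1, "q2": q2, "q3": q3},
        "iqr": q3 - q1,
    }

summary = describe([5.0, 65.0, 98.5, 145.0, 500.0])
```

Note that `iqr` is always `q3 - q1` by definition, a useful sanity check against any tool output.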

stat_correlation

Calculate correlation between two numeric columns.

Parameters:

  • table_name (string, required): Source table
  • column1 (string, required): First numeric column
  • column2 (string, required): Second numeric column

Returns:

  • correlation: Pearson correlation coefficient (-1 to 1)
  • p_value: Statistical significance
  • sample_size: Number of data points
  • interpretation: Human-readable interpretation

Example:

result = stat_correlation(
    table_name="products",
    column1="price",
    column2="sales_volume"
)
# Returns: {
#   "correlation": -0.78,
#   "p_value": 0.0001,
#   "sample_size": 500,
#   "interpretation": "Strong negative correlation"
# }

Interpretation:

  • 1.0: Perfect positive correlation
  • 0.7 to 1.0: Strong positive correlation
  • 0.4 to 0.7: Moderate positive correlation
  • 0.1 to 0.4: Weak positive correlation
  • -0.1 to 0.1: Little or no correlation
  • -0.4 to -0.1: Weak negative correlation
  • -0.7 to -0.4: Moderate negative correlation
  • -1.0 to -0.7: Strong negative correlation
  • -1.0: Perfect negative correlation

Use Cases:

  • Price vs demand analysis
  • Feature selection for ML
  • A/B test validation
  • Business metric relationships

stat_regression

Perform linear regression analysis.

Parameters:

  • table_name (string, required): Source table
  • x_column (string, required): Independent variable
  • y_column (string, required): Dependent variable

Returns:

  • slope: Regression coefficient
  • intercept: Y-intercept
  • r_squared: Coefficient of determination
  • equation: Regression equation
  • interpretation: Human-readable interpretation
  • predictions: Sample predictions

Example:

result = stat_regression(
    table_name="marketing",
    x_column="ad_spend",
    y_column="revenue"
)
# Returns: {
#   "slope": 3.45,
#   "intercept": 1000.00,
#   "r_squared": 0.85,
#   "equation": "y = 3.45x + 1000.00",
#   "interpretation": "For every $1 in ad spend, revenue increases by $3.45"
# }

Use Cases:

  • Revenue forecasting
  • Cost prediction
  • Trend analysis
  • ROI calculation
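Least squares itself is a short computation. The sketch below is illustrative only (not the tool's code), using hypothetical ad_spend/revenue pairs generated exactly on the y = 3.45x + 1000 line from the example above:

```python
def linreg(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept, plus R^2."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = my - slope * mx
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    r2 = 1 - ss_res / ss_tot
    return slope, intercept, r2

xs = [100.0, 200.0, 300.0, 400.0]
ys = [1345.0, 1690.0, 2035.0, 2380.0]  # exactly y = 3.45x + 1000
slope, intercept, r2 = linreg(xs, ys)
```

Because the points lie exactly on the line, R² comes out as 1.0; real data, like the example's 0.85, always falls below that.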

stat_distribution

Analyze data distribution with histograms and frequency analysis.

Parameters:

  • table_name (string, required): Source table
  • column_name (string, required): Column to analyze
  • num_bins (integer, optional): Number of histogram bins (default: 10)

Returns:

  • histogram: Frequency distribution
  • quartiles: Q1, Q2, Q3, Q4
  • outliers: Values outside normal range
  • skewness: Distribution skew
  • kurtosis: Distribution tailedness (often described as peakedness)

Example:

result = stat_distribution(
    table_name="orders",
    column_name="order_value",
    num_bins=10
)
# Returns: {
#   "histogram": [
#     {"bin": "0-50", "count": 120},
#     {"bin": "50-100", "count": 450},
#     {"bin": "100-150", "count": 380},
#     ...
#   ],
#   "quartiles": [25, 75, 125, 250],
#   "outliers": {"low": [], "high": [850, 920, 1050]},
#   "skewness": 0.45,
#   "kurtosis": 2.1
# }

Use Cases:

  • Data quality checks
  • Anomaly detection
  • Pricing strategy
  • Inventory optimization
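Equal-width binning, as the histogram output above suggests, is straightforward to sketch in plain Python. The "lo-hi" bin-label format is an assumption based on the example output:

```python
def histogram(values, num_bins=10):
    """Equal-width bins over [min, max]; the max value lands in the last bin."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Clamp so the maximum value falls into the last bin, not past it.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    return [
        {"bin": f"{lo + i * width:g}-{lo + (i + 1) * width:g}", "count": c}
        for i, c in enumerate(counts)
    ]

bins = histogram([5, 12, 18, 25, 33, 41, 49, 50], num_bins=5)
```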

stat_time_series

Analyze time-series data with moving averages and trends.

Parameters:

  • table_name (string, required): Source table
  • time_column (string, required): Timestamp/date column
  • value_column (string, required): Numeric value column
  • window_size (integer, optional): Moving average window (default: 7)

Returns:

  • moving_average: Rolling average values
  • trend: Linear trend direction
  • seasonality: Detected patterns
  • forecast: Next period prediction

Example:

result = stat_time_series(
    table_name="daily_sales",
    time_column="sale_date",
    value_column="total_amount",
    window_size=7
)
# Returns: {
#   "moving_average": [
#     {"date": "2025-10-01", "value": 1250.50, "ma_7": 1200.00},
#     {"date": "2025-10-02", "value": 1300.00, "ma_7": 1225.00},
#     ...
#   ],
#   "trend": "increasing",
#   "trend_slope": 15.5,
#   "forecast_next": 1450.00
# }

Use Cases:

  • Sales forecasting
  • Demand planning
  • Performance monitoring
  • Capacity planning
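A trailing moving average like the ma_7 field above can be sketched in a few lines. Whether the tool emits nulls for, or simply skips, the first window_size - 1 points is an assumption here:

```python
def moving_average(values, window_size=7):
    """Trailing moving average over the last `window_size` points."""
    out = []
    for i in range(len(values)):
        if i + 1 < window_size:
            out.append(None)  # not enough history yet
        else:
            window = values[i + 1 - window_size : i + 1]
            out.append(sum(window) / window_size)
    return out

daily = [100, 110, 120, 130, 140, 150, 160, 170]
ma = moving_average(daily, window_size=7)
```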

stat_aggregates

Calculate advanced statistical aggregates.

Parameters:

  • table_name (string, required): Source table
  • column_name (string, required): Column to analyze
  • group_by (string, optional): Grouping column

Returns:

  • mode: Most frequent value
  • variance: Variance
  • stddev_pop: Population standard deviation
  • stddev_samp: Sample standard deviation
  • coef_variation: Coefficient of variation

Example:

result = stat_aggregates(
    table_name="sensor_data",
    column_name="temperature",
    group_by="device_id"
)
# Returns: {
#   "groups": [
#     {
#       "device_id": "sensor_1",
#       "mode": 22.5,
#       "variance": 4.2,
#       "stddev_pop": 2.05,
#       "coef_variation": 0.091
#     },
#     ...
#   ]
# }

Use Cases:

  • Quality control
  • Process monitoring
  • Variance analysis
  • Data validation
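The population/sample distinction in the returns above maps directly onto Python's statistics module, as the sketch below shows. Whether coef_variation is computed from the population or the sample standard deviation is an assumption in this sketch:

```python
import statistics

temps = [20.0, 21.5, 22.5, 22.5, 24.0]
agg = {
    "mode": statistics.mode(temps),           # most frequent value
    "variance": statistics.pvariance(temps),  # population variance
    "stddev_pop": statistics.pstdev(temps),   # divides by N
    "stddev_samp": statistics.stdev(temps),   # divides by N - 1 (Bessel's correction)
}
# Coefficient of variation: spread relative to the mean (unitless).
agg["coef_variation"] = agg["stddev_pop"] / statistics.mean(temps)
```

The sample figure is always slightly larger than the population figure for the same data, which is a quick way to tell which one a tool is reporting.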

stat_percentile

Calculate custom percentiles and quantiles.

Parameters:

  • table_name (string, required): Source table
  • column_name (string, required): Numeric column
  • percentiles (list, required): List of percentiles (0-100)

Returns:

  • Percentile values for requested percentiles

Example:

result = stat_percentile(
    table_name="response_times",
    column_name="latency_ms",
    percentiles=[50, 90, 95, 99]
)
# Returns: {
#   "p50": 45.2,
#   "p90": 125.8,
#   "p95": 180.5,
#   "p99": 450.0
# }

Use Cases:

  • SLA monitoring
  • Performance analysis
  • Capacity planning
  • Quality metrics
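For reference, the linear-interpolation percentile (the semantics of PostgreSQL's percentile_cont, which a tool like this would plausibly use) can be sketched as:

```python
def percentile(values, p):
    """p-th percentile (0-100) with linear interpolation between ranks."""
    data = sorted(values)
    k = (len(data) - 1) * p / 100  # fractional rank into the sorted data
    lo = int(k)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (k - lo) * (data[hi] - data[lo])

latencies = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
result = {f"p{p}": percentile(latencies, p) for p in (50, 90, 95, 99)}
```

With only ten samples the p99 is interpolated between the two largest values, a reminder that tail percentiles need large sample sizes to be meaningful.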

stat_outliers

Detect outliers using statistical methods.

Parameters:

  • table_name (string, required): Source table
  • column_name (string, required): Column to analyze
  • method (string, optional): Detection method: "iqr" (default) or "zscore"
  • threshold (number, optional): Detection threshold (e.g. 3.0 standard deviations for the z-score method)

Returns:

  • outliers: List of outlier values
  • lower_bound: Lower threshold
  • upper_bound: Upper threshold
  • outlier_count: Number of outliers
  • outlier_percentage: Percentage of data

Example:

# IQR method (default)
result = stat_outliers(
    table_name="transactions",
    column_name="amount",
    method="iqr"
)
# Returns: {
#   "method": "iqr",
#   "lower_bound": -50.00,
#   "upper_bound": 350.00,
#   "outliers": [450.00, 520.00, 890.00],
#   "outlier_count": 3,
#   "outlier_percentage": 0.03
# }

# Z-score method
result = stat_outliers(
    table_name="transactions",
    column_name="amount",
    method="zscore",
    threshold=3.0
)

Use Cases:

  • Fraud detection
  • Data cleansing
  • Anomaly detection
  • Quality assurance
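Both detection methods are standard and easy to sketch in plain Python. This is illustrative only; the conventional Tukey fence of k = 1.5 for the IQR method is an assumption, since the tool's default threshold isn't documented above.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Tukey's fences: flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper], lower, upper

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [v for v in values if abs(v - mu) / sigma > threshold]

data = [10, 12, 11, 13, 12, 11, 10, 12, 95]
outliers, lower, upper = iqr_outliers(data)
```

Note that a single extreme point inflates the standard deviation, so the z-score method can miss an outlier that the IQR method catches; the IQR fences are robust because quartiles ignore the extremes.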

🎯 Common Workflows

Data Exploration Workflow

# 1. Get descriptive statistics
desc = stat_describe(
    table_name="sales",
    column_name="revenue"
)

# 2. Check distribution
dist = stat_distribution(
    table_name="sales",
    column_name="revenue",
    num_bins=20
)

# 3. Detect outliers
outliers = stat_outliers(
    table_name="sales",
    column_name="revenue",
    method="iqr"
)

Correlation Analysis Workflow

# 1. Check correlations
corr_price_sales = stat_correlation(
    table_name="products",
    column1="price",
    column2="units_sold"
)
corr_marketing_revenue = stat_correlation(
    table_name="campaigns",
    column1="ad_spend",
    column2="revenue"
)

# 2. Build regression model
regression = stat_regression(
    table_name="campaigns",
    x_column="ad_spend",
    y_column="revenue"
)

Time-Series Analysis Workflow

# 1. Calculate moving averages
time_series = stat_time_series(
    table_name="daily_metrics",
    time_column="date",
    value_column="revenue",
    window_size=7
)

# 2. Get percentiles for SLA
percentiles = stat_percentile(
    table_name="daily_metrics",
    column_name="response_time",
    percentiles=[50, 90, 95, 99]
)

# 3. Detect anomalies
outliers = stat_outliers(
    table_name="daily_metrics",
    column_name="revenue",
    method="zscore",
    threshold=3.0
)

📊 Best Practices

1. Data Quality First

# Always check for nulls and outliers first
desc = stat_describe(table_name="data", column_name="value")
# total_rows is the table's full row count, fetched separately
# (e.g. SELECT count(*)); stat_describe counts only non-null values
if desc["count"] < total_rows:
    print("Warning: Missing values detected")

outliers = stat_outliers(table_name="data", column_name="value")
if outliers["outlier_percentage"] > 0.05:
    print("Warning: >5% outliers detected")

2. Use Appropriate Grouping

# Regional analysis
stat_describe(
    table_name="sales",
    column_name="revenue",
    group_by="region"
)

# Time-based analysis
stat_aggregates(
    table_name="metrics",
    column_name="value",
    group_by="DATE_TRUNC('month', timestamp)"
)

3. Validate Correlations

corr = stat_correlation(table_name="data", column1="x", column2="y")

# Check statistical significance
if corr["p_value"] < 0.05:
    print("Correlation is statistically significant")

# Check sample size
if corr["sample_size"] < 30:
    print("Warning: Small sample size")

See Home for more tool categories.
