Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Monitoring Alerting

Temp edited this page Oct 3, 2025 · 2 revisions

Monitoring & Alerting Tools

Real-time monitoring, alerting, capacity planning, and resource analysis for production PostgreSQL databases.


πŸ“Š Overview

5 enterprise monitoring tools for comprehensive database observability:

Tool Purpose Category
monitor_real_time Real-time performance monitoring Observability
alert_threshold_set Metric threshold analysis and alerting Alerting
capacity_planning Growth projection and capacity forecasting Planning
resource_usage_analyze CPU/Memory/IO resource analysis Performance
replication_monitor Replication status and lag monitoring High Availability

πŸ”§ Tool Details

monitor_real_time

Real-time monitoring of database performance metrics including queries, locks, connections, and I/O.

Parameters:

  • include_queries (boolean, optional): Include currently running queries
  • include_locks (boolean, optional): Include lock information
  • include_io (boolean, optional): Include I/O statistics
  • limit (integer, optional): Limit number of results (default: 10)

Returns:

  • timestamp: Current timestamp
  • metrics.connections: Connection statistics by state
  • metrics.active_queries: Currently running queries
  • metrics.locks: Lock information by type
  • metrics.io_statistics: Heap and index block statistics
  • metrics.database: Database size and modifications

Example:

result = monitor_real_time(
 include_queries=True,
 include_locks=True,
 include_io=True,
 limit=10
)
# Returns: {
# "timestamp": "2025-10-03 15:30:45",
# "metrics": {
# "connections": {
# "total": 25,
# "by_state": [
# {"state": "active", "count": 3},
# {"state": "idle", "count": 22}
# ]
# },
# "active_queries": {...},
# "locks": {"total": 5, "blocked": 0},
# "io_statistics": {
# "heap_hit_ratio_percent": 99.2,
# "index_hit_ratio_percent": 99.8
# }
# }
# }

Use Cases:

  • Production monitoring dashboards
  • Real-time performance troubleshooting
  • Lock contention detection
  • Connection pool monitoring

alert_threshold_set

Analyze database metrics against configurable alert thresholds.

Parameters:

  • metric_type (string, required): Metric to check (cache_hit_ratio, connection_count, transaction_age, database_size, replication_lag)
  • warning_threshold (number, required): Warning threshold value
  • critical_threshold (number, required): Critical threshold value
  • check_current (boolean, optional): Check current value against thresholds

Returns:

  • metric_type: Metric being monitored
  • thresholds: Configured warning and critical thresholds
  • current_value: Current metric value
  • alert_status: Current status (ok, warning, critical)
  • unit: Measurement unit

Example:

# Monitor cache hit ratio
result = alert_threshold_set(
 metric_type="cache_hit_ratio",
 warning_threshold=95.0,
 critical_threshold=90.0,
 check_current=True
)
# Returns: {
# "metric_type": "cache_hit_ratio",
# "thresholds": {"warning": 95.0, "critical": 90.0},
# "current_value": 99.3,
# "alert_status": "ok",
# "unit": "percent"
# }
# Monitor connection pool
result = alert_threshold_set(
 metric_type="connection_count",
 warning_threshold=80,
 critical_threshold=95,
 check_current=True
)

Supported Metrics:

  • cache_hit_ratio - Buffer cache hit percentage
  • connection_count - Active database connections
  • transaction_age - Longest running transaction age (seconds)
  • database_size - Total database size (bytes)
  • replication_lag - Replication lag (seconds, replicas only)

Use Cases:

  • Automated alerting systems
  • Threshold-based monitoring
  • Performance degradation detection
  • Capacity warning systems

capacity_planning

Analyze database growth trends and project future capacity needs.

Parameters:

  • forecast_days (integer, required): Number of days to forecast
  • include_table_growth (boolean, optional): Include per-table growth analysis
  • include_index_growth (boolean, optional): Include per-index growth analysis

Returns:

  • current_state: Current database size and top tables/indexes
  • projections: Growth forecasts for specified period
  • recommendations: Storage and capacity recommendations

Example:

result = capacity_planning(
 forecast_days=90,
 include_table_growth=True,
 include_index_growth=True
)
# Returns: {
# "current_state": {
# "total_size": {"bytes": 50000000000, "gb": 46.57, "pretty": "47 GB"},
# "user_data_size_mb": 35000,
# "index_size_mb": 8500,
# "top_tables": [
# {"table": "orders", "size_mb": 12500, "row_count": 5000000},
# {"table": "users", "size_mb": 8900, "row_count": 2000000}
# ]
# },
# "projections": {
# "forecast_days": 90,
# "estimated_daily_growth_mb": 450,
# "estimated_total_growth_gb": 38.6,
# "projected_total_size_gb": 85.2
# },
# "recommendations": {
# "recommended_storage_gb": 127.8,
# "buffer_percentage": 50,
# "planning_horizon_days": 90
# }
# }

Use Cases:

  • Storage capacity planning
  • Budget forecasting
  • Growth trend analysis
  • Infrastructure scaling decisions

resource_usage_analyze

Analyze CPU, memory, and I/O resource usage patterns.

Parameters:

  • include_cpu (boolean, optional): Include CPU usage analysis
  • include_memory (boolean, optional): Include memory usage analysis
  • include_io (boolean, optional): Include I/O usage analysis

Returns:

  • resource_analysis.memory: Buffer cache and shared memory stats
  • resource_analysis.io: Disk I/O and cache hit ratios
  • resource_analysis.cpu: Query execution time statistics
  • recommendations: Resource optimization suggestions

Example:

result = resource_usage_analyze(
 include_cpu=True,
 include_memory=True,
 include_io=True
)
# Returns: {
# "resource_analysis": {
# "memory": {
# "shared_buffers": "16384",
# "buffer_cache_hit_ratio": 99.5,
# "buffer_hits": 5234567,
# "disk_reads": 25678
# },
# "io": {
# "heap_blocks_from_disk": 12345,
# "heap_blocks_from_cache": 987654,
# "heap_hit_ratio": 98.8,
# "index_blocks_from_disk": 3456,
# "index_blocks_from_cache": 654321,
# "index_hit_ratio": 99.5
# },
# "cpu": {
# "total_execution_time_ms": 125678,
# "total_calls": 456789,
# "avg_query_time_ms": 0.275,
# "max_query_time_ms": 1250.45
# }
# },
# "recommendations": [
# {
# "category": "memory",
# "priority": "info",
# "recommendation": "Buffer cache hit ratio excellent at 99.5%"
# }
# ]
# }

Requirements: pg_stat_statements extension for CPU analysis

Use Cases:

  • Performance bottleneck identification
  • Resource optimization
  • Infrastructure right-sizing
  • Cost optimization

replication_monitor

Monitor replication status, lag, and health for primary and replica databases.

Parameters:

  • include_wal_status (boolean, optional): Include WAL sender/receiver status
  • include_slots (boolean, optional): Include replication slot information

Returns:

  • replication_status.is_replica: Whether this is a replica
  • replication_status.role: Database role (primary/replica)
  • replication_status.wal_senders: WAL sender connections (primary only)
  • replication_status.replication_slots: Active replication slots
  • lag_info: Replication lag statistics (replica only)

Example:

# On primary database
result = replication_monitor(
 include_wal_status=True,
 include_slots=True
)
# Returns: {
# "replication_status": {
# "is_replica": False,
# "role": "primary",
# "wal_senders": {
# "count": 2,
# "senders": [
# {
# "application_name": "replica1",
# "client_addr": "10.0.1.5",
# "state": "streaming",
# "sync_state": "async",
# "sent_lsn": "0/5A2F3C0",
# "write_lsn": "0/5A2F3C0",
# "flush_lsn": "0/5A2F3C0"
# }
# ]
# },
# "replication_slots": {
# "total_count": 2,
# "inactive_count": 0,
# "slots": [...]
# }
# }
# }
# On replica database
result = replication_monitor(include_wal_status=True)
# Returns: {
# "replication_status": {
# "is_replica": True,
# "role": "replica"
# },
# "lag_info": {
# "receive_lsn": "0/5A2F3C0",
# "replay_lsn": "0/5A2F380",
# "lag_bytes": 64,
# "lag_seconds": 0.05,
# "is_replaying": True
# }
# }

Use Cases:

  • High availability monitoring
  • Replication health checks
  • Lag detection and alerting
  • Disaster recovery readiness

🎯 Common Workflows

Production Monitoring Dashboard

# 1. Real-time metrics
metrics = monitor_real_time(
 include_queries=True,
 include_locks=True,
 include_io=True
)
# 2. Check thresholds
cache_status = alert_threshold_set(
 metric_type="cache_hit_ratio",
 warning_threshold=95,
 critical_threshold=90,
 check_current=True
)
conn_status = alert_threshold_set(
 metric_type="connection_count",
 warning_threshold=80,
 critical_threshold=95,
 check_current=True
)
# 3. Resource analysis
resources = resource_usage_analyze(
 include_cpu=True,
 include_memory=True,
 include_io=True
)

Capacity Planning Review

# 1. Analyze growth
capacity = capacity_planning(
 forecast_days=90,
 include_table_growth=True,
 include_index_growth=True
)
# 2. Current resource usage
resources = resource_usage_analyze(
 include_cpu=True,
 include_memory=True,
 include_io=True
)
# 3. Project needs
# Use capacity["recommendations"]["recommended_storage_gb"]
# Use capacity["projections"]["projected_total_size_gb"]

Replication Health Check

# 1. Check replication status
repl_status = replication_monitor(
 include_wal_status=True,
 include_slots=True
)
# 2. Monitor lag
if repl_status["replication_status"]["is_replica"]:
 lag_alert = alert_threshold_set(
 metric_type="replication_lag",
 warning_threshold=5,
 critical_threshold=30,
 check_current=True
 )

πŸ“Š Monitoring Best Practices

1. Regular Health Checks

  • Run monitor_real_time() every 1-5 minutes
  • Check alert_threshold_set() for key metrics
  • Review resource_usage_analyze() daily

2. Capacity Planning

  • Run capacity_planning() monthly
  • Track growth trends over time
  • Plan upgrades 3 months in advance

3. Alert Thresholds

Recommended Thresholds:

# Cache hit ratio
cache_warning = 95.0 # Below 95% investigate
cache_critical = 90.0 # Below 90% urgent action
# Connections
conn_warning = 80 # 80% of max_connections
conn_critical = 95 # 95% of max_connections
# Transaction age
txn_warning = 300 # 5 minutes
txn_critical = 1800 # 30 minutes
# Replication lag
lag_warning = 5 # 5 seconds
lag_critical = 30 # 30 seconds

4. Replication Monitoring

  • Monitor lag every 30 seconds
  • Alert on WAL sender disconnections
  • Watch for inactive replication slots

🚨 Alert Response Guide

High Connection Count

# 1. Identify connections
metrics = monitor_real_time(include_queries=True)
# 2. Analyze queries
from Core import get_top_queries
slow_queries = get_top_queries(sort_by="calls", limit=20)
# 3. Check for connection leaks
# Review application connection pooling

Low Cache Hit Ratio

# 1. Analyze resource usage
resources = resource_usage_analyze(include_memory=True, include_io=True)
# 2. Check buffer cache size
# Consider increasing shared_buffers in postgresql.conf
# 3. Review query patterns
# Identify queries causing excessive disk I/O

Replication Lag

# 1. Monitor replication
repl = replication_monitor(include_wal_status=True)
# 2. Check network connectivity
# Verify network bandwidth between primary and replica
# 3. Analyze replica load
# Check if replica is under heavy read load

πŸ“š Related Documentation


See Home for more tool categories.

Clone this wiki locally

AltStyle γ«γ‚ˆγ£γ¦ε€‰ζ›γ•γ‚ŒγŸγƒšγƒΌγ‚Έ (->γ‚ͺγƒͺγ‚ΈγƒŠγƒ«) /