-
Notifications
You must be signed in to change notification settings - Fork 0
Monitoring Alerting
Real-time monitoring, alerting, capacity planning, and resource analysis for production PostgreSQL databases.
5 enterprise monitoring tools for comprehensive database observability:
| Tool | Purpose | Category |
|---|---|---|
monitor_real_time |
Real-time performance monitoring | Observability |
alert_threshold_set |
Metric threshold analysis and alerting | Alerting |
capacity_planning |
Growth projection and capacity forecasting | Planning |
resource_usage_analyze |
CPU/Memory/IO resource analysis | Performance |
replication_monitor |
Replication status and lag monitoring | High Availability |
Real-time monitoring of database performance metrics including queries, locks, connections, and I/O.
Parameters:
-
include_queries(boolean, optional): Include currently running queries -
include_locks(boolean, optional): Include lock information -
include_io(boolean, optional): Include I/O statistics -
limit(integer, optional): Limit number of results (default: 10)
Returns:
-
timestamp: Current timestamp -
metrics.connections: Connection statistics by state -
metrics.active_queries: Currently running queries -
metrics.locks: Lock information by type -
metrics.io_statistics: Heap and index block statistics -
metrics.database: Database size and modifications
Example:
result = monitor_real_time( include_queries=True, include_locks=True, include_io=True, limit=10 ) # Returns: { # "timestamp": "2025-10-03 15:30:45", # "metrics": { # "connections": { # "total": 25, # "by_state": [ # {"state": "active", "count": 3}, # {"state": "idle", "count": 22} # ] # }, # "active_queries": {...}, # "locks": {"total": 5, "blocked": 0}, # "io_statistics": { # "heap_hit_ratio_percent": 99.2, # "index_hit_ratio_percent": 99.8 # } # } # }
Use Cases:
- Production monitoring dashboards
- Real-time performance troubleshooting
- Lock contention detection
- Connection pool monitoring
Analyze database metrics against configurable alert thresholds.
Parameters:
-
metric_type(string, required): Metric to check (cache_hit_ratio,connection_count,transaction_age,database_size,replication_lag) -
warning_threshold(number, required): Warning threshold value -
critical_threshold(number, required): Critical threshold value -
check_current(boolean, optional): Check current value against thresholds
Returns:
-
metric_type: Metric being monitored -
thresholds: Configured warning and critical thresholds -
current_value: Current metric value -
alert_status: Current status (ok,warning,critical) -
unit: Measurement unit
Example:
# Monitor cache hit ratio result = alert_threshold_set( metric_type="cache_hit_ratio", warning_threshold=95.0, critical_threshold=90.0, check_current=True ) # Returns: { # "metric_type": "cache_hit_ratio", # "thresholds": {"warning": 95.0, "critical": 90.0}, # "current_value": 99.3, # "alert_status": "ok", # "unit": "percent" # } # Monitor connection pool result = alert_threshold_set( metric_type="connection_count", warning_threshold=80, critical_threshold=95, check_current=True )
Supported Metrics:
-
cache_hit_ratio- Buffer cache hit percentage -
connection_count- Active database connections -
transaction_age- Longest running transaction age (seconds) -
database_size- Total database size (bytes) -
replication_lag- Replication lag (seconds, replicas only)
Use Cases:
- Automated alerting systems
- Threshold-based monitoring
- Performance degradation detection
- Capacity warning systems
Analyze database growth trends and project future capacity needs.
Parameters:
-
forecast_days(integer, required): Number of days to forecast -
include_table_growth(boolean, optional): Include per-table growth analysis -
include_index_growth(boolean, optional): Include per-index growth analysis
Returns:
-
current_state: Current database size and top tables/indexes -
projections: Growth forecasts for specified period -
recommendations: Storage and capacity recommendations
Example:
result = capacity_planning( forecast_days=90, include_table_growth=True, include_index_growth=True ) # Returns: { # "current_state": { # "total_size": {"bytes": 50000000000, "gb": 46.57, "pretty": "47 GB"}, # "user_data_size_mb": 35000, # "index_size_mb": 8500, # "top_tables": [ # {"table": "orders", "size_mb": 12500, "row_count": 5000000}, # {"table": "users", "size_mb": 8900, "row_count": 2000000} # ] # }, # "projections": { # "forecast_days": 90, # "estimated_daily_growth_mb": 450, # "estimated_total_growth_gb": 38.6, # "projected_total_size_gb": 85.2 # }, # "recommendations": { # "recommended_storage_gb": 127.8, # "buffer_percentage": 50, # "planning_horizon_days": 90 # } # }
Use Cases:
- Storage capacity planning
- Budget forecasting
- Growth trend analysis
- Infrastructure scaling decisions
Analyze CPU, memory, and I/O resource usage patterns.
Parameters:
-
include_cpu(boolean, optional): Include CPU usage analysis -
include_memory(boolean, optional): Include memory usage analysis -
include_io(boolean, optional): Include I/O usage analysis
Returns:
-
resource_analysis.memory: Buffer cache and shared memory stats -
resource_analysis.io: Disk I/O and cache hit ratios -
resource_analysis.cpu: Query execution time statistics -
recommendations: Resource optimization suggestions
Example:
result = resource_usage_analyze( include_cpu=True, include_memory=True, include_io=True ) # Returns: { # "resource_analysis": { # "memory": { # "shared_buffers": "16384", # "buffer_cache_hit_ratio": 99.5, # "buffer_hits": 5234567, # "disk_reads": 25678 # }, # "io": { # "heap_blocks_from_disk": 12345, # "heap_blocks_from_cache": 987654, # "heap_hit_ratio": 98.8, # "index_blocks_from_disk": 3456, # "index_blocks_from_cache": 654321, # "index_hit_ratio": 99.5 # }, # "cpu": { # "total_execution_time_ms": 125678, # "total_calls": 456789, # "avg_query_time_ms": 0.275, # "max_query_time_ms": 1250.45 # } # }, # "recommendations": [ # { # "category": "memory", # "priority": "info", # "recommendation": "Buffer cache hit ratio excellent at 99.5%" # } # ] # }
Requirements: pg_stat_statements extension for CPU analysis
Use Cases:
- Performance bottleneck identification
- Resource optimization
- Infrastructure right-sizing
- Cost optimization
Monitor replication status, lag, and health for primary and replica databases.
Parameters:
-
include_wal_status(boolean, optional): Include WAL sender/receiver status -
include_slots(boolean, optional): Include replication slot information
Returns:
-
replication_status.is_replica: Whether this is a replica -
replication_status.role: Database role (primary/replica) -
replication_status.wal_senders: WAL sender connections (primary only) -
replication_status.replication_slots: Active replication slots -
lag_info: Replication lag statistics (replica only)
Example:
# On primary database result = replication_monitor( include_wal_status=True, include_slots=True ) # Returns: { # "replication_status": { # "is_replica": False, # "role": "primary", # "wal_senders": { # "count": 2, # "senders": [ # { # "application_name": "replica1", # "client_addr": "10.0.1.5", # "state": "streaming", # "sync_state": "async", # "sent_lsn": "0/5A2F3C0", # "write_lsn": "0/5A2F3C0", # "flush_lsn": "0/5A2F3C0" # } # ] # }, # "replication_slots": { # "total_count": 2, # "inactive_count": 0, # "slots": [...] # } # } # } # On replica database result = replication_monitor(include_wal_status=True) # Returns: { # "replication_status": { # "is_replica": True, # "role": "replica" # }, # "lag_info": { # "receive_lsn": "0/5A2F3C0", # "replay_lsn": "0/5A2F380", # "lag_bytes": 64, # "lag_seconds": 0.05, # "is_replaying": True # } # }
Use Cases:
- High availability monitoring
- Replication health checks
- Lag detection and alerting
- Disaster recovery readiness
# 1. Real-time metrics metrics = monitor_real_time( include_queries=True, include_locks=True, include_io=True ) # 2. Check thresholds cache_status = alert_threshold_set( metric_type="cache_hit_ratio", warning_threshold=95, critical_threshold=90, check_current=True ) conn_status = alert_threshold_set( metric_type="connection_count", warning_threshold=80, critical_threshold=95, check_current=True ) # 3. Resource analysis resources = resource_usage_analyze( include_cpu=True, include_memory=True, include_io=True )
# 1. Analyze growth capacity = capacity_planning( forecast_days=90, include_table_growth=True, include_index_growth=True ) # 2. Current resource usage resources = resource_usage_analyze( include_cpu=True, include_memory=True, include_io=True ) # 3. Project needs # Use capacity["recommendations"]["recommended_storage_gb"] # Use capacity["projections"]["projected_total_size_gb"]
# 1. Check replication status repl_status = replication_monitor( include_wal_status=True, include_slots=True ) # 2. Monitor lag if repl_status["replication_status"]["is_replica"]: lag_alert = alert_threshold_set( metric_type="replication_lag", warning_threshold=5, critical_threshold=30, check_current=True )
- Run
monitor_real_time()every 1-5 minutes - Check
alert_threshold_set()for key metrics - Review
resource_usage_analyze()daily
- Run
capacity_planning()monthly - Track growth trends over time
- Plan upgrades 3 months in advance
Recommended Thresholds:
# Cache hit ratio cache_warning = 95.0 # Below 95% investigate cache_critical = 90.0 # Below 90% urgent action # Connections conn_warning = 80 # 80% of max_connections conn_critical = 95 # 95% of max_connections # Transaction age txn_warning = 300 # 5 minutes txn_critical = 1800 # 30 minutes # Replication lag lag_warning = 5 # 5 seconds lag_critical = 30 # 30 seconds
- Monitor lag every 30 seconds
- Alert on WAL sender disconnections
- Watch for inactive replication slots
# 1. Identify connections metrics = monitor_real_time(include_queries=True) # 2. Analyze queries from Core import get_top_queries slow_queries = get_top_queries(sort_by="calls", limit=20) # 3. Check for connection leaks # Review application connection pooling
# 1. Analyze resource usage resources = resource_usage_analyze(include_memory=True, include_io=True) # 2. Check buffer cache size # Consider increasing shared_buffers in postgresql.conf # 3. Review query patterns # Identify queries causing excessive disk I/O
# 1. Monitor replication repl = replication_monitor(include_wal_status=True) # 2. Check network connectivity # Verify network bandwidth between primary and replica # 3. Analyze replica load # Check if replica is under heavy read load
- Core Database Tools - Basic health monitoring
- Performance Intelligence - Query optimization
- Backup & Recovery - Backup strategies
- Security Best Practices - Secure monitoring
See Home for more tool categories.