helmcode/k8s-watchdog-ai

K8s Watchdog AI 🐕

Autonomous Kubernetes cluster observability with AI-powered weekly health reports

Python License: MIT FastAPI

An intelligent Kubernetes monitoring agent that uses Claude AI to autonomously investigate cluster health, analyze metrics from Prometheus, and generate comprehensive weekly PDF reports delivered via Slack.

✨ Features

  • 🤖 AI-Powered Analysis: Claude AI autonomously investigates cluster issues using direct Python tools
  • 📊 Prometheus Integration: Analyzes metrics to detect resource inefficiencies (optional)
  • 🔒 Read-Only by Design: All operations are read-only for safety
  • 📄 PDF Reports: Professional HTML reports converted to PDF via WeasyPrint
  • 📧 Slack Integration: Reports delivered via Slack with detailed tool usage information
  • 🗄️ Historical Tracking: SQLite storage for report history
  • 🚀 REST API: FastAPI server for on-demand report generation
  • ⚡ Graceful Degradation: Works with or without Prometheus

πŸ—οΈ Architecture

┌─────────────────────────────────────────────────────┐
│  Kubernetes Cluster                                 │
│                                                     │
│  ┌──────────────────────────────────────────────┐   │
│  │  K8s Watchdog AI (FastAPI)                   │   │
│  │                                              │   │
│  │  ┌────────────────────────────────────┐      │   │
│  │  │  Claude AI Agent                   │      │   │
│  │  │  - Autonomous investigation        │      │   │
│  │  │  - Tool selection & execution      │      │   │
│  │  │  - Report generation               │      │   │
│  │  └─────────┬──────────────────────────┘      │   │
│  │            │                                 │   │
│  │  ┌─────────▼──────────┐  ┌────────────────┐  │   │
│  │  │  Kubernetes Tools  │  │  Prometheus    │  │   │
│  │  │  - get pods/nodes  │  │  Tools         │  │   │
│  │  │  - describe        │  │  - query       │  │   │
│  │  │  - logs            │  │  - range query │  │   │
│  │  │  - events          │  │  - memory/cpu  │  │   │
│  │  └────────────────────┘  └────────────────┘  │   │
│  │                                              │   │
│  │  ┌────────────────────────────────────┐      │   │
│  │  │  Report Generator & Storage        │      │   │
│  │  │  - WeasyPrint (HTML → PDF)         │      │   │
│  │  │  - SQLite (history)                │      │   │
│  │  │  - Slack Files API v2              │      │   │
│  │  └────────────────────────────────────┘      │   │
│  └──────────────────┬───────────────────────────┘   │
└─────────────────────┼───────────────────────────────┘
                      │
                      ▼
                Slack Webhook

🚀 Quick Start

Prerequisites

  • Kubernetes cluster with kubectl access (or kubeconfig for local development)
  • Anthropic API key
  • Slack webhook URL
  • Slack Bot Token and Channel ID for file uploads
  • Prometheus running in cluster (optional - reports work without it)

Local Development with Docker Compose

# 1. Clone the repository
git clone https://github.com/helmcode/k8s-watchdog-ai.git
cd k8s-watchdog-ai
# 2. Copy and configure environment
cp .env.example .env
# Edit .env with your API keys and settings
# 3. Run with docker-compose
docker-compose up -d
# 4. Trigger report generation
curl -X POST http://localhost:8000/report
# 5. Check status
curl http://localhost:8000/health
# 6. View logs
docker-compose logs -f
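
The two curl calls above can also be issued from Python. The following is a minimal sketch using only the standard library. The host and port come from the commands above; the helper names `build_report_request`, `build_health_request`, and `send` are hypothetical and not part of the project.

```python
import urllib.request

BASE_URL = "http://localhost:8000"  # host and port from the curl commands above


def build_report_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the POST /report request that triggers report generation."""
    return urllib.request.Request(f"{base_url}/report", data=b"", method="POST")


def build_health_request(base_url: str = BASE_URL) -> urllib.request.Request:
    """Build the GET /health request used as a status check."""
    return urllib.request.Request(f"{base_url}/health", method="GET")


def send(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send a request and return the HTTP status code (needs a running server)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

With the containers running, `send(build_report_request())` triggers a report and `send(build_health_request())` checks that the server is up.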

Deploy to Kubernetes with Helm

# 1. Store secrets in Vault (if using Vault)
vault kv put helmcode_platform/k8s_watchdog_ai \
 ANTHROPIC_API_KEY="sk-ant-..." \
 SLACK_WEBHOOK_URL="https://hooks.slack.com/..." \
 SLACK_BOT_TOKEN="xoxb-..." \
 SLACK_CHANNEL="C123456789"
# 2. Install with Helm
helm install k8s-watchdog-ai ./helm \
 --namespace watchdog-ai \
 --create-namespace \
 --values ./helm/values/prod.yaml
# 3. Verify deployment
kubectl get pods -n watchdog-ai
kubectl logs -f deployment/k8s-watchdog-ai -n watchdog-ai

For detailed Helm deployment instructions, see helm/README.md.

Deploy with ArgoCD

# Apply ArgoCD Application
kubectl apply -f helm/argocd/application.yaml
# Monitor deployment
argocd app get k8s-watchdog-ai

For ArgoCD configuration details, see helm/argocd/README.md.

⚙️ Configuration

| Variable | Required | Default | Description |
| --- | --- | --- | --- |
| ANTHROPIC_API_KEY | ✅ | - | Claude API key |
| ANTHROPIC_MODEL | ❌ | claude-sonnet-4-20250514 | AI model to use |
| SLACK_WEBHOOK_URL | ✅ | - | Slack webhook for messages |
| SLACK_BOT_TOKEN | ✅ | - | Bot token for file uploads |
| SLACK_CHANNEL | ✅ | - | Channel ID (e.g., C123456789) |
| PROMETHEUS_URL | ❌ | http://prometheus:9090 | Prometheus server URL |
| CLUSTER_NAME | ❌ | default | Cluster identifier |
| CLIENT_NAME | ❌ | default | Client/customer name |
| EXCLUDED_NAMESPACES | ❌ | kube-system,kube-public,... | Namespaces to exclude |
| REPORT_LANGUAGE | ❌ | spanish | Report language (spanish/english) |
| JOB_POLL_INTERVAL | ❌ | 5 | Seconds between queue polls |
| JOB_MAX_RETRIES | ❌ | 3 | Max retry attempts for failed jobs |
| SQLITE_PATH | ❌ | /app/data/reports.db | SQLite database path |
| LOG_LEVEL | ❌ | INFO | Logging level |

See .env.example for complete list.
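
To illustrate how the required and optional variables might be resolved, here is a minimal sketch that merges environment values over the documented defaults. It is not the project's actual loader: the function name `load_settings` and the dictionary layout are assumptions, and `EXCLUDED_NAMESPACES` is omitted because its default is abbreviated in the table.

```python
import os
from typing import Dict, Mapping, Optional

# Variables marked as required in the configuration table.
REQUIRED = ("ANTHROPIC_API_KEY", "SLACK_WEBHOOK_URL", "SLACK_BOT_TOKEN", "SLACK_CHANNEL")

# Defaults copied from the configuration table.
DEFAULTS: Dict[str, str] = {
    "ANTHROPIC_MODEL": "claude-sonnet-4-20250514",
    "PROMETHEUS_URL": "http://prometheus:9090",
    "CLUSTER_NAME": "default",
    "CLIENT_NAME": "default",
    "REPORT_LANGUAGE": "spanish",
    "JOB_POLL_INTERVAL": "5",
    "JOB_MAX_RETRIES": "3",
    "SQLITE_PATH": "/app/data/reports.db",
    "LOG_LEVEL": "INFO",
}

INTEGER_KEYS = ("JOB_POLL_INTERVAL", "JOB_MAX_RETRIES")


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, object]:
    """Merge environment values over the defaults and validate required keys."""
    source = os.environ if env is None else env
    missing = [name for name in REQUIRED if not source.get(name)]
    if missing:
        raise ValueError("Missing required variables: " + ", ".join(missing))
    settings: Dict[str, object] = {name: source[name] for name in REQUIRED}
    for name, default in DEFAULTS.items():
        settings[name] = source.get(name, default)
    for name in INTEGER_KEYS:
        settings[name] = int(str(settings[name]))  # numeric settings arrive as strings
    return settings
```

Calling `load_settings()` with no argument reads `os.environ`; passing a plain dictionary makes the function easy to exercise in tests.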

📋 How It Works

  1. FastAPI Server: Runs continuously, exposing the /report, /health, and /stats endpoints
  2. Trigger: Can be called via HTTP POST or scheduled with Kubernetes CronJob
  3. AI Investigation:
    • Claude receives a system prompt with available tools
    • Agent autonomously decides what to investigate
    • Makes iterative queries to Kubernetes and Prometheus (if available)
  4. Analysis: AI analyzes cluster health, resource usage, and metrics
  5. Report Generation: Creates HTML report, converts to PDF with WeasyPrint
  6. Delivery: Uploads PDF to Slack with detailed tool usage information
  7. Storage: Saves report to SQLite for history tracking
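
The investigation step (step 3) is essentially a loop: ask the model what to do next, run the requested read-only tool, and feed the result back until the model produces a report. The sketch below shows that loop shape only. The `decide` callable stands in for the Claude API call, and the tool body, the action dictionary format, and all names are hypothetical.

```python
from typing import Callable, Dict, List


def kubectl_get_pods(namespace: str) -> str:
    """Hypothetical read-only tool; a real one would query the Kubernetes API."""
    return f"pods in {namespace}: api-0 (restarts=15)"


# Registry of tools the agent may call (assumed structure).
TOOLS: Dict[str, Callable[..., str]] = {"kubectl_get_pods": kubectl_get_pods}


def run_investigation(decide: Callable[[List[dict]], dict], max_steps: int = 10) -> str:
    """Ask for the next action, run the tool, record the result, repeat."""
    history: List[dict] = []
    for _ in range(max_steps):
        action = decide(history)
        if action["type"] == "final":
            return action["report"]
        tool = TOOLS.get(action["tool"])
        if tool is None:
            result = f"unknown tool: {action['tool']}"
        else:
            result = tool(**action["args"])
        history.append({"tool": action["tool"], "args": action["args"], "result": result})
    return "investigation stopped: step limit reached"


def scripted_decide(history: List[dict]) -> dict:
    """Stand-in for the model: one tool call, then a final HTML report."""
    if not history:
        return {"type": "tool", "tool": "kubectl_get_pods", "args": {"namespace": "production"}}
    return {"type": "final", "report": f"<h1>Report</h1><p>{history[-1]['result']}</p>"}
```

The `max_steps` cap is a design choice of this sketch: it bounds how long an autonomous loop can run if the model never returns a final report.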

Example AI Investigation Flow

Claude: "Let me check the overall pod status"
→ Calls: kubectl_get_pods(namespace="default", all_namespaces=True)
Claude: "I see pod X has 15 restarts. Let me investigate"
→ Calls: kubectl_describe_pod(pod="X", namespace="production")
→ Calls: kubectl_get_pod_logs(pod="X", namespace="production", tail=100)
Claude: "This looks like OOMKilled. Let me check memory metrics"
→ Calls: prometheus_check_pod_memory(pod="X", namespace="production")
→ Calls: prometheus_query(query="container_memory_working_set_bytes{pod='X'}")
Claude: "Memory usage is consistently above request. Recommending increase"
→ Generates HTML report with specific recommendations
→ Report includes: issue analysis, metrics charts, action plan
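
The Prometheus step of this flow can be sketched as follows. The example builds a PromQL selector for a pod's working-set memory, turns it into a URL for Prometheus' instant-query endpoint (`/api/v1/query`), and checks whether samples stay above a memory request. The helper names are hypothetical, and how the project's own `prometheus_query` tool is implemented is not shown in this README.

```python
from typing import Sequence
from urllib.parse import urlencode


def working_set_query(pod: str, namespace: str) -> str:
    """PromQL selector for a pod's working-set memory."""
    return f'container_memory_working_set_bytes{{pod="{pod}", namespace="{namespace}"}}'


def instant_query_url(base_url: str, promql: str) -> str:
    """URL for the Prometheus instant-query endpoint (GET /api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def consistently_above(samples: Sequence[float], request_bytes: float) -> bool:
    """True when there is at least one sample and every sample exceeds the request."""
    return len(samples) > 0 and all(value > request_bytes for value in samples)
```

For example, `instant_query_url("http://prometheus:9090", working_set_query("X", "production"))` yields a URL that can be fetched with any HTTP client; the PromQL expression is percent-encoded by `urlencode`.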

Tool Availability Detection

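The agent detects which data sources are reachable and reports the status of each. When Prometheus cannot be reached, the report is generated from Kubernetes data only, as in this sample status output: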

✅ Kubernetes API: 5 tool types used
 • Tools: kubectl_describe_pod, kubectl_get_deployments, kubectl_get_events, ...
❌ Prometheus: Connection failed
 • Prometheus not available: All connection attempts failed

ℹ️ Report generated using Kubernetes data only
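
A status summary like this one could be produced from per-source results. The following sketch is an illustration under assumptions: the `SourceStatus` dataclass and `format_status` function are hypothetical names, and the exact wording of the project's real output may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SourceStatus:
    name: str
    tools_used: List[str] = field(default_factory=list)
    error: Optional[str] = None  # None means the source was reachable


def format_status(sources: List[SourceStatus]) -> str:
    """Render one status block per source, plus a note when some are missing."""
    lines: List[str] = []
    for source in sources:
        if source.error is None:
            tools = sorted(set(source.tools_used))  # count distinct tool types
            lines.append(f"✅ {source.name}: {len(tools)} tool types used")
            lines.append(f" • Tools: {', '.join(tools)}")
        else:
            lines.append(f"❌ {source.name}: Connection failed")
            lines.append(f" • {source.name} not available: {source.error}")
    available = [s.name for s in sources if s.error is None]
    if available and len(available) < len(sources):
        lines.append(f"ℹ️ Report generated using {' and '.join(available)} data only")
    return "\n".join(lines)
```

Recording the error per source, rather than raising, is what lets report generation continue when only one data source is down.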

📊 Report Structure

Reports include:

  1. Executive Summary: Overall health status (🟢🟡🔴)
  2. Top Issues: 3-5 critical problems with severity levels
  3. Resource Analysis: Over/under-provisioned workloads
  4. Prometheus Metrics: CPU, memory, disk usage (when available)
  5. Action Plan: Prioritized, actionable recommendations
  6. Footer: Generated by Watchdog AI - Helmcode

The PDF report is accompanied by a Slack message showing:

  • Report generation time
  • Data sources used (Kubernetes API, Prometheus)
  • Tool usage statistics
  • Connection status for each service

🛠️ Development

# Install dependencies
pip install -e ".[dev]"
# Run locally (requires kubeconfig)
python -m src.main
# Format code
black src/
ruff check src/
# Type check
mypy src/
# Build Docker image
docker build -t k8s-watchdog-ai:latest .

🔐 Security

  • Read-only access: All operations are read-only (get, list, watch, describe, logs)
  • RBAC: Minimal permissions required in Kubernetes
  • No cluster modifications: Agent cannot modify cluster state
  • Secrets management: Kubernetes secrets for sensitive data
  • Connection errors: Gracefully handles unavailable services
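
In Kubernetes RBAC terms, read-only access means granting only the `get`, `list`, and `watch` verbs. `describe` and `logs` are kubectl commands rather than RBAC verbs: `describe` relies on `get`/`list`, and `logs` reads the `pods/log` subresource with `get`. The sketch below expresses such a role as a Python dictionary and checks that it contains no write verbs. The role name and resource list are hypothetical; the project's real RBAC manifest lives in its Helm chart and is not reproduced here.

```python
from typing import Any, Dict

READ_ONLY_VERBS = {"get", "list", "watch"}

# Hypothetical ClusterRole; the name and resource list are illustrative only.
READ_ONLY_ROLE: Dict[str, Any] = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "ClusterRole",
    "metadata": {"name": "k8s-watchdog-ai-readonly"},
    "rules": [
        {
            "apiGroups": [""],
            "resources": ["pods", "pods/log", "nodes", "events"],
            "verbs": ["get", "list", "watch"],
        },
        {
            "apiGroups": ["apps"],
            "resources": ["deployments"],
            "verbs": ["get", "list", "watch"],
        },
    ],
}


def is_read_only(role: Dict[str, Any]) -> bool:
    """True when every rule grants only get/list/watch (wildcards fail the check)."""
    return all(set(rule.get("verbs", [])) <= READ_ONLY_VERBS for rule in role.get("rules", []))
```

A check like `is_read_only` could run in CI against the rendered manifest to catch an accidental `create`, `delete`, or `*` verb.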

📚 API Endpoints

  • POST /report - Trigger report generation and delivery on demand (returns 202 Accepted; the report itself is produced asynchronously)
  • GET /health - Health check endpoint
  • GET /stats - Report generation statistics
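
A 202 Accepted response, together with the `JOB_POLL_INTERVAL` and `JOB_MAX_RETRIES` settings, suggests that report requests are queued and processed by a background worker. The sketch below shows one way such a queue with retries could behave. It is an in-memory illustration, not the project's implementation: the `Job` and `JobQueue` names are hypothetical, and it assumes `max_retries` counts retries after the first attempt.

```python
import itertools
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict


@dataclass
class Job:
    id: int
    status: str = "queued"  # queued -> done | failed
    attempts: int = 0


class JobQueue:
    def __init__(self, max_retries: int = 3) -> None:
        self.max_retries = max_retries  # assumed: retries after the first attempt
        self._ids = itertools.count(1)
        self._pending: Deque[Job] = deque()
        self.jobs: Dict[int, Job] = {}

    def enqueue(self) -> Job:
        """Register a job and return at once, as a 202 response would."""
        job = Job(id=next(self._ids))
        self.jobs[job.id] = job
        self._pending.append(job)
        return job

    def poll_once(self, generate_report: Callable[[Job], None]) -> None:
        """Process at most one pending job; a worker would call this on each poll."""
        if not self._pending:
            return
        job = self._pending.popleft()
        job.attempts += 1
        try:
            generate_report(job)
            job.status = "done"
        except Exception:
            if job.attempts > self.max_retries:
                job.status = "failed"
            else:
                self._pending.append(job)  # retry on a later poll
```

A worker loop would call `poll_once` and then sleep for the poll interval; keeping that sleep outside the class makes the retry logic testable without waiting.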

📚 Documentation

🤝 Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Test thoroughly
  5. Submit a pull request

📄 License

MIT License - see LICENSE for details.

🙏 Acknowledgments


Made with ❤️ by Helmcode
