Infrastructure-as-Code, CI/CD pipelines, Kubernetes, and multi-cloud automation. Production-grade reliability engineering with a focus on cost efficiency and self-healing systems.
Terraform · Kubernetes · Docker · GitHub Actions · Prometheus
Problem: Maintaining deployments, secrets, and compliance across AWS and Azure simultaneously without a unified control plane.
Architecture:
- GitHub Actions for pipeline orchestration
- Terraform modules for AWS (EKS) and Azure (AKS) provisioning
- HashiCorp Vault for dynamic secrets with auto-rotation
- Open Policy Agent (OPA) for pre-deployment policy enforcement (no public S3 buckets, no privileged containers)
- LaunchDarkly for canary/feature-flag releases
- Slack webhooks for deployment notifications
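
As a rough illustration of the pre-deployment policy gate, the sketch below applies the two example rules (no public S3 buckets, no privileged containers) to the JSON produced by `terraform show -json`. It is a Python stand-in for the idea, not the Rego policies OPA would evaluate; the function names and the set of resource types checked are assumptions made for illustration.

```python
from typing import Any, Iterator

# ACL values that make a bucket publicly readable or writable.
PUBLIC_ACLS = {"public-read", "public-read-write"}


def _walk(node: Any) -> Iterator[tuple[str, Any]]:
    """Yield every (key, value) pair found anywhere inside nested dicts/lists."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield key, value
            yield from _walk(value)
    elif isinstance(node, list):
        for item in node:
            yield from _walk(item)


def find_violations(plan: dict) -> list[str]:
    """Return human-readable violations for a parsed Terraform plan JSON."""
    violations = []
    for rc in plan.get("resource_changes", []):
        # "after" is null for pure deletes, so fall back to an empty dict.
        after = (rc.get("change") or {}).get("after") or {}
        rtype = rc.get("type", "")
        address = rc.get("address", "<unknown>")

        # Rule 1 (assumed scope): bucket or bucket-ACL resources with a public ACL.
        if rtype in ("aws_s3_bucket", "aws_s3_bucket_acl") and after.get("acl") in PUBLIC_ACLS:
            violations.append(f"{address}: public S3 ACL '{after['acl']}'")

        # Rule 2 (assumed scope): any Kubernetes-provider resource whose
        # planned state contains privileged = true at any nesting depth.
        if rtype.startswith("kubernetes_") and any(
            key == "privileged" and value is True for key, value in _walk(after)
        ):
            violations.append(f"{address}: privileged container")
    return violations
```

In a pipeline step, the plan JSON would be loaded with `json.load` and the job failed whenever `find_violations` returns a non-empty list, so `terraform apply` never runs on a non-compliant plan.
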
Key decisions:
- Chose Vault over AWS Secrets Manager to stay cloud-agnostic
- OPA policies run as a GitHub Actions step before `terraform apply` (shift-left compliance)
- Canary deployments roll out to 5% of traffic via weighted K8s services before full cutover
Stack: Terraform · GitHub Actions · Vault · OPA · Kubernetes · Helm · Slack API
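
One way to approximate a 5% canary with plain Kubernetes Services is to let a single Service select both stable and canary pods, so the traffic share roughly follows the replica ratio. The sketch below computes that split; it assumes this replica-ratio approach (the project may weight traffic differently), and `canary_split` and `actual_share` are hypothetical helper names.

```python
def canary_split(total_replicas: int, canary_percent: float) -> tuple[int, int]:
    """Return (stable, canary) replica counts approximating canary_percent."""
    if total_replicas < 2:
        raise ValueError("need at least 2 replicas to run stable and canary side by side")
    if not 0 < canary_percent < 100:
        raise ValueError("canary_percent must be between 0 and 100 (exclusive)")
    # At least one canary pod, and at least one stable pod.
    canary = max(1, round(total_replicas * canary_percent / 100))
    canary = min(canary, total_replicas - 1)
    return total_replicas - canary, canary


def actual_share(stable: int, canary: int) -> float:
    """Percentage of traffic the canary receives if load is spread evenly per pod."""
    return 100 * canary / (stable + canary)


stable, canary = canary_split(20, 5)
print(stable, canary, actual_share(stable, canary))  # 19 1 5.0
```

The granularity depends on the replica count: with only 10 replicas the smallest possible canary is one pod, which is 10% of traffic rather than 5%.
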
Problem: Cloud spend spiralling due to idle resources and over-provisioned instances.
Architecture:
- Infracost integrated into GitHub Actions PRs: cost diff shown before merge
- AWS Lambda (scheduled) scans for idle EC2, unattached EBS volumes, and unused RDS snapshots
- Results pushed to a Grafana dashboard (backed by TimescaleDB)
- Slack alerts when weekly spend exceeds defined thresholds
- Auto-generates Terraform `destroy` plans for approved idle resources
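
The idle-resource scan can be sketched for one resource type, unattached EBS volumes. The function below filters records shaped like the `Volumes` entries of an EC2 `DescribeVolumes` response (`VolumeId`, `Size`, `State`, `Attachments`, `CreateTime`). The `min_age_days` grace period and the function name are assumptions, and creation time is only a stand-in for how long a volume has actually been detached.

```python
from datetime import datetime, timedelta, timezone


def find_unattached_volumes(volumes: list[dict], now: datetime, min_age_days: int = 7) -> list[dict]:
    """Return volumes that are unattached and older than the grace period."""
    cutoff = now - timedelta(days=min_age_days)
    idle = []
    for vol in volumes:
        # An unattached volume reports State "available" and has no attachments.
        unattached = vol.get("State") == "available" and not vol.get("Attachments")
        if unattached and vol["CreateTime"] <= cutoff:
            idle.append({"VolumeId": vol["VolumeId"], "SizeGiB": vol["Size"]})
    return idle
```

Inside the scheduled Lambda, `volumes` would come from the EC2 API, and the returned list would feed both the dashboard and the approval step that precedes any destroy plan.
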
Key decisions:
- TimescaleDB over plain Postgres for efficient time-series cost queries
- Lambda runs on a cron schedule, so the cost tracker itself has no always-on infra cost (irony avoided)
Stack: Terraform · Pulumi · Infracost · AWS Lambda · Grafana · TimescaleDB · Slack API
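
The weekly spend alert reduces to a threshold comparison plus a webhook call. This is a minimal sketch using only the standard library; the function names, message wording, and currency are assumptions, and the payload uses the simple `{"text": ...}` form accepted by Slack incoming webhooks.

```python
import json
import urllib.request
from typing import Optional


def build_spend_alert(weekly_spend_usd: float, threshold_usd: float) -> Optional[dict]:
    """Return a Slack payload if spend exceeds the threshold, else None."""
    if threshold_usd <= 0:
        raise ValueError("threshold_usd must be positive")
    if weekly_spend_usd <= threshold_usd:
        return None
    over_pct = (weekly_spend_usd - threshold_usd) / threshold_usd * 100
    return {
        "text": (
            f"Weekly cloud spend ${weekly_spend_usd:,.2f} exceeds "
            f"threshold ${threshold_usd:,.2f} ({over_pct:.1f}% over)"
        )
    }


def post_to_slack(webhook_url: str, payload: dict) -> int:
    """POST the payload to a Slack incoming webhook and return the HTTP status."""
    request = urllib.request.Request(
        webhook_url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Keeping the message construction separate from the HTTP call makes the threshold logic testable without a live webhook.
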
Problem: Event-driven microservices fail silently under Kafka lag spikes, causing downstream data loss.
Architecture:
- KEDA for Kafka-lag-based autoscaling of consumer pods
- Karpenter for dynamic node provisioning (scale-in within 2 min of nodes going idle)
- Prometheus + Alertmanager for metrics and alert routing
- ArgoCD for GitOps-based continuous deployment
- Chaos Engineering with Chaos Monkey for periodic failure injection tests
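
The arithmetic behind lag-based autoscaling can be sketched as follows: the desired consumer count is the total lag divided by a per-replica lag target, rounded up, then capped by the partition count and the replica bounds. This approximates what KEDA's Kafka scaler and the HPA do together; the real behaviour depends on scaler settings such as `lagThreshold` and `allowIdleConsumers`, and the function name and defaults here are assumptions.

```python
import math


def desired_consumers(
    total_lag: int,
    lag_threshold: int,
    partitions: int,
    min_replicas: int = 0,
    max_replicas: int = 100,
) -> int:
    """Estimate how many consumer pods are needed to work off the current lag."""
    if lag_threshold <= 0:
        raise ValueError("lag_threshold must be positive")
    wanted = math.ceil(total_lag / lag_threshold)
    # Consumers beyond the partition count would sit idle in a consumer group.
    wanted = min(wanted, partitions)
    return max(min_replicas, min(wanted, max_replicas))


print(desired_consumers(total_lag=950, lag_threshold=100, partitions=12))   # 10
print(desired_consumers(total_lag=5000, lag_threshold=100, partitions=12))  # 12
```

The partition cap is why a lag spike on a topic with few partitions cannot be solved by adding pods alone.
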
Key decisions:
- KEDA over HPA because HPA can't natively scale on external event sources like Kafka
- ArgoCD's sync waves used to enforce deployment ordering (infra → services → consumers)
Stack: Kubernetes · KEDA · Karpenter · ArgoCD · Helm · Prometheus · Grafana · Chaos Monkey
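
The deployment ordering that sync waves provide can be illustrated by grouping manifests on the `argocd.argoproj.io/sync-wave` annotation, lowest wave first. This sketch only models the wave ordering (Argo CD also considers sync phases, kinds, and names), and the manifest names in the example are hypothetical.

```python
from itertools import groupby

SYNC_WAVE = "argocd.argoproj.io/sync-wave"


def wave_of(manifest: dict) -> int:
    """Read the sync wave from a manifest's annotations; missing means wave 0."""
    annotations = manifest.get("metadata", {}).get("annotations") or {}
    return int(annotations.get(SYNC_WAVE, "0"))


def plan_waves(manifests: list[dict]) -> list[tuple[int, list[str]]]:
    """Group manifest names by wave, in the order the waves would be applied."""
    ordered = sorted(manifests, key=wave_of)
    return [
        (wave, [m["metadata"]["name"] for m in group])
        for wave, group in groupby(ordered, key=wave_of)
    ]


manifests = [
    {"metadata": {"name": "orders-consumer", "annotations": {SYNC_WAVE: "2"}}},
    {"metadata": {"name": "kafka-topics"}},  # no annotation, so wave 0
    {"metadata": {"name": "orders-api", "annotations": {SYNC_WAVE: "1"}}},
]
print(plan_waves(manifests))
# [(0, ['kafka-topics']), (1, ['orders-api']), (2, ['orders-consumer'])]
```
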
| Area | Technologies |
|---|---|
| IaC | Terraform, Ansible, Pulumi |
| CI/CD | GitHub Actions, Jenkins, ArgoCD |
| Containers | Docker, Kubernetes, Helm, Karpenter |
| Observability | Prometheus, Grafana, Loki, Alertmanager |
| Cloud | AWS (EKS, Lambda, Glue, RDS), GCP (GKE), Azure (AKS) |
| Security | Vault, OPA, SOPS, Trivy |
| Cost | Infracost, AWS Cost Explorer integration |
📧 stephengachoka57@gmail.com | 🌐 stephengachoka.co.ke | 📍 Nairobi, Kenya