Monitor what actually matters
HTTP 200 responses don't mean your application works. Your health checks should validate:
#!/bin/bash
# Real health check example
mysql -h $DB_HOST -u $USER -p$PASS -e "SELECT 1" > /dev/null || exit 1
response_time=$(curl -w "%{time_total}" -s -o /dev/null $API_ENDPOINT)
if (( $(echo "$response_time > 2.0" | bc -l) )); then exit 1; fi
curl -f $PAYMENT_GATEWAY/health || exit 1
echo "Systems operational"
Test database connections, API performance, and critical business functions. A login endpoint that returns 200 but can't authenticate users is still broken.
Communicate constantly during incidents
Post updates every 15 minutes, even if nothing changed. Use dedicated incident channels, not your general engineering chat. Include:
- Current status
- Actions in progress
- Next steps
- Time estimate
Silence creates panic. Panic creates interruptions. Interruptions extend downtime.
Master zero-downtime migrations
Handle databases with dual-write patterns
Database migrations break most zero-downtime attempts. Use dual-write strategies during cutover:
class OrderService {
public function createOrder($data) {
$result = $this->primaryDb->insert($data);
if ($this->migrationMode) {
try {
$this->newDb->insert($data);
} catch (Exception $e) {
$this->logger->error('Migration write failed', $e);
}
}
return $result;
}
}
Write to both databases simultaneously. Validate data continuously with checksums and record counts. Always plan your rollback strategy before starting.
Route traffic gradually
Never flip 100% of traffic instantly. Start with 1% to your new system, monitor error rates and latency, then increase incrementally. Use feature flags or load balancer weights for control.
Immediate rollback capability is non-negotiable.
Build circuit breakers everywhere
Systems should degrade gracefully, not collapse entirely. Circuit breakers prevent cascade failures. Your checkout should work even if product recommendations fail.
Partial functionality beats complete outages every time.
Create actionable documentation
Write service-specific runbooks
Generic procedures waste precious time during incidents. Build runbooks for each critical service with:
- Common failure symptoms
- Diagnostic commands
- Step-by-step recovery procedures
- Decision trees for different scenarios
Test these during postmortems to keep them current.
Define incident severity levels
| Severity |
Response Time |
Who Gets Notified |
| Critical |
5 minutes |
Everyone |
| High |
15 minutes |
Engineering leads |
| Medium |
1 hour |
Team only |
| Low |
Next day |
Logged for review |
A minor API slowdown shouldn't trigger the same response as a payment system failure.
Test everything beforehand
Zero-downtime migration requires extensive testing with production-like data volumes. Include network latency simulation and dependency failures in your test scenarios.
Test your rollback procedures regularly and document realistic time estimates for each step.
Monitor business metrics, not just infrastructure
Track order completion rates, login success, payment processing alongside traditional server metrics. A successful migration means business functions remain stable, not just that servers stayed up.
Learn from every incident
Conduct blameless postmortems within 48 hours. Focus on process improvements, not individual blame. Update runbooks and monitoring based on lessons learned.
Track recurring issues and fix them permanently rather than repeatedly applying band-aids.
Getting started
Implement escalation paths and better health checks first. These provide immediate value without architectural changes.
Document your existing informal processes next. Many teams already follow some practices but lack written procedures that work when key people are unavailable.
Gradually improve monitoring and alerting, starting with your most critical business functions.
The goal isn't perfection, it's resilience. Build systems and processes that handle failure gracefully rather than trying to prevent every possible issue.
Originally published on binadit.com