Production checklist for incident management and zero-downtime migration

Monitor what actually matters

HTTP 200 responses don't mean your application works. Your health checks should validate:

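#!/bin/bash
# Real health check example: exits non-zero as soon as any dependency is unhealthy.
# Assumes DB_HOST, DB_USER, DB_PASS, API_ENDPOINT and PAYMENT_GATEWAY are set.
# (DB_USER/DB_PASS avoid clashing with $USER, which the shell sets to your login name.)
# 1. The database must accept a connection and answer a trivial query.
mysql -h "$DB_HOST" -u "$DB_USER" -p"$DB_PASS" -e "SELECT 1" > /dev/null || exit 1
# 2. The API must return a success status (-f fails on HTTP errors) within 2 seconds.
response_time=$(curl -f -s -o /dev/null -w "%{time_total}" "$API_ENDPOINT") || exit 1
if (( $(echo "$response_time > 2.0" | bc -l) )); then exit 1; fi
# 3. The payment gateway health endpoint must return a success status.
curl -f -s -o /dev/null "$PAYMENT_GATEWAY/health" || exit 1
echo "Systems operational"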

Test database connections, API performance, and critical business functions. A login endpoint that returns 200 but can't authenticate users is still broken.
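One way to cover the login case is to sign in with a dedicated test account and inspect the response body instead of the status code. The sketch below assumes a JSON login endpoint that returns a non-empty `token` field. `LOGIN_URL`, `CHECK_USER`, `CHECK_PASS` and both function names are hypothetical, so adapt them to your own API.

```shell
#!/bin/bash
# Sketch only: the "token" field, LOGIN_URL, CHECK_USER and CHECK_PASS are
# assumptions for illustration, not part of any real API.

# Succeeds only if the body contains a non-empty "token" string field.
login_body_ok() {
  printf '%s' "$1" | grep -Eq '"token"[[:space:]]*:[[:space:]]*"[^"]+"'
}

# Performs a real login with a synthetic test account, then validates the
# response body rather than trusting the HTTP status alone.
check_login() {
  local body
  body=$(curl -f -s --max-time 5 \
    -H 'Content-Type: application/json' \
    -d "{\"username\":\"$CHECK_USER\",\"password\":\"$CHECK_PASS\"}" \
    "$LOGIN_URL") || return 1
  login_body_ok "$body"
}
```

Calling `check_login || exit 1` from your health check script would then fail the check when the endpoint answers 200 but issues no token.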

Communicate constantly during incidents

Post updates every 15 minutes, even if nothing changed. Use dedicated incident channels, not your general engineering chat. Include:

Current status
Actions in progress
Next steps
Time estimate

Silence creates panic. Panic creates interruptions. Interruptions extend downtime.
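A small helper can keep every update in the same shape, so nobody has to think about formatting mid-incident. This is a minimal sketch: the function name, the field labels and the sample values are assumptions, and you would pipe the output to whatever your incident channel accepts.

```shell
#!/bin/bash
# Sketch only: labels and layout are illustrative, not a standard format.

# Prints a four-field incident update with a UTC timestamp.
# Usage: format_incident_update STATUS ACTIONS NEXT_STEPS ESTIMATE
format_incident_update() {
  printf 'Incident update (%s UTC)\n' "$(date -u '+%Y-%m-%d %H:%M')"
  printf 'Current status: %s\n' "$1"
  printf 'Actions in progress: %s\n' "$2"
  printf 'Next steps: %s\n' "$3"
  printf 'Time estimate: %s\n' "$4"
}

# Sample values, made up for the example.
format_incident_update \
  "Checkout degraded, no change since last update" \
  "Restarting payment workers" \
  "Verify the retry queue drains" \
  "Next update in 15 minutes"
```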

Master zero-downtime migrations

Handle databases with dual-write patterns

Database migrations break most zero-downtime attempts. Use dual-write strategies during cutover:

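class OrderService {
    // Dependencies are injected; their concrete types are assumptions (needs PHP 8).
    public function __construct(private $primaryDb, private $newDb,
                                private $logger, private bool $migrationMode = false) {}

    public function createOrder($data) {
        // The primary database stays the source of truth; its errors reach the caller.
        $result = $this->primaryDb->insert($data);

        if ($this->migrationMode) {
            try {
                // Mirror the write to the new database during the migration.
                $this->newDb->insert($data);
            } catch (Exception $e) {
                // A failed mirror write must not fail the request; log it for reconciliation.
                // Assumes a PSR-3 style logger, which takes a context array.
                $this->logger->error('Migration write failed', ['exception' => $e]);
            }
        }

        return $result;
    }
}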

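Write to both databases on every request, keeping the primary as the source of truth so that a failed write to the new database never fails the request. Validate data continuously with checksums and record counts. Always plan your rollback strategy before starting.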
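The validation step can be sketched as a comparison of two exports. This assumes you have dumped the same table from each database to a text file with one row per line; the file layout and function names are hypothetical. Rows are sorted before hashing so that a different row order does not count as a mismatch.

```shell
#!/bin/bash
# Sketch only: assumes one-row-per-line exports of the same table from the
# old and new databases. File names and function names are hypothetical.

# Number of rows in an export.
record_count() {
  wc -l < "$1" | tr -d ' '
}

# Order-independent checksum. cksum is the POSIX CRC tool; swap in a
# stronger hash such as sha256sum if it is available on your hosts.
export_checksum() {
  LC_ALL=C sort "$1" | cksum | cut -d ' ' -f 1
}

# Returns 0 when both exports hold the same rows, 1 otherwise.
exports_match() {
  local old="$1" new="$2"
  if [ "$(record_count "$old")" != "$(record_count "$new")" ]; then
    echo "Record count mismatch" >&2
    return 1
  fi
  if [ "$(export_checksum "$old")" != "$(export_checksum "$new")" ]; then
    echo "Checksum mismatch" >&2
    return 1
  fi
  return 0
}
```

Checking the count first gives a cheaper and more readable failure than a checksum mismatch alone. On large tables you would likely run this per ID range or per day so that a mismatch points at a small slice of data.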

Route traffic gradually

Never flip 100% of traffic instantly. Start with 1% to your new system, monitor error rates and latency, then increase incrementally. Use feature flags or load balancer weights for control.

Immediate rollback capability is non-negotiable.
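If you control routing in application code or a deploy script rather than at the load balancer, percentage rollout can be sketched with stable hashing. The function names and the idea of keying on a user ID are assumptions for illustration; a feature flag service or load balancer weights achieve the same result.

```shell
#!/bin/bash
# Sketch only: percentage rollout by hashing a user ID. Names are hypothetical.

# Maps a user ID to a stable bucket in the range 0-99.
user_bucket() {
  local crc
  crc=$(printf '%s' "$1" | cksum | cut -d ' ' -f 1)
  echo $(( crc % 100 ))
}

# Prints "new" if the user falls inside the rollout percentage, else "old".
# Usage: route_for USER_ID ROLLOUT_PERCENT
route_for() {
  if [ "$(user_bucket "$1")" -lt "$2" ]; then
    echo "new"
  else
    echo "old"
  fi
}
```

Because each user's bucket never changes, raising the percentage only moves users from the old system to the new one, never back and forth. Setting the percentage to 0 sends everyone back to the old system, which is your immediate rollback.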

Build circuit breakers everywhere

Systems should degrade gracefully, not collapse entirely. A circuit breaker stops calling a dependency after repeated failures and serves a fallback instead, which prevents cascading failures. Your checkout should work even if product recommendations fail.

Partial functionality beats complete outages every time.
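The mechanics fit in a few lines. This is a minimal sketch that keeps breaker state in shell variables for a single long-running script; the thresholds, variable names and function name are assumptions, and a real service would normally use a breaker from its own language or framework.

```shell
#!/bin/bash
# Sketch only: in-process circuit breaker. Thresholds are illustrative.
CB_THRESHOLD=3    # consecutive failures before the breaker opens
CB_COOLDOWN=30    # seconds to wait before trying the dependency again
CB_FAILURES=0
CB_OPENED_AT=0

# Usage: call_with_breaker PRIMARY_FN FALLBACK_FN
# Call it directly, not inside $(...), or the state updates are lost
# in the subshell.
call_with_breaker() {
  local primary="$1" fallback="$2" now
  now=$(date +%s)
  # Open: skip the dependency entirely until the cooldown has elapsed.
  if [ "$CB_FAILURES" -ge "$CB_THRESHOLD" ] &&
     [ $(( now - CB_OPENED_AT )) -lt "$CB_COOLDOWN" ]; then
    "$fallback"
    return
  fi
  # Closed, or cooldown elapsed (half-open): try the dependency once.
  if "$primary"; then
    CB_FAILURES=0
  else
    CB_FAILURES=$(( CB_FAILURES + 1 ))
    if [ "$CB_FAILURES" -ge "$CB_THRESHOLD" ]; then
      CB_OPENED_AT=$now
    fi
    "$fallback"
  fi
}
```

For the checkout example, the primary function would fetch recommendations and the fallback would print an empty or cached list, so the page still renders while the recommendation service is down.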

Create actionable documentation

Write service-specific runbooks

Generic procedures waste precious time during incidents. Build runbooks for each critical service with:

Common failure symptoms
Diagnostic commands
Step-by-step recovery procedures
Decision trees for different scenarios

Test these during postmortems to keep them current.

Define incident severity levels

Severity	Response Time	Who Gets Notified
Critical	5 minutes	Everyone
High	15 minutes	Engineering leads
Medium	1 hour	Team only
Low	Next day	Logged for review

A minor API slowdown shouldn't trigger the same response as a payment system failure.
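Encoding the table in your alerting scripts keeps the response consistent with the written policy. The sketch below mirrors the four rows above; the function name and the `|` separated output format are assumptions.

```shell
#!/bin/bash
# Sketch only: mirrors the severity table above. The output format
# "RESPONSE_TIME|WHO_GETS_NOTIFIED" is an assumption for illustration.

# Usage: severity_policy SEVERITY   (case-insensitive)
severity_policy() {
  case "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')" in
    critical) echo "5 minutes|Everyone" ;;
    high)     echo "15 minutes|Engineering leads" ;;
    medium)   echo "1 hour|Team only" ;;
    low)      echo "Next day|Logged for review" ;;
    *)        echo "Unknown severity: $1" >&2; return 1 ;;
  esac
}
```

Rejecting unknown levels with a non-zero exit forces whoever adds a new severity to decide who gets notified, instead of silently falling through to a default.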

Test everything beforehand

Zero-downtime migration requires extensive testing with production-like data volumes. Include network latency simulation and dependency failures in your test scenarios.

Test your rollback procedures regularly and document realistic time estimates for each step.
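To get realistic time estimates, measure each rollback step during a rehearsal instead of guessing. A minimal sketch, assuming each step can be run as a single command; the function name and output format are hypothetical.

```shell
#!/bin/bash
# Sketch only: times one rollback step with whole-second resolution.

# Runs a command, prints how long it took, and preserves its exit status.
# Usage: time_step "step name" command [args...]
time_step() {
  local name="$1" start end status
  shift
  start=$(date +%s)
  "$@"
  status=$?
  end=$(date +%s)
  printf '%s: %ss (exit %s)\n' "$name" "$(( end - start ))" "$status"
  return "$status"
}
```

Wrapping each command of a rehearsal in `time_step` and appending the output to the runbook gives you measured durations to quote during a real incident.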

Monitor business metrics, not just infrastructure

Track order completion rates, login success rates, and payment processing success alongside traditional server metrics. A successful migration means business functions remain stable, not just that servers stayed up.
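A business metric check can be as simple as a success ratio compared against a floor. In this sketch the counts would come from your own database or metrics store; the function names and the example threshold are assumptions.

```shell
#!/bin/bash
# Sketch only: success-rate check for a business metric such as order
# completion. How you obtain the counts is up to your own tooling.

# Prints the success rate as a percentage with one decimal place.
# Usage: success_rate SUCCEEDED TOTAL
success_rate() {
  LC_ALL=C awk -v ok="$1" -v total="$2" \
    'BEGIN { if (total == 0) { print "0.0"; exit } printf "%.1f\n", ok * 100 / total }'
}

# Returns 0 when the rate is at or above MIN_PERCENT.
# Zero total attempts counts as unhealthy: no orders at all is itself a signal.
# Usage: rate_healthy SUCCEEDED TOTAL MIN_PERCENT
rate_healthy() {
  LC_ALL=C awk -v ok="$1" -v total="$2" -v min="$3" \
    'BEGIN { exit !(total > 0 && ok * 100 / total >= min) }'
}
```

For example, `rate_healthy 970 1000 95` succeeds, while `rate_healthy 900 1000 95` fails and could trigger an alert during a migration.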

Learn from every incident

Conduct blameless postmortems within 48 hours. Focus on process improvements, not individual blame. Update runbooks and monitoring based on lessons learned.

Track recurring issues and fix them permanently rather than repeatedly applying band-aids.

Getting started

Implement escalation paths and better health checks first. These provide immediate value without architectural changes.

Document your existing informal processes next. Many teams already follow some practices but lack written procedures that work when key people are unavailable.

Gradually improve monitoring and alerting, starting with your most critical business functions.

The goal isn't perfection; it's resilience. Build systems and processes that handle failure gracefully rather than trying to prevent every possible issue.

Originally published on binadit.com