Disaster Recovery SOP

Version: 1.0
Owner: DevOps Team

Purpose

Define the disaster recovery plan, including RTO/RPO targets per service tier, failover procedures, and recovery drills.

Definitions

Term	Meaning
RTO	Recovery Time Objective — max acceptable downtime
RPO	Recovery Point Objective — max acceptable data loss
DR	Disaster Recovery
HA	High Availability
MTTR	Mean Time to Recover

Objectives

Tier	RTO	RPO	Example
Critical	< 1 hour	< 15 min	Production websites, payment services
High	< 4 hours	< 1 hour	Customer-facing apps
Medium	< 24 hours	< 24 hours	Internal tools
Low	< 72 hours	< 72 hours	Logs, archives

Disaster Scenarios

Scenario	Response
Server failure	Failover to standby / restore from backup
Data corruption	Restore database from latest backup
Security breach	Isolate server, restore from clean backup
Region outage	DNS switch to DR region
Accidental deletion	Restore from backup (hourly/daily)

Procedure

1. Detection & Declaration

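# Probe the health endpoint
# -f: exit non-zero on HTTP errors, -sS: quiet but still print errors, --max-time: do not hang
curl -fsS --max-time 10 -o /dev/null https://example.com/health || echo "SITE DOWN"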

# When a monitoring alert fires, the on-call engineer:

# 1. Confirms the outage
# 2. Declares a DR event in the #incidents channel
# 3. Notifies the DR team
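
A single failed request can be a transient blip, so confirming the outage usually means seeing several consecutive failures. The following is a minimal sketch of that idea, assuming the same /health endpoint as above. The function names, the default of 3 attempts, and the 5-second delay are illustrative choices and not values set by this SOP.

```shell
# probe URL: one health probe. Kept as its own function so it can be swapped out.
probe() {
    # -f: fail on HTTP >= 400, -sS: quiet but show errors, --max-time: bound each probe
    curl -fsS --max-time 10 -o /dev/null "$1"
}

# confirm_outage URL [ATTEMPTS] [DELAY_SECONDS]   (hypothetical helper)
# Returns 0 (outage confirmed) only if every attempt fails.
# Returns 1 as soon as one attempt succeeds.
confirm_outage() {
    local url attempts delay i
    url=$1
    attempts=${2:-3}
    delay=${3:-5}
    i=1
    while [ "$i" -le "$attempts" ]; do
        if probe "$url"; then
            echo "attempt $i: OK, no outage"
            return 1
        fi
        echo "attempt $i: FAILED"
        # Wait between attempts, but not after the last one
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    echo "SITE DOWN: $attempts consecutive failures for $url"
    return 0
}
```

Possible usage: `confirm_outage https://example.com/health 3 5 && echo "Declare DR event in #incidents"`.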

2. Assess Impact

Which services are affected?
How much of the RTO budget remains?
Is this a partial or full disaster?
Can it be resolved without full DR?
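
The question of how much RTO remains is simple arithmetic: the tier's RTO minus the time elapsed since the outage began. A small sketch, assuming timestamps are Unix epoch seconds; the function name is hypothetical and not one of the scripts listed in this SOP.

```shell
# rto_remaining_minutes OUTAGE_START_EPOCH RTO_MINUTES [NOW_EPOCH]
# Prints the minutes left in the RTO budget.
# A negative number means the RTO has already been breached.
rto_remaining_minutes() {
    local start rto_min now elapsed_min
    start=$1
    rto_min=$2
    now=${3:-$(date +%s)}          # default to the current time
    elapsed_min=$(( (now - start) / 60 ))
    echo $(( rto_min - elapsed_min ))
}

# Example: Critical tier (RTO 60 min), outage started 1200 seconds before "now"
rto_remaining_minutes 1000 60 2200   # prints 40
```

A low or negative result is one input to the "can it be resolved without full DR?" decision: the less budget remains, the stronger the case for activating the DR plan instead of continuing to troubleshoot.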

3. Activate DR Plan

Server Failure

# Option A: Fail over to the standby server
# 1. Mount latest backup volumes on the standby
# 2. Update DNS to standby IP

# Option B: Restore from backup
# 1. Provision new server
# 2. Restore latest backup
# 3. Update DNS to the new server IP

Database Failure

# Restore from S3 backup
aws s3 cp s3://backups/db/latest.sql .
mysql -u root -p < latest.sql

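# Or promote replica
# (On MySQL 8.0.22+ the equivalent statements are STOP REPLICA / RESET REPLICA ALL)
mysql -e "STOP SLAVE;"
mysql -e "RESET SLAVE ALL;"
# If the replica runs with read_only enabled, disable it before sending writes
# Update app config to point to promoted replica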
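
Before piping a downloaded dump into mysql, it can help to check that the file is not empty or truncated. The sketch below assumes the backup was produced by mysqldump with comments enabled, in which case the file ends with a "-- Dump completed" line; that is an assumption about the backup job, not something this SOP states. The function name is hypothetical.

```shell
# dump_looks_complete FILE   (hypothetical helper)
# Returns 0 if FILE is non-empty and one of its last lines starts with
# "-- Dump completed", the trailer mysqldump writes when comments are enabled.
dump_looks_complete() {
    local file
    file=$1
    if [ ! -s "$file" ]; then
        echo "empty or missing: $file" >&2
        return 1
    fi
    if tail -n 5 "$file" | grep -q '^-- Dump completed'; then
        return 0
    fi
    echo "no completion marker, dump may be truncated: $file" >&2
    return 1
}
```

Possible usage: `dump_looks_complete latest.sql && mysql -u root -p < latest.sql`. A dump taken with comments disabled would fail this check even when intact, so the marker test only applies where the assumption holds.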

Full Region Outage (Cloud)

# Deploy to secondary region
terraform apply -var-file=dr.tfvars

# Update DNS to DR region
# Verify failover
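
Verifying a DNS failover usually means waiting until the record actually resolves to the DR address, since cached answers can persist until the TTL expires. A minimal sketch, assuming `dig` is installed and the service uses an A record; the function names, the retry defaults, and the IP address used in the usage line are placeholders.

```shell
# resolve_a HOST: print the A records for HOST, one per line (assumes dig is available)
resolve_a() {
    dig +short A "$1"
}

# wait_for_dns HOST EXPECTED_IP [ATTEMPTS] [DELAY_SECONDS]   (hypothetical helper)
# Returns 0 once EXPECTED_IP appears among HOST's A records, 1 if it never does.
wait_for_dns() {
    local host want attempts delay i
    host=$1
    want=$2
    attempts=${3:-30}
    delay=${4:-10}
    i=1
    while [ "$i" -le "$attempts" ]; do
        # -x: match the whole line, -F: treat the IP as a literal string
        if resolve_a "$host" | grep -qxF "$want"; then
            echo "$host now resolves to $want"
            return 0
        fi
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    echo "$host did not resolve to $want after $attempts attempts" >&2
    return 1
}
```

Possible usage: `wait_for_dns example.com 203.0.113.10 30 10`. This only reflects what the local resolver sees; clients behind other caches may switch over later.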

4. Failover Verification

# Check service health (-f makes curl exit non-zero on HTTP errors)
curl -fsSI https://example.com

# Check database
php -r "new PDO('mysql:host=...;dbname=...', 'user', 'pass'); echo 'OK\n';"

# Check monitoring: confirm alerts have cleared and dashboards show the failover target

5. Restoration

Once primary is restored:

# Sync data back
# Trailing slashes copy the contents of /data; without them rsync creates /data/data
rsync -avz dr-server:/data/ /data/

# Verify primary is healthy
# Switch DNS back to primary
# Decommission DR resources
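
Before syncing data back, a dry run shows what rsync would change without touching the primary. The sketch below uses rsync's `-n` (dry run) and `-i` (itemize changes) options and normalizes trailing slashes, because rsync copies the directory itself when the source has no trailing slash and only its contents when it does. The function name is hypothetical.

```shell
# preview_sync SRC DST   (hypothetical helper)
# Lists what rsync would transfer from SRC to DST without changing anything.
# A trailing slash is forced on both paths so the contents of SRC map onto DST.
preview_sync() {
    local src dst
    src=${1%/}/
    dst=${2%/}/
    # -a archive mode, -z compress in transit, -n dry run, -i itemize each change
    rsync -azni "$src" "$dst"
}
```

Possible usage: `preview_sync dr-server:/data /data`, then run the real sync once the listed changes look as expected.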

DR Testing Schedule

Test Type	Frequency	Scope
Database restore	Monthly	Single DB restore
Server rebuild	Quarterly	Full server provisioning
DNS failover	Semi-annually	Full DR drill
Region failover	Annually	Cross-region DR
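
The schedule above can be checked mechanically if the date of each drill's last run is recorded. A small sketch, assuming last-run times are stored as Unix epoch seconds and each frequency is approximated in days (for example 30 for monthly, 90 for quarterly, 365 for annually); both the storage format and the function name are assumptions.

```shell
# drill_overdue LAST_RUN_EPOCH INTERVAL_DAYS [NOW_EPOCH]   (hypothetical helper)
# Returns 0 if more than INTERVAL_DAYS whole days have passed since the last run.
drill_overdue() {
    local last interval now age_days
    last=$1
    interval=$2
    now=${3:-$(date +%s)}          # default to the current time
    age_days=$(( (now - last) / 86400 ))
    [ "$age_days" -gt "$interval" ]
}
```

Possible usage: `drill_overdue "$last_db_restore" 30 && echo "Database restore drill is overdue"`, where `last_db_restore` is a placeholder variable holding the recorded timestamp.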

DR Drill Checklist

DR Kit Location

Item	Location
Backups	S3 (encrypted, cross-region)
Terraform state	S3 backend
Docker images	Registry (multi-region)
Configuration	Vault / encrypted repo
DR runbook	This document
Access credentials	Password manager

Recovery Scripts

Store automation scripts in scripts/dr/:

scripts/dr/
├── restore-db.sh        # Database restore
├── failover-dns.sh      # DNS update
├── provision-server.sh  # Terraform apply
└── verify-health.sh     # Health checks

Verification

RTO/RPO defined for each service tier
Backups stored off-site / cross-region
DR plan tested within last 6 months
Recovery scripts run successfully end to end
DR contact list current
DR documentation reviewed annually