Case study · B2B SaaS · Kubernetes · 4 months
SaaS Migration — MTTR Down 45%
SLOs, observability, and automated runbooks sped up incident recovery and built confidence in the platform.
Context
- Lift-and-shift migration exposed hidden coupling and alert fatigue.
- Missing SLOs made prioritization and stakeholder communication difficult.
Actions
- Defined user-centric SLOs and visualized error budgets per service.
- Added distributed tracing and log sampling; removed noisy alerts.
- Codified incident runbooks; introduced game days and postmortems.
Process
Week 1–2: Discovery & SLO Definition
- Stakeholder interviews to map critical user journeys and reliability goals.
- Drafted SLIs for availability, latency, and correctness for the top 3 journeys.
- Aligned targets with product and support; published initial SLOs with error budgets.
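The relationship between an SLO target and its error budget can be sketched in a few lines. This is a minimal illustration that assumes a request-based availability SLO; the `Slo` class, the function names, the service name, and the traffic numbers are hypothetical and do not come from the engagement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A request-based SLO: the fraction of requests that must be good."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must succeed

    def __post_init__(self) -> None:
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def error_budget(slo: Slo, total_requests: int) -> float:
    """Requests allowed to fail in the window without breaching the SLO."""
    return (1.0 - slo.target) * total_requests


def budget_remaining(slo: Slo, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget(slo, total_requests)
    if budget == 0:
        # No traffic in the window, so nothing has been spent.
        return 1.0
    return 1.0 - failed_requests / budget


# Hypothetical numbers for illustration only.
checkout = Slo(name="checkout-availability", target=0.999)
print(round(error_budget(checkout, 1_000_000)))               # 1000
print(round(budget_remaining(checkout, 1_000_000, 250), 2))   # 0.75
```

Here a 99.9% target over one million requests leaves room for about 1,000 failures; after 250 failures, three quarters of the budget remain.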
Week 3–4: Observability Baseline
- Implemented tracing across gateway and top services; standardized log fields.
- Set up exemplar dashboards for latency distributions and saturation signals.
- Introduced synthetic checks for critical workflows and regions.
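As an illustration of what standardized log fields might look like, the sketch below uses Python's standard `logging` module to emit every record as JSON with the same top-level keys. The field names (`ts`, `level`, `service`, `trace_id`, `message`) and the `checkout` logger are assumptions chosen for the example, not the schema used in this project.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render every record with the same top-level fields."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            # `service` and `trace_id` are hypothetical fields passed via `extra=`.
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Carrying the trace ID in each line lets a log entry be joined to its trace.
logger.info("payment authorized", extra={"service": "checkout", "trace_id": "abc123"})
```

Keeping the trace identifier in a fixed field is what makes it possible to jump from a log line to the corresponding distributed trace.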
Week 5–7: Alert Hygiene & On‑call
- Eliminated non-actionable alerts; tied alerts to SLO burn rates.
- Created escalation policies and ownership maps; rotated engineers through shadow on-call.
- Documented runbooks for top 5 recurring incidents with verification steps.
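Tying alerts to SLO burn rates can be sketched as follows. Burn rate is the observed error rate divided by the error rate the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. The two-window check and the 14.4 threshold follow a commonly cited pattern (14.4x sustained for one hour consumes 2% of a 30-day budget); the window sizes, threshold, and sample counts are assumptions, not the values used in this project.

```python
from typing import Tuple

Window = Tuple[int, int]  # (error_count, total_count) observed in a time window


def burn_rate(errors: int, total: int, slo_target: float) -> float:
    """How many times faster than 'sustainable' the error budget is being spent."""
    if total == 0:
        return 0.0
    allowed_error_rate = 1.0 - slo_target
    return (errors / total) / allowed_error_rate


def should_page(long_window: Window, short_window: Window,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast.

    The long window shows the problem is significant; the short window
    shows it is still happening, which avoids paging for resolved incidents.
    """
    long_burn = burn_rate(*long_window, slo_target)
    short_burn = burn_rate(*short_window, slo_target)
    return long_burn >= threshold and short_burn >= threshold


# Hypothetical counts: (errors, total) for a 1-hour and a 5-minute window.
print(should_page((180, 10_000), (20, 1_000), slo_target=0.999))  # True
print(should_page((180, 10_000), (1, 1_000), slo_target=0.999))   # False
```

The second call returns `False` because the short window shows the error rate has already dropped, so paging a human would not be actionable.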
Week 8–10: Resilience & Cost
- Introduced circuit breakers and bulkheads on noisy dependencies.
- Right-sized workloads; moved batch jobs to off-peak hours, cutting spend by 18%.
- Established change freeze windows and feature flags for risky deployments.
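For readers unfamiliar with the circuit breaker pattern, the sketch below shows the core idea: after a run of consecutive failures, calls to a dependency are rejected immediately for a cooldown period, then a single probe call decides whether to resume. The class name, the thresholds, and the injectable `clock` parameter are all hypothetical; a production system would more likely rely on a service mesh or an established library than on hand-written code like this.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # cooldown elapsed: allow a probe call
        return "open"

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "open":
            raise CircuitOpenError("dependency calls are suspended")
        try:
            result = fn()
        except Exception:
            self._failures += 1
            # Trip on reaching the threshold, or re-trip after a failed probe.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the breaker and clears the consecutive-failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Failing fast while the breaker is open keeps a slow or failing dependency from tying up the caller's threads and connections, which is the same goal bulkheads pursue by isolating resource pools.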
Week 11–12: Game Day & Handover
- Ran failure injection scenarios; measured time to detect, diagnose, and recover.
- Closed gaps from postmortems; trained leads on SLO reviews and budget policies.
- Delivered a handover checklist and 90-day reliability roadmap, both approved by stakeholders.
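The game-day measurements can be derived from four timestamps per incident. The sketch below is one way to compute them, assuming MTTR is measured from the start of impact to recovery (some teams measure from detection instead); the class, the field names, and the sample timelines are hypothetical and are not data from this engagement.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Iterable


@dataclass(frozen=True)
class IncidentTimeline:
    started: datetime    # fault injected or user impact begins
    detected: datetime   # alert fired or a human noticed
    diagnosed: datetime  # cause identified
    recovered: datetime  # service back within its SLO

    @property
    def time_to_detect(self) -> timedelta:
        return self.detected - self.started

    @property
    def time_to_diagnose(self) -> timedelta:
        return self.diagnosed - self.detected

    @property
    def time_to_recover(self) -> timedelta:
        # Assumption: recovery time is measured from the start of impact.
        return self.recovered - self.started


def mttr_minutes(incidents: Iterable[IncidentTimeline]) -> float:
    """Mean time to recover, in minutes, across the given incidents."""
    return mean(i.time_to_recover.total_seconds() for i in incidents) / 60


# Hypothetical game-day timelines for illustration only.
drills = [
    IncidentTimeline(datetime(2024, 1, 10, 10, 0), datetime(2024, 1, 10, 10, 4),
                     datetime(2024, 1, 10, 10, 19), datetime(2024, 1, 10, 10, 30)),
    IncidentTimeline(datetime(2024, 1, 10, 14, 0), datetime(2024, 1, 10, 14, 2),
                     datetime(2024, 1, 10, 14, 10), datetime(2024, 1, 10, 14, 20)),
]
print(mttr_minutes(drills))  # 25.0
```

Splitting recovery into detection and diagnosis phases shows where time is lost, so follow-up work can target alerting, runbooks, or remediation as appropriate.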
Results
- MTTR reduced by 45%, helped by clearer on-call guidance.
- Alert noise down 60%; engineers report higher focus time.
- All four top-level SLOs consistently met for two quarters.
MTTR -45% · Alert noise -60% · SLOs met 4/4