Case study · B2B SaaS · Kubernetes · 4 months

SaaS Migration — MTTR Down 45%

SLOs, observability, and automated runbooks shortened recovery times and raised confidence in operations.

Context

  • Lift-and-shift migration exposed hidden coupling and alert fatigue.
  • Missing SLOs made prioritization and stakeholder communication difficult.

Actions

  • Defined user-centric SLOs and visualized error budgets per service.
  • Added distributed tracing and log sampling; removed noisy alerts.
  • Codified incident runbooks; introduced game days and postmortems.

Process

Week 1–2: Discovery & SLO Definition

  • Interviewed stakeholders to map critical user journeys and reliability goals.
  • Drafted SLIs for availability, latency, and correctness for the top 3 journeys.
  • Aligned targets with product and support; published initial SLOs with error budgets.
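To make the relationship between an SLO target and its error budget concrete, here is a minimal sketch in Python. It assumes a request-based SLO; the `Slo` class, the journey name, and the 99.9% target are hypothetical illustrations, not values taken from the engagement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A request-based SLO. Name and target are illustrative assumptions."""
    name: str
    target: float  # e.g. 0.999 means 99.9% of requests must be good

    def __post_init__(self) -> None:
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def error_budget(slo: Slo, total_requests: int) -> float:
    """Number of bad requests the SLO tolerates over the window."""
    return (1.0 - slo.target) * total_requests


def budget_remaining(slo: Slo, total_requests: int, bad_requests: int) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing has been spent
    return 1.0 - bad_requests / error_budget(slo, total_requests)


if __name__ == "__main__":
    slo = Slo(name="checkout-availability", target=0.999)  # hypothetical journey
    print(f"budget: {error_budget(slo, 1_000_000):.0f} bad requests allowed")
    print(f"remaining: {budget_remaining(slo, 1_000_000, 250):.0%}")
```

Expressing the budget as a count of tolerated bad requests tends to make trade-off discussions with product and support easier than quoting a percentage alone.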

Week 3–4: Observability Baseline

  • Implemented tracing across gateway and top services; standardized log fields.
  • Set up exemplar dashboards for latency distributions and saturation signals.
  • Introduced synthetic checks for critical workflows and regions.
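One way to picture standardized log fields combined with log sampling is the sketch below, built on Python's standard `logging` module. The field set (`service`, `trace_id`, `region`) and the per-trace sampling rule are assumptions made for illustration; the case study does not say which fields or sampling strategy were used.

```python
import json
import logging
import zlib

# Hypothetical set of fields every log line should carry.
STANDARD_FIELDS = ("service", "trace_id", "region")


class JsonFormatter(logging.Formatter):
    """Render each record as JSON with the same keys on every line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {"level": record.levelname, "message": record.getMessage()}
        for field in STANDARD_FIELDS:
            # Missing fields are emitted as null so the schema stays stable.
            payload[field] = getattr(record, field, None)
        return json.dumps(payload, sort_keys=True)


class TraceSampler(logging.Filter):
    """Keep every WARNING and above; keep a fixed share of quieter logs.

    The keep/drop decision hashes the trace ID, so all lines belonging to
    one trace are kept or dropped together.
    """

    def __init__(self, keep_ratio: float) -> None:
        super().__init__()
        self.keep_ratio = keep_ratio

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True
        trace_id = str(getattr(record, "trace_id", ""))
        bucket = zlib.crc32(trace_id.encode("utf-8")) % 10_000
        return bucket < self.keep_ratio * 10_000


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    handler.addFilter(TraceSampler(keep_ratio=0.1))
    log = logging.getLogger("gateway")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.warning("upstream timeout",
                extra={"service": "gateway", "trace_id": "abc123", "region": "eu-west-1"})
```

Sampling by trace ID rather than per line keeps whole request traces intact in the logs, which helps when correlating them with distributed traces.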

Week 5–7: Alert Hygiene & On‑call

  • Eliminated non-actionable alerts; tied alerts to SLO burn rates.
  • Created escalation policies and ownership maps; rotated engineers through shadow on-call.
  • Documented runbooks for top 5 recurring incidents with verification steps.
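Tying alerts to SLO burn rates can be sketched as follows. The burn rate is the observed error rate divided by the error rate the SLO allows, so 1.0 means the budget is being spent exactly on schedule. The two-window check and the default threshold of 14.4 (a commonly cited value for a 1-hour window against a 30-day SLO) are assumptions; the thresholds actually used in this project are not stated.

```python
from typing import Tuple

Window = Tuple[int, int]  # (bad_requests, total_requests) observed in a window


def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if total == 0:
        return 0.0
    return (bad / total) / (1.0 - slo_target)


def should_page(long_window: Window, short_window: Window,
                slo_target: float, threshold: float = 14.4) -> bool:
    """Page only if both windows burn faster than the threshold.

    The long window shows the burn is significant; the short window
    shows it is still happening, so recovered incidents stop paging.
    """
    long_rate = burn_rate(long_window[0], long_window[1], slo_target)
    short_rate = burn_rate(short_window[0], short_window[1], slo_target)
    return long_rate >= threshold and short_rate >= threshold


if __name__ == "__main__":
    # Hypothetical counts: 2% errors over the last hour and the last 5 minutes.
    print(should_page((200, 10_000), (20, 1_000), slo_target=0.999))
```

Requiring both windows to exceed the threshold is one way to cut non-actionable pages: a brief spike that has already subsided fails the short-window check and stays quiet.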

Week 8–10: Resilience & Cost

  • Introduced circuit breakers and bulkheads on noisy dependencies.
  • Right-sized workloads; moved batch jobs off peak to cut spend by 18%.
  • Established change freeze windows and feature flags for risky deployments.
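The circuit-breaker pattern mentioned above can be illustrated with a small sketch. This is a generic, single-threaded version written for this case study, not the implementation used in the project; the class name, thresholds, and timeout are all hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    """Stop calling a failing dependency until a cool-down has passed."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, func: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency call skipped")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func()
        except Exception:
            self._failures += 1
            half_open_failed = self._opened_at is not None
            if half_open_failed or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Failing fast while the breaker is open keeps a noisy dependency from tying up threads and connections in the caller, which is the same isolation goal bulkheads pursue by partitioning resources.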

Week 11–12: Game Day & Handover

  • Ran failure injection scenarios; measured time to detect, diagnose, and recover.
  • Closed gaps from postmortems; trained leads on SLO reviews and budget policies.
  • Delivered a handover checklist and a 90-day reliability roadmap, both approved by stakeholders.
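The detection, diagnosis, and recovery measurements from a game day can be derived from four timestamps per drill. The sketch below assumes that model; the `DrillTimeline` class and its field names are hypothetical, and no real drill data is shown.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class DrillTimeline:
    """Timestamps recorded during one failure-injection drill (assumed model)."""
    injected_at: datetime   # fault introduced
    detected_at: datetime   # first alert fired
    diagnosed_at: datetime  # cause identified
    recovered_at: datetime  # service back within its SLO

    def phases(self) -> Dict[str, timedelta]:
        """Split the incident into its three measured phases."""
        return {
            "detection": self.detected_at - self.injected_at,
            "diagnosis": self.diagnosed_at - self.detected_at,
            "recovery": self.recovered_at - self.diagnosed_at,
        }


def mean_time_to_recover(drills: List[DrillTimeline]) -> timedelta:
    """Average time from fault injection to recovery across drills.

    Raises statistics.StatisticsError if the list is empty.
    """
    seconds = mean([(d.recovered_at - d.injected_at).total_seconds() for d in drills])
    return timedelta(seconds=seconds)
```

Breaking the total into phases shows where time is actually lost: slow detection points at alerting gaps, slow diagnosis at missing traces or dashboards, and slow recovery at runbook or tooling gaps.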

Results

  • MTTR reduced by 45%, helped by clearer on-call guidance.
  • Alert noise down 60%; engineers report more uninterrupted focus time.
  • All four top-level SLOs consistently met for two quarters.

Key metrics: MTTR -45% | Alert noise -60% | SLOs met 4/4

SaaS Migration — MTTR Down 45% | Maxwell Software Solutions