The challenge
SolarWinds was generating thousands of alerts per week, most of them ignored. Engineers had stopped trusting it, and a real incident could be lost in the noise. A previous attempt to move off it had stalled after stakeholders pushed back on a planned outage.
What we built
We stood up Prometheus for metrics, Loki for logs, and Grafana as the unified dashboard and alerting surface, then ran it in parallel with SolarWinds. Coverage was validated asset class by asset class, starting with non-critical infrastructure and ending with revenue-impacting systems, with a documented rollback path at each stage. Alert thresholds were re-derived from actual incident history rather than vendor defaults.
What changed
Alert volume dropped by an order of magnitude while real incidents now route to the right team within minutes. The on-call rotation reports a noticeable drop in fatigue, and the engineering team trusts the alerts again. SolarWinds was decommissioned without a maintenance window.
Stack & partners
- Grafana
- Prometheus
- Loki
- Staged parallel-run migration