Broken pipelines are not fate.
There's a fix around the corner.
That exec pinging you about a broken dashboard before the board meeting? That's what happens too often when we treat data downtime reactively instead of strategically.
That creates a whole can of worms: poor decisions, wasted engineering hours, and lost business confidence with people looking at you sideways in the office lobby.
The solution? Taking inspiration from reliability engineering:
• SLAs (Service Level Agreements) = Your plain-language promise to the business ("Dashboard updates daily, revenue figures are accurate")
• SLOs (Service Level Objectives) = Internal engineering targets ("Pipeline completes within 23hrs, 99.5% of the time")
• SLIs (Service Level Indicators) = The metrics you directly monitor, like completion time or test pass rates (see the sketch below)
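To make those three concrete, here's a minimal sketch of that SLO check. The run durations, names, and numbers are hypothetical, standing in for whatever your orchestrator's run metadata actually gives you:

```python
from datetime import timedelta

# Hypothetical durations of recent daily pipeline runs.
# In practice these would come from your orchestrator's run metadata.
run_durations = [timedelta(hours=h) for h in (18.2, 19.5, 22.1, 17.8, 25.4, 20.0, 21.3)]

SLO_TARGET = timedelta(hours=23)  # "Pipeline completes within 23hrs"
SLO_RATIO = 0.995                 # "... 99.5% of the time"

# SLI: the fraction of runs that finished within the target window.
on_time = sum(1 for d in run_durations if d <= SLO_TARGET)
sli = on_time / len(run_durations)

print(f"SLI (on-time run ratio): {sli:.3f}")
if sli >= SLO_RATIO:
    print("SLO met")
else:
    print("SLO breached: dig in before the SLA (the promise to the business) is at risk")
```

The SLI is just the measured ratio; the SLO is the internal threshold you hold yourself to, ideally well before the SLA promise is ever in danger.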
To kickstart things, you could define two SLAs for your most critical dashboard. For example:
Freshness: "Data no more than 24hrs old"
Accuracy: "Revenue never negative or drops >90% without known cause"
This would force concrete conversations about what "good data" actually means to the business. After a few months, you could implement SLOs/SLIs with observability tooling (basic dbt tests, or purpose-built platforms like Monte Carlo or Sifflet) to automate freshness and accuracy checks, then use those metrics to build your business case for Data Reliability Engineering across critical assets.
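For those two example SLAs, the checks themselves are simple even before you adopt tooling. A minimal hand-rolled sketch, with a hypothetical daily_revenue table and an in-memory sqlite database standing in for your warehouse so it runs end to end:

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# Hypothetical table behind the dashboard; sqlite is used purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_revenue (updated_at TEXT, revenue REAL)")
conn.execute(
    "INSERT INTO daily_revenue VALUES (?, ?)",
    (datetime.now(timezone.utc).isoformat(), 12500.0),
)

# Freshness SLA: "Data no more than 24hrs old"
(latest,) = conn.execute("SELECT MAX(updated_at) FROM daily_revenue").fetchone()
age = datetime.now(timezone.utc) - datetime.fromisoformat(latest)
assert age <= timedelta(hours=24), f"Freshness SLA breached: newest row is {age} old"

# Accuracy SLA: "Revenue never negative"
(bad_rows,) = conn.execute("SELECT COUNT(*) FROM daily_revenue WHERE revenue < 0").fetchone()
assert bad_rows == 0, f"Accuracy SLA breached: {bad_rows} negative revenue rows"

print("Freshness and accuracy SLAs met for daily_revenue")
```

In dbt, the same two checks map naturally onto source freshness config (warn_after / error_after on a loaded_at_field) and a not-negative column test (for example dbt_utils.accepted_range with min_value: 0); the ">90% drop without a known cause" rule would likely be a custom test comparing today's total against a trailing baseline.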
If you want to learn more about this approach, watch Miriah Peterson's full talk at Data Council: Scaling Data Reliably: a Journey in Growing Through Data Pain Points.


