Datadog builds a replayable evaluation platform for Bits AI SRE to catch agent regressions
While building Bits AI SRE, Datadog found that improvements in one area could quietly introduce regressions in another, with no reliable way to detect them. The team could not replay real production context, measure behavior consistently across diverse incidents, or track whether the agent was improving over time.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Customer feedback triggers label
When customers provide feedback on a Bits AI investigation, that signal, together with information from the investigation itself, is used to construct a ground-truth root cause analysis and a world snapshot.
Tools used
Bits AI SRE, Datadog LLM Observability, Claude Opus 4.5
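As a rough illustration of Stage 1, the sketch below shows one way a feedback event could be turned into a label that pairs a ground-truth root cause with a frozen world snapshot. Every name here (`WorldSnapshot`, `EvalLabel`, `build_label`, and the `confirmed`, `corrected_root_cause`, `conclusion`, and `tool_results` fields) is hypothetical, and the rule for deriving the ground truth is an assumption made for the example, not Datadog's documented method.

```python
import copy
from dataclasses import dataclass
from datetime import datetime, timezone
from types import MappingProxyType
from typing import Any, Mapping, Optional


@dataclass(frozen=True)
class WorldSnapshot:
    """Read-only copy of the signals recorded during the investigation."""
    captured_at: datetime
    signals: Mapping[str, Any]  # hypothetical: query key -> recorded response


@dataclass(frozen=True)
class EvalLabel:
    """One evaluation scenario: what really happened, plus the frozen context."""
    investigation_id: str
    ground_truth_root_cause: str
    snapshot: WorldSnapshot


def build_label(investigation: dict, feedback: dict) -> Optional[EvalLabel]:
    """Derive a label from customer feedback, or None if feedback is inconclusive.

    Assumed rule: a confirmed conclusion becomes the ground truth; otherwise a
    customer-supplied correction is used; with neither, no label is produced.
    """
    if feedback.get("confirmed"):
        truth = investigation["conclusion"]
    elif feedback.get("corrected_root_cause"):
        truth = feedback["corrected_root_cause"]
    else:
        return None

    # Deep-copy the recorded tool results so later changes to the live
    # environment (or to the investigation dict) cannot alter the snapshot.
    frozen = MappingProxyType(copy.deepcopy(investigation["tool_results"]))
    snapshot = WorldSnapshot(captured_at=datetime.now(timezone.utc), signals=frozen)
    return EvalLabel(investigation["id"], truth, snapshot)
```

Copying the recorded signals at label time is what makes the scenario replayable later, after the original telemetry has changed or expired.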
Outcome
The evaluation platform scaled label creation by an order of magnitude, reduced label validation time by more than 95%, improved root cause quality by roughly 30%, and now runs Bits against tens of thousands of scenarios drawn from real incidents every week.
What failed first
Testing individual tools in isolation failed because agent failures emerged from interactions between steps rather than single tool calls. Live replay of Bits investigations also did not scale because results were not aggregated, environments changed, and signals expired.
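The contrast with those failed approaches can be sketched in a few lines: run the whole agent end to end against a stored snapshot instead of the live environment, then aggregate results across scenarios so that two agent versions can be compared. The function names, the scenario layout, and the exact-match scoring below are all assumptions made for illustration; a real platform would need a richer comparison of root cause analyses.

```python
from typing import Callable, Dict, List, Mapping, Tuple

# Hypothetical interfaces: an agent receives a lookup function over recorded
# signals and returns its root-cause conclusion as a string.
Lookup = Callable[[str], object]
Agent = Callable[[Lookup], str]
Scenario = Tuple[Mapping[str, object], str]  # (snapshot signals, ground truth)


def replay(agent: Agent, snapshot: Mapping[str, object]) -> str:
    """Run the full agent against recorded signals rather than live systems."""
    def lookup(query: str) -> object:
        return snapshot.get(query)  # None when the signal was never recorded
    return agent(lookup)


def evaluate(agent: Agent, scenarios: Mapping[str, Scenario]) -> Dict[str, bool]:
    """Score every scenario; exact string match is a deliberate simplification."""
    return {
        sid: replay(agent, snapshot) == truth
        for sid, (snapshot, truth) in scenarios.items()
    }


def pass_rate(results: Mapping[str, bool]) -> float:
    """Aggregate per-scenario outcomes into one comparable number."""
    return sum(results.values()) / len(results) if results else 0.0


def regressions(baseline: Mapping[str, bool], candidate: Mapping[str, bool]) -> List[str]:
    """Scenarios the baseline solved that the candidate no longer solves."""
    return sorted(
        sid for sid, ok in baseline.items() if ok and not candidate.get(sid, False)
    )
```

Because each scenario exercises the complete investigation, a failure that only appears when steps interact still shows up, and the per-scenario results make it possible to list exactly which incidents a new version broke.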
Results
Time saved: more than 95%
Volume: increased our label creation rate by an order of magnitude