
Datadog builds a replayable evaluation platform for Bits AI SRE to catch agent regressions

As Datadog built Bits AI SRE, improvements in one area could quietly introduce regressions in another, with no reliable way to detect them. The team could not replay real production context, measure behavior consistently across diverse incidents, or track whether the agent was improving over time.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Customer feedback triggers label

When customers provide feedback on a Bits AI investigation, that signal along with investigation information is used to construct a ground truth root cause analysis and world snapshot.
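The sketch below illustrates one way a feedback signal and its investigation could be combined into an evaluation label. It is a minimal example under stated assumptions, not Datadog's implementation: the class names (Feedback, Investigation, EvalLabel), their fields, and the rule for deriving the ground truth are all hypothetical. The source does not describe how the root cause is actually derived from feedback.

```python
import copy
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Feedback:
    """Hypothetical customer feedback on one investigation."""
    investigation_id: str
    helpful: bool
    comment: str = ""  # optional free-text correction from the customer


@dataclass(frozen=True)
class Investigation:
    """Hypothetical record of what the agent concluded and what it saw."""
    investigation_id: str
    agent_root_cause: str
    signals: dict  # telemetry available at investigation time


@dataclass(frozen=True)
class EvalLabel:
    """Ground truth root cause plus a frozen copy of the world."""
    investigation_id: str
    ground_truth_root_cause: str
    world_snapshot: dict
    validated: bool = False  # flipped after a separate validation step


def build_label(feedback: Feedback, investigation: Investigation) -> Optional[EvalLabel]:
    """Turn feedback plus investigation data into a label, or None if unusable."""
    if feedback.investigation_id != investigation.investigation_id:
        raise ValueError("feedback does not belong to this investigation")

    if feedback.helpful:
        # Assumption: positive feedback confirms the agent's conclusion.
        root_cause = investigation.agent_root_cause
    elif feedback.comment.strip():
        # Assumption: a negative rating with a comment carries the correction.
        root_cause = feedback.comment.strip()
    else:
        # Negative feedback with no correction gives no ground truth.
        return None

    return EvalLabel(
        investigation_id=investigation.investigation_id,
        ground_truth_root_cause=root_cause,
        # Deep copy so later changes to live signals cannot alter the label.
        world_snapshot=copy.deepcopy(investigation.signals),
    )
```

Copying the signals at label time is the point of the snapshot: the label stays replayable even after the live environment changes or the underlying telemetry expires.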

Tools used

Bits AI SRE, Datadog LLM Observability, Claude Opus 4.5

Outcome

The evaluation platform scaled label creation by an order of magnitude, reduced label validation time by more than 95%, improved root cause quality by roughly 30%, and now runs Bits against tens of thousands of scenarios drawn from real incidents every week.

What failed first

Testing individual tools in isolation failed because agent failures emerged from interactions between steps rather than single tool calls. Live replay of Bits investigations also did not scale because results were not aggregated, environments changed, and signals expired.
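The contrast with live replay can be sketched as an offline loop that runs an agent against frozen snapshots and aggregates the results so two agent versions can be compared. This is an illustrative sketch only: the scenario dictionary keys, the function names, and the exact-match grading rule are assumptions, and a real platform would likely use a richer grader than string equality.

```python
from statistics import mean
from typing import Any, Callable, Dict, List

# Hypothetical scenario shape:
# {"id": str, "category": str, "snapshot": dict, "expected_root_cause": str}
Scenario = Dict[str, Any]
Agent = Callable[[dict], str]  # takes a frozen snapshot, returns a root cause


def replay(agent: Agent, scenarios: List[Scenario]) -> List[Dict[str, Any]]:
    """Run the whole agent end to end on each frozen snapshot."""
    results = []
    for scenario in scenarios:
        predicted = agent(scenario["snapshot"])
        expected = scenario["expected_root_cause"]
        results.append({
            "id": scenario["id"],
            "category": scenario["category"],
            # Assumption: case-insensitive exact match stands in for grading.
            "correct": predicted.strip().lower() == expected.strip().lower(),
        })
    return results


def pass_rate_by_category(results: List[Dict[str, Any]]) -> Dict[str, float]:
    """Aggregate per-scenario outcomes into a pass rate per incident category."""
    buckets: Dict[str, List[float]] = {}
    for result in results:
        buckets.setdefault(result["category"], []).append(
            1.0 if result["correct"] else 0.0
        )
    return {category: mean(scores) for category, scores in buckets.items()}


def regressions(baseline: Dict[str, float],
                candidate: Dict[str, float],
                tolerance: float = 0.0) -> List[str]:
    """List categories where the candidate scores below the baseline."""
    return sorted(
        category
        for category, score in baseline.items()
        if candidate.get(category, 0.0) < score - tolerance
    )
```

Because every run sees the same snapshots, a drop in one category's pass rate points to a change in the agent rather than a change in the environment, and the per-category breakdown exposes a regression that an overall average could hide.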

Results

Time saved: more than 95%

Volume: increased our label creation rate by an order of magnitude

Source

https://www.datadoghq.com/blog/engineering/bits-ai-eval-platform/


Grounding & classification

Source type: technical build writeup

32 fields verified against source quotes.

agentic workflow · ai agent · anomaly detection · knowledge base · builder submitted · failure mode described · human review described · metric backed · named customer · production runtime claimed · tools described · workflow described · software · accuracy improvement · error reduction · throughput increase · time saved · technical build writeup · incident management · quality assurance · agentic task execution · ai draft human approval · monitor detect alert