incident_management · saas · workflow

How Datadog Built Bits AI SRE: An Autonomous Incident Investigation Agent That Reduces Time to Resolution by Up to 95%

As distributed systems grow more dynamic and complex, production incidents span more services, involve noisier signals, and generate larger volumes of telemetry data, making it hard for on-call engineers to find root causes quickly.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement each stage are listed under “Tools used”.
Stage 1 · Monitor alert triggers investigation
Bits AI SRE automatically investigates incidents and monitor alerts when they fire.
Tools used
Bits AI SRE
Outcome

Bits AI SRE decreases time to resolution by up to 95% and has received overwhelmingly positive feedback from customers who observed reduced time to root cause detection for complex incidents.

What failed first

Early SRE agents performed many tool calls and summarized all telemetry at once, so token counts scaled linearly with incident complexity. The growing context degraded model performance, and noisy signals distracted the summarization prompt, leading to incorrect root cause identification.
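
The failure mode above can be illustrated with a minimal sketch. It is not Datadog's code: the `ToolResult` type, the `build_summary_prompt` helper, the query names, and the 4-characters-per-token estimate are all hypothetical, chosen only to show why concatenating every tool output into one summarization prompt makes prompt size grow in step with the number of tool calls.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ToolResult:
    """Output of one hypothetical telemetry query (logs, metrics, traces)."""
    tool: str
    text: str


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token. Real tokenizers differ.
    return max(1, len(text) // 4)


def build_summary_prompt(results: list[ToolResult]) -> str:
    # The "summarize everything at once" pattern: every tool output,
    # relevant or noisy, is concatenated into a single prompt.
    sections = [f"## {r.tool}\n{r.text}" for r in results]
    header = "Identify the root cause from this telemetry:\n\n"
    return header + "\n\n".join(sections)


def prompt_tokens_for(n_tool_calls: int, chars_per_result: int = 2000) -> int:
    # Simulate n tool calls that each return a fixed amount of text.
    results = [
        ToolResult(tool=f"query_{i}", text="x" * chars_per_result)
        for i in range(n_tool_calls)
    ]
    return estimate_tokens(build_summary_prompt(results))


if __name__ == "__main__":
    # Doubling the number of tool calls roughly doubles the prompt size.
    for n in (5, 10, 20, 40):
        print(f"{n:>3} tool calls -> ~{prompt_tokens_for(n)} prompt tokens")
```

In this sketch nothing filters or ranks the results before summarization, so a single noisy query carries the same weight in the prompt as the one that contains the real signal.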

Results
Time saved: up to 95%
Source

https://www.datadoghq.com/blog/building-bits-ai-sre/

Grounding & classification
Source type: technical build writeup
18 fields verified against source quotes.
agentic workflow · ai agent · builder submitted · failure mode described · metric backed · production runtime claimed · tools described · workflow described · software · cycle time reduction · resolution time reduction · technical build writeup · incident management · autonomous resolution · monitor detect alert