Meta builds AI-assisted root cause analysis to streamline incident investigations

Investigating issues in systems that depend on monolithic repositories is complex and time-consuming: responders must search thousands of code changes from many teams while rapidly building context on what is broken and who is impacted.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases rather than tied to any one vendor's stack. Specific products that implement these stages are listed under “Tools used” below.
Stage 1 · Investigation created
The workflow starts when an investigation is created, which triggers the system's root cause analysis.
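One way to picture this stage is as a handler that fires when an investigation is created and returns a short ranked list of suspect code changes. The sketch below is a minimal illustration under stated assumptions, not Meta's implementation. It assumes a retrieve-then-rank design: cheap heuristics narrow the full set of changes to a shortlist, and a scoring function then orders the shortlist. The heuristics used here (a hypothetical 24-hour look-back window plus file-path overlap with the impacted area) are assumptions. The scorer is a placeholder for an LLM-based ranker; the Tools used list names Llama models. All class, field, and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CodeChange:
    change_id: str
    touched_paths: list[str]
    landed_at: float  # epoch seconds when the change landed


@dataclass
class Investigation:
    investigation_id: str
    created_at: float  # epoch seconds when the investigation was opened
    impacted_paths: list[str]  # hypothetical: path prefixes tied to the broken surface


# Placeholder for the model-based ranker: higher score = more likely root cause.
Scorer = Callable[[Investigation, CodeChange], float]


def heuristic_retrieve(inv: Investigation, changes: list[CodeChange],
                       window_s: float = 86_400.0, limit: int = 300) -> list[CodeChange]:
    """Cheaply narrow all changes to a shortlist (assumed heuristics)."""
    scored = []
    for change in changes:
        age = inv.created_at - change.landed_at
        if not 0 <= age <= window_s:  # keep only changes that landed shortly before
            continue
        overlap = sum(1 for p in change.touched_paths
                      for prefix in inv.impacted_paths if p.startswith(prefix))
        if overlap:  # keep only changes that touch the impacted area
            scored.append((overlap, change))
    # More overlapping files first; ties broken by most recently landed.
    scored.sort(key=lambda pair: (-pair[0], -pair[1].landed_at))
    return [change for _, change in scored[:limit]]


def rank_top_k(inv: Investigation, candidates: list[CodeChange],
               score: Scorer, k: int = 5) -> list[CodeChange]:
    """Order the shortlist with the (stand-in) ranker and keep the top k."""
    return sorted(candidates, key=lambda c: score(inv, c), reverse=True)[:k]


def on_investigation_created(inv: Investigation, changes: list[CodeChange],
                             score: Scorer) -> list[CodeChange]:
    """Entry point: run root cause analysis as soon as an investigation exists."""
    return rank_top_k(inv, heuristic_retrieve(inv, changes), score)
```

Splitting the work this way is a common design choice. The retrieval step is cheap and discards most changes, so the expensive ranker only has to score a shortlist rather than every change in the repository.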
Tools used
Llama · Llama 2 · Hawkeye
Outcome

The AI-assisted root cause analysis system achieves 42% accuracy in surfacing the true root cause within its top five suggested code changes at investigation creation time for the web monorepo, reducing the effort and time needed to isolate root causes.
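The 42% figure is a top-five accuracy. An investigation counts as a hit when its true root cause appears among the first five suggested changes. A minimal sketch of that metric follows. It assumes each evaluated investigation is recorded as a pair of (ranked suggestion IDs, true root-cause ID). The function name and the toy records are made up for illustration and are not Meta's evaluation data.

```python
def top_k_accuracy(results: list[tuple[list[str], str]], k: int = 5) -> float:
    """Fraction of investigations whose true root cause is in the first k suggestions."""
    if not results:
        raise ValueError("need at least one evaluated investigation")
    hits = sum(1 for suggestions, truth in results if truth in suggestions[:k])
    return hits / len(results)


# Toy records: (ranked suggestions, true root cause). Illustrative only.
toy = [
    (["D1", "D2", "D3", "D4", "D5"], "D3"),        # hit at rank 3
    (["D9", "D8", "D7", "D6", "D5", "D4"], "D4"),  # true cause at rank 6: miss for k=5
    (["D1"], "D2"),                                 # miss
    (["D2", "D7"], "D7"),                           # hit at rank 2
]
print(top_k_accuracy(toy))       # 0.5
print(top_k_accuracy(toy, k=6))  # 0.75
```

Under this definition, increasing k can only raise the score or leave it unchanged. Choosing k = 5 therefore trades the length of the list a responder must review against the hit rate.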

Results
Time saved: significantly reduces the effort and time needed to root-cause an investigation
Accuracy: 42% (true root cause within the top five suggested changes)
Source

https://engineering.fb.com/2024/06/24/data-infrastructure/leveraging-ai-for-efficient-incident-response/

Grounding & classification
Source type: technical build writeup
24 fields verified against source quotes.
anomaly detection · predictive analytics · rag · code diff pr · knowledge base · failure mode described · human review described · metric backed · named customer · production runtime claimed · tools described · workflow described · software · accuracy improvement · cycle time reduction · employee productivity · technical build writeup · incident management · extract classify route · human review queue