incident_management · saas · workflow
Microsoft Research uses LLMs to recommend root cause and mitigation steps for cloud incidents
Hyperscale cloud services such as Microsoft 365 must detect incidents quickly and perform root cause analysis and mitigation at scale. Resolving incidents manually requires significant engineering effort.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Incident ticket created
When an incident ticket is created, the author specifies a title and describes relevant details such as error messages and anomalous behavior.
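The ticket fields described above can be sketched as a small data structure plus a prompt formatter. This is a minimal illustration, not the documented implementation: the class name IncidentTicket, the function build_prompt, and the prompt wording are all hypothetical. The source states only that a ticket carries a title and a description of relevant details.

```python
from dataclasses import dataclass


@dataclass
class IncidentTicket:
    """Hypothetical ticket shape: the source names only a title and free-text details."""
    title: str
    details: str  # e.g. error messages and anomalous behavior


def build_prompt(ticket: IncidentTicket, task: str) -> str:
    """Format a ticket as model input for one task: "root cause" or "mitigation".

    The prompt layout is an assumption made for illustration.
    """
    if task not in ("root cause", "mitigation"):
        raise ValueError(f"unknown task: {task!r}")
    return (
        f"Incident title: {ticket.title.strip()}\n"
        f"Incident details: {ticket.details.strip()}\n"
        f"Recommended {task}:"
    )


if __name__ == "__main__":
    example = IncidentTicket(
        title="Mail delivery delays",
        details="Queue length alerts firing; clients report SMTP 451 errors.",
    )
    print(build_prompt(example, "root cause"))
```

Keeping the two tasks behind one formatter means the same ticket can be sent once for a root cause recommendation and once for a mitigation recommendation.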
Tools used
GPT-3, GPT-3.5, IcM
Outcome
Fine-tuned GPT-3.5 substantially outperformed GPT-3 models, improving average lexical similarity by 45.5% for root cause generation and 131.3% for mitigation generation over zero-shot settings; more than 70% of on-call engineers found the recommendations useful.
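To make the arithmetic behind a figure like "improved by 45.5% over zero-shot" concrete, the sketch below scores a generated recommendation against a reference and computes the relative gain between two average scores. It rests on assumptions: the text above does not say which lexical similarity metric was used, so unigram-overlap F1 stands in for it, and the function names are hypothetical.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two strings (a stand-in lexical metric, assumed)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection of tokens
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def relative_gain_pct(baseline: float, improved: float) -> float:
    """Percentage improvement of `improved` over `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline score must be positive")
    return (improved - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Made-up strings and scores for illustration only; not data from the study.
    print(unigram_f1("restart service", "restart the service"))
    print(relative_gain_pct(baseline=0.20, improved=0.30))
```

A relative gain above 100%, as reported for mitigation generation, means the fine-tuned average score is more than double the zero-shot average.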
Results
Volume: 45.5%
Grounding & classification
Source type: technical build writeup
25 fields verified against source quotes.
content generation, summarization, support ticket, human review described, metric backed, named customer, production runtime claimed, source backed, tools described, workflow described, software, accuracy improvement, technical build writeup, incident management, ai draft human approval