incident.io builds reliable AI into incident management with evals infrastructure and an AI SRE product
Building reliable AI features for a reliability-critical product proved far harder than building prototypes. Because AI output is non-deterministic, 'mostly right' was unacceptable when customers make real-world decisions during critical incidents, and as the team scaled its efforts, improving one area repeatedly caused regressions in another.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases, not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · LLM release triggers exploration
The release of GPT-3.5 and ChatGPT prompted the team to explore what AI could mean for their product.
Tools used
Claude Code, Slack, Scribe, Investigations
Outcome
incident.io built internal evals, scoring frameworks, backtests, and training sets that made their AI product genuinely dependable, and now ships automatically generated post-mortems, incident summaries, a dashboard Q&A, and a new AI SRE product aimed at substantially reducing downtime and noise.
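The source does not describe how these evals and backtests are implemented. As a rough illustration only, the sketch below shows one common shape for a backtest: score a baseline and a candidate model on the same labeled cases, grouped by feature area, and flag any area whose pass rate dropped. Every name here (EvalCase, pass_rates, find_regressions, the stub models, the substring scorer) is a hypothetical stand-in, not incident.io's code.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class EvalCase:
    category: str  # hypothetical feature area, e.g. "summary" or "dashboard_qa"
    prompt: str
    expected: str  # text the output must contain to count as a pass


def pass_rates(cases: List[EvalCase], generate: Callable[[str], str]) -> Dict[str, float]:
    """Return the fraction of passing cases per category."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for case in cases:
        total[case.category] += 1
        # Toy scorer (an assumption): case-insensitive substring match.
        if case.expected.lower() in generate(case.prompt).lower():
            passed[case.category] += 1
    return {category: passed[category] / total[category] for category in total}


def find_regressions(
    baseline: Dict[str, float], candidate: Dict[str, float], tolerance: float = 0.0
) -> Dict[str, Tuple[float, float]]:
    """Return categories where the candidate scores below the baseline."""
    return {
        category: (score, candidate.get(category, 0.0))
        for category, score in baseline.items()
        if candidate.get(category, 0.0) < score - tolerance
    }


# Stub models standing in for real LLM calls (assumptions for the demo).
def baseline_model(prompt: str) -> str:
    if prompt.endswith("?"):
        return "There were 3 incidents."
    return prompt


def candidate_model(prompt: str) -> str:
    if prompt.endswith("?"):
        return "I am not sure."
    return prompt


cases = [
    EvalCase("summary", "Summarise: database outage", "database"),
    EvalCase("summary", "Summarise: payment API errors", "payment"),
    EvalCase("dashboard_qa", "How many incidents last week?", "3"),
]

baseline_scores = pass_rates(cases, baseline_model)
candidate_scores = pass_rates(cases, candidate_model)
regressions = find_regressions(baseline_scores, candidate_scores)
print(regressions)
```

Grouping scores by category is what lets a check like this surface the failure mode described earlier, where an improvement in one area hides a regression in another. A production scorer would likely be richer than a substring match, for example a rubric or a model-graded comparison.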
What failed first
Early AI features worked as demos but failed to perform consistently across all customers, datasets, and question phrasings, immediately eroding user trust when failures occurred.