
incident.io builds reliable AI into incident management with evals infrastructure and an AI SRE product

Building reliable AI features for a reliability-critical product proved far harder than building prototypes. Because AI output is non-deterministic, 'mostly right' was unacceptable when customers make real-world decisions during critical incidents, and as the team scaled its efforts, improving one area caused regressions in another.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases — not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · LLM release triggers exploration
The release of GPT-3.5 and ChatGPT prompted the team to explore what AI could mean for their product.
Tools used
Claude Code, Slack, Scribe, Investigations
Outcome

incident.io built internal evals, scoring frameworks, backtests, and training sets that made their AI product genuinely dependable, and now ships automatically generated post-mortems, incident summaries, a dashboard Q&A, and a new AI SRE product aimed at substantially reducing downtime and noise.

What failed first

Early AI features worked as demos but did not perform consistently across all customers, datasets, and question phrasings, and each failure immediately eroded user trust.

Results
Time saved: substantially reduce downtime
Source

https://incident.io/building-with-ai/weaving-ai-into-the-fabric-of-incident-io


Grounding & classification
Source type: technical build writeup
25 fields verified against source quotes.
Tags: agentic workflow, content generation, conversational ai, summarization, knowledge base, support ticket, failure mode described, named customer, production runtime claimed, source backed, tools described, workflow described, software, cycle time reduction, employee productivity, technical build writeup, it support, ticket triage, agentic task execution