ticket_triage · workflow

incident.io builds reliable AI into incident management with evals infrastructure and an AI SRE product

Building reliable AI features for a reliability-critical product proved far harder than building prototypes. AI's non-determinism meant 'mostly right' was unacceptable when customers make real-world decisions during critical incidents, and as the work scaled, improving one area caused regressions in another.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · LLM release triggers exploration

The release of GPT-3.5 and ChatGPT prompted the team to explore what AI could mean for their product.

Tools used

Claude Code · Slack · Scribe · Investigations

Outcome

incident.io built internal evals, scoring frameworks, backtests, and training sets that made their AI product genuinely dependable, and now ships automatically generated post-mortems, incident summaries, a dashboard Q&A, and a new AI SRE product aimed at substantially reducing downtime and noise.
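
To make the idea of evals and backtests concrete, here is a minimal sketch of a per-category regression check: score a baseline and a candidate version on the same labelled cases, then flag any category where the candidate does worse. It is illustrative only and is not incident.io's implementation; the case data, the category names, the exact-match scoring, and the functions `score_by_category` and `find_regressions` are all assumptions introduced for this example.

```python
from collections import defaultdict
from statistics import mean


def score_by_category(cases, answer_fn):
    """Return the mean exact-match score per category for one version."""
    scores = defaultdict(list)
    for case in cases:
        predicted = answer_fn(case["question"])
        scores[case["category"]].append(1.0 if predicted == case["expected"] else 0.0)
    return {category: mean(values) for category, values in scores.items()}


def find_regressions(baseline, candidate, tolerance=0.0):
    """List categories where the candidate scores below the baseline."""
    return sorted(
        category
        for category, base_score in baseline.items()
        if candidate.get(category, 0.0) < base_score - tolerance
    )


# Hypothetical labelled cases; a real training set would hold recorded
# questions and reviewed answers rather than placeholders.
CASES = [
    {"category": "summary", "question": "q1", "expected": "a1"},
    {"category": "summary", "question": "q2", "expected": "a2"},
    {"category": "dashboard_qa", "question": "q3", "expected": "a3"},
    {"category": "dashboard_qa", "question": "q4", "expected": "a4"},
]


def baseline_version(question):
    # Stand-in for the currently shipped prompt or model (assumption).
    return {"q1": "a1", "q2": "wrong", "q3": "a3", "q4": "a4"}[question]


def candidate_version(question):
    # Stand-in for a change that fixes summaries but breaks dashboard Q&A.
    return {"q1": "a1", "q2": "a2", "q3": "a3", "q4": "wrong"}[question]


baseline_scores = score_by_category(CASES, baseline_version)
candidate_scores = score_by_category(CASES, candidate_version)
regressions = find_regressions(baseline_scores, candidate_scores)
print(regressions)
```

In this toy data the candidate improves the summary category while lowering dashboard Q&A, which is the "one area regresses when another improves" pattern described above. Comparing per category rather than on one aggregate score is what surfaces it, since the overall average is unchanged.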

What failed first

Early AI features worked as demos but failed to perform consistently across all customers, datasets, and question phrasings, immediately eroding user trust when failures occurred.

Results

Time saved: substantially reduce downtime

Source

https://incident.io/building-with-ai/weaving-ai-into-the-fabric-of-incident-io


Grounding & classification

Source type: technical build writeup

25 fields verified against source quotes.

agentic workflow · content generation · conversational ai · summarization · knowledge base · support ticket · failure mode described · named customer · production runtime claimed · source backed · tools described · workflow described · software · cycle time reduction · employee productivity · technical build writeup · it support · ticket triage · agentic task execution