incident.io builds Workbench, an internal AI evaluation suite for their incident investigation agent
As incident.io moved from tightly focused first-generation AI features to a complex AI agent for incident investigation, triage, and resolution, their existing lightweight tooling proved insufficient: it lacked the eval suites, graders, and scorecards needed to ensure quality at that scale.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · @incident interaction trigger
A user interaction via @incident triggers LLM prompts that classify and score the interaction.
Tools used
Workbench, LLM, Grafana, Sonnet 3.7
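To make Stage 1 concrete, here is a minimal sketch of how an interaction might be classified and scored with two LLM prompts. It is not incident.io's implementation: the category labels, the 1 to 5 scale, the prompt wording, and every name in the code (`LLMFn`, `Grade`, `classify`, `score`, `grade_interaction`, `fake_llm`) are assumptions made for illustration. The model is passed in as a plain function so the example runs without any vendor SDK.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical type: any function that takes a prompt and returns the model's text reply.
LLMFn = Callable[[str], str]

# Illustrative labels only; the source does not list the real categories.
CATEGORIES = ("question", "action_request", "other")


@dataclass
class Grade:
    category: str
    score: int  # assumed scale: 1 (poor) to 5 (excellent)


def classify(interaction: str, llm: LLMFn) -> str:
    """Ask the model for a single label; fall back to 'other' on anything unexpected."""
    prompt = (
        "Classify this interaction as one of: "
        + ", ".join(CATEGORIES)
        + ".\nReply with the label only.\n\n"
        + interaction
    )
    label = llm(prompt).strip().lower()
    return label if label in CATEGORIES else "other"


def score(interaction: str, llm: LLMFn) -> int:
    """Ask the model for a 1-5 rating; clamp out-of-range values, use 1 if unparseable."""
    prompt = (
        "Rate the helpfulness of the response from 1 to 5. "
        "Reply with the number only.\n\n" + interaction
    )
    try:
        value = int(llm(prompt).strip())
    except ValueError:
        return 1  # assumed conservative fallback for a reply that is not a number
    return max(1, min(5, value))


def grade_interaction(interaction: str, llm: LLMFn) -> Grade:
    """Run both prompts for one interaction and bundle the results."""
    return Grade(category=classify(interaction, llm), score=score(interaction, llm))


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "question" if prompt.startswith("Classify") else "4"
```

Calling `grade_interaction("User: @incident what changed before the outage?", fake_llm)` returns `Grade(category='question', score=4)`. In a real pipeline `fake_llm` would be replaced by a call to the chosen model, and the resulting grades could be stored and charted. Validating the model's reply before trusting it keeps one malformed response from corrupting downstream aggregates.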
Outcome
incident.io built Workbench, a bespoke internal AI evaluation suite that enabled rapid iteration, a single pane of glass for debugging LLM interactions, and privacy-preserving performance analysis of their Investigations agent without exposing customer data to staff.
What failed first
Off-the-shelf AI tooling existed, but the team rejected it. Choosing a product on the strength of vendor marketing rather than first-hand experience risked adopting something built for a different team's context, and it would have meant skipping the work of learning AI engineering from first principles.