Snorkel AI builds expert-verified benchmark dataset for evaluating AI agents in insurance underwriting
Enterprise AI agents applied to specialized domains like insurance underwriting are often inaccurate and inefficient because AI R&D has focused on easily verifiable settings with plentiful data, leaving specialized domains without quality benchmarks or expert-validated evaluation.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Underwriter submits applicant info
A junior underwriter, working from applicant information that is sometimes incomplete, asks the AI copilot for help.
The benchmark revealed model accuracies ranging from single digits to approximately 80%. It also gave granular, actionable insight into error modes, including tool use failures and domain-knowledge hallucinations, which enables targeted model development.
What failed first
Frontier AI models showed significant failure modes on the insurance underwriting benchmark. Tool call errors occurred in 36% of conversations, even for top-performing models, and OpenAI models hallucinated insurance products that were not in the provided guidelines in 15% to 45% of product recommendation conversations.
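The two failure modes above are both per-conversation rates, so they can be tallied with a small script. The following is a minimal sketch in Python, not the benchmark's actual scoring code. The `Conversation` record layout, the `error_mode_rates` function, and the product names in the usage example are all hypothetical and chosen only for illustration. It assumes a recommendation counts as hallucinated when a recommended product name does not appear in the list of products from the provided guidelines.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """One benchmark conversation (hypothetical record layout)."""
    model: str
    tool_call_failed: bool  # True if any tool call in the conversation errored
    recommended_products: list[str] = field(default_factory=list)


def error_mode_rates(conversations, allowed_products):
    """Return per-model tool call error and hallucinated-product rates.

    The hallucination rate is computed only over conversations that
    contain at least one product recommendation.
    """
    allowed = {p.lower() for p in allowed_products}
    tallies = {}
    for conv in conversations:
        t = tallies.setdefault(
            conv.model,
            {"total": 0, "tool_errors": 0, "with_recs": 0, "hallucinated": 0},
        )
        t["total"] += 1
        t["tool_errors"] += int(conv.tool_call_failed)
        if conv.recommended_products:
            t["with_recs"] += 1
            # Assumption: a product is hallucinated if its name is not
            # in the guideline list (case-insensitive exact match).
            if any(p.lower() not in allowed for p in conv.recommended_products):
                t["hallucinated"] += 1
    return {
        model: {
            "tool_error_rate": t["tool_errors"] / t["total"],
            "hallucination_rate": (
                t["hallucinated"] / t["with_recs"] if t["with_recs"] else 0.0
            ),
        }
        for model, t in tallies.items()
    }


# Usage with made-up data: four conversations for one hypothetical model.
guideline_products = ["general liability", "property", "umbrella"]
sample = [
    Conversation("model-a", True, ["General Liability"]),
    Conversation("model-a", False, ["Pet Insurance"]),  # not in the guidelines
    Conversation("model-a", False, []),                 # no recommendation made
    Conversation("model-a", False, ["Property", "Umbrella"]),
]
rates = error_mode_rates(sample, guideline_products)
```

Exact name matching is the simplest possible check. A real evaluation would likely need to handle paraphrased or abbreviated product names, which this sketch does not attempt.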