Snorkel AI benchmark evaluates frontier model AI agents for insurance underwriting across task types and error modes
AI agents in enterprise settings are often inaccurate and inefficient because they are not tuned to the critical details of enterprise problems, while AI research has focused on generic use cases that do not translate to enterprise settings.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases rather than tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Underwriter submits application info
A junior underwriter gives the AI copilot information about an insurance applicant; that information is sometimes incomplete.
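As a rough illustration of what "incomplete" input can look like in this stage, the sketch below models an application as a record with optional fields and lists the ones still missing, which a copilot could turn into follow-up questions. The `Application` class, its field names, and `missing_fields` are hypothetical and are not taken from the benchmark.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class Application:
    """Hypothetical applicant record; every field may be absent at submission time."""
    business_name: Optional[str] = None
    naics_code: Optional[str] = None
    annual_revenue: Optional[float] = None
    state: Optional[str] = None


def missing_fields(app: Application) -> list[str]:
    """Return the names of fields the underwriter has not supplied yet, in declaration order."""
    return [f.name for f in fields(app) if getattr(app, f.name) is None]


# Toy example (made-up data): two of four fields were provided.
app = Application(business_name="Acme Bakery", state="OH")
print(missing_fields(app))  # ['naics_code', 'annual_revenue']
```

Treating missing values as explicit `None` entries, rather than guessing defaults, keeps the gap visible so the agent can ask about it instead of assuming.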
Tools used
LangGraph, Model Context Protocol (MCP), ReAct, Snorkel's evaluation suite
Outcome
The benchmark revealed a wide accuracy range from the single digits up to approximately 80% across frontier models, with even the three most accurate models making tool call errors in 30-50% of conversations, illustrating that top frontier models struggle in surprising ways with proprietary enterprise knowledge.
What failed first
Frontier models made tool call errors in 36% of conversations despite having the metadata needed to use the tools correctly. Top OpenAI models hallucinated insurance products that were not in the provided guidelines 15-45% of the time, and those hallucinations also led to misleading questions to the underwriter.
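The two error modes above (tool call errors and products not found in the guidelines) can be measured per conversation. The following is a minimal sketch of such a tally, assuming each conversation has already been reduced to a count of failed tool calls and a list of product names the agent mentioned. The `Conversation` layout, the function names, and the sample product names are illustrative assumptions, not Snorkel's evaluation suite or its data.

```python
from dataclasses import dataclass, field


@dataclass
class Conversation:
    """Hypothetical summary of one evaluated agent conversation."""
    tool_calls_failed: int  # tool calls that errored or were malformed
    recommended_products: list[str] = field(default_factory=list)


def tool_error_rate(conversations: list[Conversation]) -> float:
    """Fraction of conversations with at least one failed tool call."""
    if not conversations:
        return 0.0
    with_errors = sum(1 for c in conversations if c.tool_calls_failed > 0)
    return with_errors / len(conversations)


def hallucinated_products(conv: Conversation, guideline_products: list[str]) -> list[str]:
    """Products the agent named that are absent from the provided guidelines (case-insensitive)."""
    allowed = {p.casefold() for p in guideline_products}
    return [p for p in conv.recommended_products if p.casefold() not in allowed]


def hallucination_rate(conversations: list[Conversation], guideline_products: list[str]) -> float:
    """Fraction of conversations that mention at least one product outside the guidelines."""
    if not conversations:
        return 0.0
    flagged = sum(1 for c in conversations if hallucinated_products(c, guideline_products))
    return flagged / len(conversations)


# Toy data for illustration only; these are not benchmark results.
guidelines = ["General Liability", "Cyber Liability", "Workers Compensation"]
convs = [
    Conversation(0, ["General Liability"]),
    Conversation(2, ["Cyber Liability", "Pet Insurance"]),
    Conversation(1, []),
    Conversation(0, ["workers compensation"]),
]
print(tool_error_rate(convs))                       # 0.5
print(hallucinated_products(convs[1], guidelines))  # ['Pet Insurance']
print(hallucination_rate(convs, guidelines))        # 0.25
```

Exact string matching against the guideline list is a deliberate simplification; a real evaluation would likely need to handle paraphrased or abbreviated product names.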