quality_assurance · saas · workflow
Zapier iterates AI products from sub-50% to 90%+ accuracy using Braintrust evals
AI teams commonly get stuck after shipping a v1 because there is no reliable way to know whether a prompt or code change improves overall performance or introduces regressions elsewhere.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Validate idea with frontier models
Zapier validates AI feature ideas by quickly cycling through prompts using GPT-4 Turbo and Claude Opus.
Tools used
Braintrust · GPT-4 Turbo · Claude Opus · GPT-4o
Outcome
Using an eval-driven feedback loop through Braintrust, Zapier improved many of their AI products from sub-50% accuracy to 90%+ within 2-3 months.
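To make the idea of an eval-driven feedback loop concrete, here is a minimal sketch in plain Python. It does not use the Braintrust SDK and is not Zapier's implementation. The dataset, the two task functions, and the exact-match scoring are all hypothetical stand-ins. It shows the core pattern: score a baseline and a candidate on the same fixed examples, then list any inputs the candidate newly gets wrong.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (input text, expected label)

# Hypothetical labelled examples; a real eval set would be drawn from product data.
DATASET: List[Example] = [
    ("refund request", "billing"),
    ("password reset", "account"),
    ("invoice copy", "billing"),
    ("change email", "account"),
]


def accuracy(task: Callable[[str], str], dataset: List[Example]) -> float:
    """Fraction of examples where the task output exactly matches the expected label."""
    correct = sum(1 for text, expected in dataset if task(text) == expected)
    return correct / len(dataset)


def compare(
    baseline: Callable[[str], str],
    candidate: Callable[[str], str],
    dataset: List[Example],
) -> Dict[str, object]:
    """Score both versions on the same data and list inputs the candidate newly fails."""
    regressions = [
        text
        for text, expected in dataset
        if baseline(text) == expected and candidate(text) != expected
    ]
    return {
        "baseline": accuracy(baseline, dataset),
        "candidate": accuracy(candidate, dataset),
        "regressions": regressions,
    }


# Stand-ins for model calls (assumption: in practice each would wrap a prompt plus an LLM).
def baseline_task(text: str) -> str:
    return "billing"


def candidate_task(text: str) -> str:
    return "billing" if ("refund" in text or "invoice" in text) else "account"


report = compare(baseline_task, candidate_task, DATASET)
print(report)
```

A change would be accepted only when the candidate score rises and the regressions list is empty, which is the check that a v1 without evals lacks.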
Results
Time saved: 2-3 months
Volume: sub-50% accuracy to 90%+
Running since: August 2023
Grounding & classification
Source type: vendor customer story
15 fields verified against source quotes, 1 dropped as unverifiable.
metric backed · named customer · production runtime claimed · tools described · workflow described · software · accuracy improvement · vendor customer story · quality assurance · ai draft human approval