quality_assurance · saas · workflow

Zapier iterates AI products from sub-50% to 90%+ accuracy using Braintrust evals

AI teams commonly get stuck after shipping a v1 because there is no reliable way to know whether a prompt or code change improves overall performance or introduces regressions elsewhere.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Validate idea with frontier models
Zapier validates AI feature ideas by quickly cycling through prompts using GPT-4 Turbo and Claude Opus.
Tools used
Braintrust, GPT-4 Turbo, Claude Opus, GPT-4o
Outcome

Using an eval-driven feedback loop through Braintrust, Zapier improved many of their AI products from sub-50% accuracy to 90%+ within 2-3 months.
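To make the idea of an eval-driven feedback loop concrete, here is a minimal sketch in plain Python. It is not Zapier's code and does not use the Braintrust SDK. The dataset, the two task functions (stand-ins for two prompt versions), and the exact-match scorer are all hypothetical. It shows how scoring every test case, rather than only tracking an aggregate number, reveals both the overall gain and any regressions a change introduces.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical labeled test set: (input, expected output) pairs.
DATASET: List[Tuple[str, str]] = [
    ("2+2", "4"),
    ("capital of France", "Paris"),
    ("opposite of hot", "cold"),
    ("3*3", "9"),
]


def run_eval(task: Callable[[str], str],
             dataset: List[Tuple[str, str]]) -> Dict[str, bool]:
    """Run the task on every case and record pass/fail by exact match."""
    return {inp: task(inp).strip() == expected for inp, expected in dataset}


def accuracy(results: Dict[str, bool]) -> float:
    """Fraction of cases that passed."""
    return sum(results.values()) / len(results)


def compare(baseline: Dict[str, bool],
            candidate: Dict[str, bool]) -> Tuple[List[str], List[str]]:
    """Return (improved, regressed) case inputs between two runs."""
    improved = [k for k in baseline if not baseline[k] and candidate[k]]
    regressed = [k for k in baseline if baseline[k] and not candidate[k]]
    return improved, regressed


# Stand-ins for two prompt versions. A real task would call a model here.
def baseline_task(inp: str) -> str:
    return {"2+2": "4", "capital of France": "Lyon"}.get(inp, "")


def candidate_task(inp: str) -> str:
    canned = {"capital of France": "Paris", "opposite of hot": "cold", "3*3": "9"}
    return canned.get(inp, "")


baseline_results = run_eval(baseline_task, DATASET)
candidate_results = run_eval(candidate_task, DATASET)
improved, regressed = compare(baseline_results, candidate_results)

print(f"baseline accuracy:  {accuracy(baseline_results):.2f}")
print(f"candidate accuracy: {accuracy(candidate_results):.2f}")
print(f"improved cases:  {improved}")
print(f"regressed cases: {regressed}")
```

In this toy run the candidate raises accuracy from 0.25 to 0.75 but breaks one case that previously passed. That kind of regression stays hidden when only the aggregate score is tracked.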

Results
Time saved: 2-3 months
Volume: sub-50% accuracy to 90%+
Running since: August 2023
Source

https://www.braintrust.dev/blog/zapier-ai

Grounding & classification
Source type: vendor customer story
15 fields verified against source quotes, 1 dropped as unverifiable.
metric backed · named customer · production runtime claimed · tools described · workflow described · software · accuracy improvement · vendor customer story · quality assurance · ai draft human approval