quality_assurance · saas · workflow

Zapier iterates AI products from sub-50% to 90%+ accuracy using Braintrust evals

AI teams commonly get stuck after shipping a v1 because there is no reliable way to know whether a prompt or code change improves overall performance or introduces regressions elsewhere.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Validate idea with frontier models

Zapier validates AI feature ideas by quickly cycling through prompts using GPT-4 Turbo and Claude Opus.

Tools used

Braintrust, GPT-4 Turbo, Claude Opus, GPT-4o

Outcome

Using an eval-driven feedback loop through Braintrust, Zapier improved many of their AI products from sub-50% accuracy to 90%+ within 2-3 months.
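To make the idea of an eval-driven feedback loop concrete, here is a minimal sketch in plain Python. It is not Zapier's code and does not use the Braintrust SDK. The eval set, the exact-match scoring, and the names `run_eval`, `accuracy`, `regressions`, `prompt_v1`, and `prompt_v2` are all hypothetical, chosen only to show how a fixed set of cases can reveal a regression that a single overall score hides.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical eval set of (input, expected label) pairs. Illustrative only;
# a real set would be drawn from actual product traffic.
EVAL_SET: List[Tuple[str, str]] = [
    ("refund my order", "billing"),
    ("reset my password", "account"),
    ("app crashes on login", "bug"),
    ("change my email", "account"),
]


def run_eval(task: Callable[[str], str]) -> Dict[str, bool]:
    """Run the task on every case and record exact-match pass/fail per input."""
    return {text: task(text) == expected for text, expected in EVAL_SET}


def accuracy(results: Dict[str, bool]) -> float:
    """Fraction of cases that passed."""
    return sum(results.values()) / len(results)


def regressions(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> List[str]:
    """Cases that passed under the baseline but fail under the candidate."""
    return [case for case, ok in baseline.items() if ok and not candidate[case]]


# Stand-ins for two prompt versions. In practice each would call a model;
# simple keyword rules are used here so the sketch runs without network access.
def prompt_v1(text: str) -> str:
    return "account" if "password" in text or "email" in text else "billing"


def prompt_v2(text: str) -> str:
    if "crash" in text:
        return "bug"
    if "password" in text:
        return "account"
    return "billing"


if __name__ == "__main__":
    base = run_eval(prompt_v1)
    cand = run_eval(prompt_v2)
    print(f"baseline accuracy:  {accuracy(base):.2f}")
    print(f"candidate accuracy: {accuracy(cand):.2f}")
    print(f"regressions: {regressions(base, cand)}")
```

In this toy case both versions score the same overall, yet the candidate fixes one case and breaks another. Comparing results per case, rather than only the aggregate, is what surfaces that kind of trade-off.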

Results

Time saved: 2-3 months

Volume: sub-50% accuracy to 90%+

Running since: August 2023

Source

https://www.braintrust.dev/blog/zapier-ai


Grounding & classification

Source type: vendor customer story

15 fields verified against source quotes, 1 dropped as unverifiable.

metric backed, named customer, production runtime claimed, tools described, workflow described, software, accuracy improvement, vendor customer story, quality assurance, ai draft human approval