quality_assurance · saas · workflow
Notion uses Braintrust to deploy frontier AI models within hours and keep 70 engineers aligned on evaluations
As Notion's AI grew from simple prompt chains to agentic workflows with combinatorial evaluation paths, quality problems became hard to find at scale, and existing databases began breaking under the load of large LLM traces.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Customer experience review
Reviewing the worst customer experiences in Braintrust set the foundation for Notion's evaluation practices.
Tools used
Braintrust · Brainstore
Outcome
Notion now deploys frontier AI models within hours of release, with 80% of AI team work grounded in Braintrust evaluation feedback, 70 engineers aligned on evaluation practices, and meaningful quality improvements for APAC multilingual customers.
What failed first
Before Braintrust, quality problems at scale went unidentified, and as AI prompts grew to hundreds of thousands of tokens, standard search was too slow to navigate massive traces.
Results
Time saved: within hours of release
Volume: 80%
Grounding & classification
Source type: vendor customer story
20 fields verified against source quotes.
agentic workflow · knowledge base · human review described · metric backed · named customer · production runtime claimed · tools described · workflow described · software · accuracy improvement · cycle time reduction · employee productivity · vendor customer story · quality assurance · monitor detect alert