
Notion uses Braintrust to deploy frontier AI models within hours and keep 70 engineers aligned on evaluations

As Notion's AI grew from simple prompt chains to agentic workflows with combinatorial evaluation paths, quality problems became hard to find at scale, and existing databases began breaking under the load of large LLM traces.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools used” below.

Stage 1 · Customer experience review

Reviewing the worst customer experiences in Braintrust set the foundation for Notion's evaluation practices.

Tools used

Braintrust, Brainstore

Outcome

Notion now deploys frontier AI models within hours of release, with 80% of AI team work grounded in Braintrust evaluation feedback, 70 engineers aligned on evaluation practices, and meaningful quality improvements for APAC multilingual customers.

What failed first

Before Braintrust, quality problems at scale went unidentified, and as AI prompts grew to hundreds of thousands of tokens, standard search was too slow to navigate massive traces.

Results

Time saved: within hours of release

Volume: 80%

Source

https://www.braintrust.dev/blog/notion


Grounding & classification

Source type: vendor customer story

20 fields verified against source quotes.

agentic workflow, knowledge base, human review described, metric backed, named customer, production runtime claimed, tools described, workflow described, software, accuracy improvement, cycle time reduction, employee productivity, vendor customer story, quality assurance, monitor detect alert