quality_assurance · saas · workflow

Databricks coSTAR: automated AI agent testing reduces review cycle from two-week manual reviews to hours

Databricks' AI agent development relied on a slow, manual review-and-fix loop with no comprehensive automated test suite, making it impossible to iterate on agents with confidence as they grew in complexity and scope.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases rather than tied to any one vendor's stack. Specific products that implement these stages appear in “Tools used” below.

Stage 1 · Scenario definition

Each agent test is defined as a scenario — a structured description of the initial state, the user prompt, and the expected outcomes.
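The scenario structure described above can be sketched in a few lines of Python. This is a minimal illustration, not coSTAR's actual schema: the `Scenario` class, its field names, the `run_scenario` helper, and the toy agent are all assumptions introduced here to show how an initial state, a user prompt, and expected outcomes might fit together.

```python
import copy
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

# Hypothetical types for illustration only; not taken from coSTAR.
State = Dict[str, Any]
Check = Callable[[State, str], bool]  # (final state, agent reply) -> passed?


@dataclass
class Scenario:
    """One agent test: where it starts, what the user asks, what must hold after."""
    name: str
    initial_state: State
    user_prompt: str
    expected_outcomes: Dict[str, Check] = field(default_factory=dict)


def run_scenario(agent: Callable[[State, str], str], scenario: Scenario) -> Dict[str, bool]:
    """Run the agent on a copy of the initial state and evaluate every expected outcome."""
    state = copy.deepcopy(scenario.initial_state)  # keep the scenario reusable
    reply = agent(state, scenario.user_prompt)
    return {label: check(state, reply) for label, check in scenario.expected_outcomes.items()}


def toy_agent(state: State, prompt: str) -> str:
    """Stand-in for a real agent: mutates the state and returns a reply."""
    state["tables"].append("sales_summary")
    return "Created table sales_summary."


scenario = Scenario(
    name="create_summary_table",
    initial_state={"tables": ["sales"]},
    user_prompt="Create a summary table from sales.",
    expected_outcomes={
        "table_created": lambda s, r: "sales_summary" in s["tables"],
        "reply_mentions_table": lambda s, r: "sales_summary" in r,
    },
)

results = run_scenario(toy_agent, scenario)
print(results)
```

Expressing expected outcomes as named checks, rather than a single pass/fail flag, lets a failing run report which expectation broke.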

Tools used

MLflow · coSTAR · GEPA · MemAlign · MCP tools

Outcome

Databricks moved from two-week manual reviews to an automated test-and-refine loop that completes in hours. coSTAR has been adopted across multiple production agents, where it provides automated regression detection and reduces the human effort spent on review.

What failed first

The manual review loop failed predictably: without systematic tests, agents could regress silently, and manually QA-ing every change was unsustainable.
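To make the "silent regression" problem concrete, the sketch below compares per-scenario results from a baseline run against a candidate run and flags scenarios that used to pass and no longer do. It is an assumption-laden illustration, not Databricks' implementation: the `find_regressions` function and the scenario names are hypothetical.

```python
from typing import Dict, List


def find_regressions(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> List[str]:
    """Return scenarios that passed in the baseline but fail (or are missing) in the candidate."""
    return sorted(
        name
        for name, passed in baseline.items()
        if passed and not candidate.get(name, False)
    )


# Hypothetical pass/fail results keyed by scenario name.
baseline = {"create_summary_table": True, "explain_query": True, "fix_notebook": False}
candidate = {"create_summary_table": True, "explain_query": False, "fix_notebook": True}

regressions = find_regressions(baseline, candidate)
print(regressions)
```

A scenario missing from the candidate run is treated as a regression here, a deliberately conservative choice so that dropped tests do not hide failures.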

Results

Time saved: two-week manual reviews reduced to hours

Source

https://www.databricks.com/blog/costar-how-we-ship-ai-agents-databricks-fast-without-breaking-things


Grounding & classification

Source type: technical build writeup

27 fields verified against source quotes.

agentic workflow · ai agent · multi agent workflow · quality inspection · code diff pr · builder submitted · failure mode described · metric backed · named customer · production runtime claimed · tools described · workflow described · software · cycle time reduction · employee productivity · technical build writeup · quality assurance · ai draft human approval · monitor detect alert