quality_assurance · saas · workflow
Databricks coSTAR: automated AI agent testing cuts review cycles from two-week manual reviews to hours
Databricks' AI agent development relied on a slow, manual review-and-fix loop with no comprehensive automated test suite, making it impossible to iterate on agents with confidence as they grew in complexity and scope.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Scenario definition
Each agent test is defined as a scenario — a structured description of the initial state, the user prompt, and the expected outcomes.
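As a rough illustration of the scenario shape described above, the sketch below models a scenario as an initial state, a user prompt, and a list of expected-outcome checks. This is not coSTAR's actual schema or API: the `Scenario` class, the `evaluate` helper, the `stub_agent` function, and the table names are all hypothetical, chosen only to make the three parts concrete.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, Tuple

State = Dict[str, Any]
# An expected outcome pairs a human-readable description with a check on the final state.
Outcome = Tuple[str, Callable[[State], bool]]


@dataclass(frozen=True)
class Scenario:
    """Hypothetical scenario record: initial state, user prompt, expected outcomes."""
    name: str
    initial_state: State
    user_prompt: str
    expected_outcomes: Tuple[Outcome, ...]


def evaluate(scenario: Scenario, agent: Callable[[State, str], State]) -> Dict[str, bool]:
    """Run the agent on a copy of the initial state and check each expected outcome."""
    final_state = agent(dict(scenario.initial_state), scenario.user_prompt)
    return {desc: bool(check(final_state)) for desc, check in scenario.expected_outcomes}


def stub_agent(state: State, prompt: str) -> State:
    """Stand-in for a real agent (assumption): adds a table when asked to create one."""
    new_state = dict(state)
    if "create table" in prompt.lower():
        new_state["tables"] = list(state.get("tables", [])) + ["sales_summary"]
    return new_state


scenario = Scenario(
    name="create-summary-table",
    initial_state={"tables": ["sales"]},
    user_prompt="Create table sales_summary from sales",
    expected_outcomes=(
        ("summary table exists", lambda s: "sales_summary" in s["tables"]),
        ("source table untouched", lambda s: "sales" in s["tables"]),
    ),
)

results = evaluate(scenario, stub_agent)
for description, passed in results.items():
    print(f"{'PASS' if passed else 'FAIL'}: {description}")
```

Keeping the expected outcomes as named checks on the final state, rather than comparing against a single golden output, is one way to let a scenario tolerate harmless variation in how an agent reaches the result.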
Tools used
MLflow · coSTAR · GEPA · MemAlign · MCP tools
Outcome
Databricks moved from two-week manual reviews to automated test-and-refine in hours, adopting coSTAR across multiple production agents with tangible benefits, including automated regression detection and saved human effort.
What failed first
The manual review loop failed predictably: without systematic tests, agents could regress silently, and manually QA-ing every change was unsustainable.
Results
Time saved: two-week manual reviews
Source
https://www.databricks.com/blog/costar-how-we-ship-ai-agents-databricks-fast-without-breaking-things
Grounding & classification
Source type: technical build writeup
27 fields verified against source quotes.
agentic workflow · ai agent · multi agent workflow · quality inspection · code diff pr · builder submitted · failure mode described · metric backed · named customer · production runtime claimed · tools described · workflow described · software · cycle time reduction · employee productivity · technical build writeup · quality assurance · ai draft human approval · monitor detect alert