quality_assurance · saas · workflow
Monday.com builds an evals-driven development framework for its AI service workforce with LangSmith
Building a ReAct-based AI service workforce introduced cascading quality risks where a minor prompt deviation could compound across multi-step reasoning chains, yet most teams treat evaluation as a last-mile check rather than a Day 0 requirement.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Offline eval against golden dataset
The agent is run against curated golden datasets to test core logic and edge cases before any code change reaches production.
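A minimal sketch of what an offline golden-dataset check could look like, written in plain Python rather than against the LangSmith or Vitest APIs. The names `GoldenExample`, `run_offline_eval`, `toy_agent`, and the exact-match scorer are hypothetical and chosen for illustration; the source does not describe Monday.com's actual dataset format or scoring logic.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class GoldenExample:
    """One curated input with the output the agent is expected to produce."""
    prompt: str
    expected: str


def exact_match(output: str, expected: str) -> bool:
    # Hypothetical scorer: case- and whitespace-insensitive equality.
    return output.strip().lower() == expected.strip().lower()


def run_offline_eval(
    agent: Callable[[str], str],
    dataset: List[GoldenExample],
) -> Tuple[float, List[GoldenExample]]:
    """Run the agent over every golden example; return pass rate and failures."""
    if not dataset:
        raise ValueError("golden dataset must not be empty")
    failures = [ex for ex in dataset if not exact_match(agent(ex.prompt), ex.expected)]
    pass_rate = (len(dataset) - len(failures)) / len(dataset)
    return pass_rate, failures


def toy_agent(prompt: str) -> str:
    # Stand-in for the real agent (assumption): routes on a single keyword.
    return "reset_password" if "password" in prompt.lower() else "escalate"


GOLDEN = [
    GoldenExample("I forgot my password", "reset_password"),
    GoldenExample("My invoice is wrong", "escalate"),
    GoldenExample("Password expired, help", "reset_password"),
]

if __name__ == "__main__":
    rate, failed = run_offline_eval(toy_agent, GOLDEN)
    print(f"pass rate: {rate:.0%}, failures: {len(failed)}")
```

In a CI setting, a check like this would typically fail the build when the pass rate drops below an agreed threshold, which is what keeps a regression from reaching production.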
Tools used
LangSmith, LangGraph, Vitest, Documentation MCP, LangSmith MCP
Outcome
Monday.com achieved 8.7x faster evaluation feedback loops (from 162 seconds to 18 seconds), comprehensive coverage across hundreds of examples in minutes instead of hours, real-time production monitoring via Multi-Turn Evaluators, and evaluation logic managed as version-controlled production code with GitOps-style CI/CD deployment.
What failed first
Running offline evaluations serially created a major bottleneck in the development loop, forcing a tradeoff between testing depth and development pace.
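The serial bottleneck described above can be illustrated with a small sketch that runs the same examples one at a time and then concurrently. This is an assumption about the general technique, not a description of how Monday.com parallelized its suite: the source does not give that detail, and `slow_agent`, `evaluate_serial`, and `evaluate_concurrent` are hypothetical names. The simulated latency stands in for an I/O-bound model call.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (prompt, expected output)


def slow_agent(prompt: str) -> str:
    # Stand-in for a network-bound agent call (assumption): fixed 50 ms latency.
    time.sleep(0.05)
    return prompt.upper()


def evaluate_serial(agent: Callable[[str], str], examples: List[Example]) -> List[bool]:
    """Score examples one after another; total time grows linearly with count."""
    return [agent(prompt) == expected for prompt, expected in examples]


def evaluate_concurrent(
    agent: Callable[[str], str],
    examples: List[Example],
    max_workers: int = 8,
) -> List[bool]:
    """Score examples in a thread pool; pool.map preserves input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda ex: agent(ex[0]) == ex[1], examples))


EXAMPLES: List[Example] = [(f"case {i}", f"CASE {i}") for i in range(8)]

if __name__ == "__main__":
    start = time.perf_counter()
    serial = evaluate_serial(slow_agent, EXAMPLES)
    serial_s = time.perf_counter() - start

    start = time.perf_counter()
    concurrent = evaluate_concurrent(slow_agent, EXAMPLES)
    concurrent_s = time.perf_counter() - start

    print(f"serial: {serial_s:.2f}s, concurrent: {concurrent_s:.2f}s")
    print("same results:", serial == concurrent)
```

Threads suit this case because the work is dominated by waiting on I/O. The achievable speedup depends on provider rate limits and on how many examples can safely be in flight at once.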
Results
Time saved: 162.35s
Volume: 8.7x faster
Grounding & classification
Source type: technical build writeup
29 fields verified against source quotes, 4 dropped as unverifiable.
agentic workflow, ai agent, quality inspection, chat transcript, knowledge base, support ticket, failure mode described, metric backed, production runtime claimed, tools described, workflow described, software, cycle time reduction, employee productivity, technical build writeup, customer support, it support, quality assurance, autonomous resolution