quality_assurance · saas · workflow

Monday.com builds an evals-driven development framework for its AI service workforce with LangSmith

Building a ReAct-based AI service workforce introduced cascading quality risks: a minor prompt deviation could compound across multi-step reasoning chains. Yet most teams treat evaluation as a last-mile check rather than a Day 0 requirement.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear under “Tools used” below.

Stage 1 · Offline eval against golden dataset

The agent is run against curated golden datasets to test core logic and edge cases before any code change reaches production.
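As a rough illustration of this stage, the sketch below runs an agent over a small golden dataset and reports a pass rate. It is a minimal, standard-library-only sketch and not Monday.com's or LangSmith's implementation: `GoldenExample`, `toy_agent`, `exact_match`, `run_offline_eval`, and every example question and answer are hypothetical names and data invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class GoldenExample:
    """One curated input paired with the answer the agent is expected to give."""
    question: str
    expected: str


def toy_agent(question: str) -> str:
    """Hypothetical stand-in for the real agent: a canned lookup table."""
    canned = {
        "How do I reset my password?": "Use the reset link on the login page.",
        "": "Please describe your issue.",
    }
    return canned.get(question, "I am not sure.")


def exact_match(actual: str, expected: str) -> bool:
    """Simplest possible evaluator: case-insensitive exact match."""
    return actual.strip().lower() == expected.strip().lower()


def run_offline_eval(
    agent: Callable[[str], str],
    dataset: List[GoldenExample],
    evaluator: Callable[[str, str], bool] = exact_match,
) -> Tuple[float, List[GoldenExample]]:
    """Run the agent on every example; return (pass rate, failed examples)."""
    if not dataset:
        raise ValueError("golden dataset must not be empty")
    failures = [
        ex for ex in dataset if not evaluator(agent(ex.question), ex.expected)
    ]
    return 1.0 - len(failures) / len(dataset), failures


# Made-up golden dataset: one core case, one edge case, one known gap.
GOLDEN = [
    GoldenExample("How do I reset my password?", "Use the reset link on the login page."),
    GoldenExample("", "Please describe your issue."),  # edge case: empty input
    GoldenExample("What are your support hours?", "Support is available around the clock."),
]

if __name__ == "__main__":
    rate, failed = run_offline_eval(toy_agent, GOLDEN)
    print(f"pass rate: {rate:.2f}")
    for ex in failed:
        print(f"FAILED: {ex.question!r}")
```

In a real pipeline the exact-match evaluator would likely be replaced by something more tolerant (for example a rubric or model-graded check), and the run would be wired into CI so that a drop in pass rate blocks the change.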

Tools used

LangSmith · LangGraph · Vitest · Documentation MCP · LangSmith MCP

Outcome

Monday.com achieved 8.7x faster evaluation feedback loops (from 162 seconds to 18 seconds), comprehensive coverage across hundreds of examples in minutes instead of hours, real-time production monitoring via Multi-Turn Evaluators, and evaluation logic managed as version-controlled production code with GitOps-style CI/CD deployment.

What failed first

Running offline evaluations serially created a major bottleneck in the development loop, forcing a tradeoff between testing depth and development pace.
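To make the serial bottleneck concrete, the sketch below times the same set of simulated evaluations run one after another and then run concurrently with a bounded number in flight. It is a generic illustration using Python's `asyncio`, not the setup described in the source: `evaluate_example`, `run_serial`, `run_concurrent`, `timed`, the 0.05-second simulated latency, and the concurrency limit of 10 are all assumptions chosen for the example.

```python
import asyncio
import time
from typing import Awaitable, Callable, List, Tuple


async def evaluate_example(example_id: int, latency_s: float = 0.05) -> bool:
    """Hypothetical evaluator; the sleep stands in for a slow, I/O-bound LLM call."""
    await asyncio.sleep(latency_s)
    return True


async def run_serial(ids: List[int]) -> List[bool]:
    """Evaluate one example at a time: total time grows linearly with dataset size."""
    return [await evaluate_example(i) for i in ids]


async def run_concurrent(ids: List[int], max_concurrency: int = 10) -> List[bool]:
    """Evaluate many examples at once, capped so the backend is not flooded."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(i: int) -> bool:
        async with semaphore:
            return await evaluate_example(i)

    # gather preserves input order, so results line up with ids.
    return list(await asyncio.gather(*(bounded(i) for i in ids)))


def timed(make_coro: Callable[[], Awaitable[List[bool]]]) -> Tuple[List[bool], float]:
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(make_coro())
    return result, time.perf_counter() - start


if __name__ == "__main__":
    ids = list(range(20))
    _, serial_s = timed(lambda: run_serial(ids))
    _, concurrent_s = timed(lambda: run_concurrent(ids))
    print(f"serial: {serial_s:.2f}s, concurrent: {concurrent_s:.2f}s")
```

The concurrency cap is a design choice worth keeping even in a sketch: unbounded fan-out tends to trade the serial bottleneck for rate-limit errors from the model provider.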

Results

Baseline evaluation time: 162.35s (reduced to about 18s)

Speedup: 8.7x faster

Source

https://blog.langchain.com/customers-monday/


Grounding & classification

Source type: technical build writeup

29 fields verified against source quotes, 4 dropped as unverifiable.

agentic workflow · ai agent · quality inspection · chat transcript · knowledge base · support ticket · failure mode described · metric backed · production runtime claimed · tools described · workflow described · software · cycle time reduction · employee productivity · technical build writeup · customer support · it support · quality assurance · autonomous resolution