Quality assurance AI workflow patterns
Verified production AI workflows in quality assurance — including named customers, verbatim metrics, and vendor case sources. The sub-patterns below open into the common implementation shape and first-deployment failures for each.
Across 227 documented quality assurance cases
Recurring tools
GitHub Copilot 15 · Labelbox 15 · LLMs 14 · MCP 12 · Claude Code 11 · Cursor 10 · GitHub 10 · Amazon Bedrock 9 · Braintrust 9 · Claude 9 · RAG 9 · Slack 9
What fails first / common problems
Visual interpretation of thermography had a significantly lower positive predictive value than the AI approach, and mammography — the standard alternative — was radiation-based, restricted in frequency of use, and prohibitively expensive…
— NIRAMAI Health Analytix delivers AI-powered breast cancer detection at 90% accuracy with Google Cloud
Relying on fragmented tools from multiple providers for billing, inference, and security monitoring proved inefficient for the lean non-profit team.
— Day of AI Australia scales AI literacy simulation to 330,000+ students using Google Cloud
Relying on third-party tools like Gerrit and PullApprove for code review left Duolingo's primary repositories with widely varying cultures and pull request processes, creating inefficiency and preventing developers from moving easily bet…
— Duolingo boosts developer speed up to 25% with GitHub Copilot and Codespaces
On-premises infrastructure imposed lengthy build queues with non-elastic shared runners that caused cross-team instability and build failures.
— General Motors consolidates 150,000 repositories and deploys GitHub Copilot to accelerate secure software delivery at scale
The previous call monitoring system had no screen capture element and required coaches to use multiple systems.
— Yorkshire Water's Loop subsidiary improves customer service quality with Verint Quality Management, cutting repeat calls and billing complaints
Representative reported outcomes
4x faster · 14x more
FNB Evaluates 14 Times More Interactions Using Verint Quality Bot
4,000+ hours · over 50,000 · $400,000
New York Area Health System achieves 6x ROI and closes 4,000+ care gaps with Notable's AI chart scrubbing
90% · more than US$200,000
NIRAMAI Health Analytix delivers AI-powered breast cancer detection at 90% accuracy with Google Cloud
under four months · over 330,000
Day of AI Australia scales AI literacy simulation to 330,000+ students using Google Cloud
from three hours to one · at least 25%
Duolingo boosts developer speed up to 25% with GitHub Copilot and Codespaces
Reported by the source case, as published — not independently verified.
Featured workflows in this category
A curated selection — highest-trust cases with the richest evidence (first-deployment failures documented, metrics on record). The full quality assurance corpus is reachable via search.
Sharper Shape builds streamlined annotation pipeline with Labelbox to detect utility defects
Labelbox
Sharper Shape cut labeling costs by as much as 50%, sped up model training by over 10X, and can now concentrate on model buildi…
Deque uses Labelbox Model Diagnostics and Catalog to improve accessibility ML model performance by 5%+ and cut labeling spend by over 50%
Labelbox → Model Diagnostics → Catalog
By filtering out one-third of less trustworthy data points and targeting data collection via Model Diagnostics and Catalog, Deq…
DoorDash builds a multi-agent AI code reviewer with 60% engineer acceptance rate
GitHub → Slack
The agent reviews more than 10,000 pull requests per week across 56 repositories, with 60…
Coinbase builds a QA AI agent to 10x testing effort at 1/10 the cost
qa-ai-agent → browser-use → MongoDB → BrowserStack
The qa-ai-agent detects 300% more bugs in the same timeframe at 86% lower cost than manual testing, with new tests integrable i…
Cloudflare builds CI-native multi-agent AI code review system across 48,095 merge requests
OpenCode → Claude Opus 4.7 → GPT-5.4 → Claude Sonnet 4.6
In its first month the system completed 131,246 review runs across 48,095 merge requests in 5,169 repositories, with a median r…
uReview: Scalable, Trustworthy GenAI for Code Review at Uber
uReview → Commenter → Fixer → Claude-4-Sonnet
uReview is deployed across all six of Uber's monorepos, analyzes over 90% of the approximately 65,000 weekly diffs, maintains a…
LangChain improves coding agent 13.7 points on Terminal Bench 2.0 through harness engineering
deepagents-cli → LangSmith → Terminal Bench 2.0 → Harbor
By changing only the harness without modifying the underlying model, LangChain improved their coding agent from 52…
Minions: Stripe's fully unattended one-shot coding agents merge over a thousand pull requests per week
Claude → Cursor → goose → MCP
Over a thousand pull requests are merged per week at Stripe that are completely minion-produced with no human-written code, ena…