
GitHub Copilot agent-driven development enables Copilot Applied Science team to ship 11 agents in under three days

Analyzing coding agent trajectories across standardized benchmark runs required reading hundreds of thousands of lines of JSON per day. That was impossible to do manually, and it forced the researcher into a repetitive loop of using Copilot to surface patterns before investigating them.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Benchmark trajectory analysis trigger
Benchmark runs against standardized evaluation datasets produce trajectory files containing agent thought processes and actions that need analysis.
Tools used
GitHub Copilot, Copilot CLI, Claude Opus 4.6, VSCode, Copilot SDK, MCP servers, Copilot Code Review
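As a rough illustration of how a trajectory file might be condensed before a person reads it, here is a minimal Python sketch. The source does not describe the trajectory schema or the eval-agents code, so the `steps`, `thought`, and `action` field names and the `summarize_trajectory` helper are all hypothetical.

```python
import json
from collections import Counter


def summarize_trajectory(raw_json: str) -> dict:
    """Condense one trajectory into a step count and per-action counts.

    Assumes a hypothetical schema (not taken from the source):
    {"steps": [{"thought": str, "action": str}, ...]}
    """
    trajectory = json.loads(raw_json)
    steps = trajectory.get("steps", [])
    # Count how often each action type appears across the run.
    action_counts = Counter(step.get("action", "unknown") for step in steps)
    return {
        "num_steps": len(steps),
        "action_counts": dict(action_counts),
    }


# Hypothetical example data standing in for one benchmark trajectory.
example = json.dumps({
    "steps": [
        {"thought": "inspect repo", "action": "read_file"},
        {"thought": "apply fix", "action": "edit_file"},
        {"thought": "verify", "action": "read_file"},
    ]
})

summary = summarize_trajectory(example)
print(summary)
```

A summary like this only narrows down where to look; the agents in the source presumably do far more than count actions.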
Outcome

The eval-agents tool and agent-driven development methodology enabled five contributors to ship 11 new agents, four new skills, and a new concept in under three days, a change of +28,858/-2,884 lines of code across 345 files. It also reduced the trajectory output the researcher had to read from hundreds of thousands of lines to a few hundred.

Results
Time saved: less than three days
Volume: 11
Source

https://github.blog/ai-and-ml/github-copilot/agent-driven-development-in-copilot-applied-science/

Grounding & classification
Source type: technical build writeup
33 fields verified against source quotes.
agentic workflow, ai agent, code generation, code diff pr, builder submitted, human review described, metric backed, named customer, production runtime claimed, tools described, workflow described, software, employee productivity, throughput increase, time saved, technical build writeup, back office ops, quality assurance, agentic task execution, ai draft human approval