quality_assurance · saas · workflow
Spotify achieves predictable background coding agent results through verification loops and LLM-as-judge (Honk, Part 3)
Background coding agents that run unsupervised across thousands of software components risk producing PRs that fail CI or are functionally incorrect, which erodes engineer trust and creates expensive manual review overhead.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Prompt triggers agent session
A prompt is provided to the background coding agent to perform a code change.
Tools used
Claude Code · MCP · Maven
Outcome
With verification loops and an LLM judge, Spotify's background coding agents solve increasingly complex tasks with a high degree of reliability across thousands of agent sessions, with the judge vetoing about a quarter of sessions and agents course-correcting half the time when vetoed.
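The loop described above can be sketched roughly as follows. This is a minimal illustration, not Spotify's implementation: the names `run_session`, `Verdict`, `agent`, `verify`, and `judge`, the ordering of checks, and the `max_attempts` limit are all assumptions made for the example. The idea is that deterministic verification (for example a build and test run) and an LLM judge each get a chance to veto a change, and a veto is fed back to the agent so it can course-correct.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Verdict:
    """Result of a check: approved or vetoed, with feedback for the agent."""
    approved: bool
    feedback: str = ""


def run_session(
    prompt: str,
    agent: Callable[[str, Optional[str]], str],
    verify: Callable[[str], Verdict],
    judge: Callable[[str, str], Verdict],
    max_attempts: int = 3,  # hypothetical retry budget
) -> Optional[str]:
    """Return an accepted diff, or None if every attempt was vetoed."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        # The agent sees the original prompt plus any feedback from a veto.
        diff = agent(prompt, feedback)

        # Deterministic checks first (hypothetically: build and tests).
        result = verify(diff)
        if not result.approved:
            feedback = result.feedback
            continue

        # Then the judge compares the diff against the original prompt.
        result = judge(prompt, diff)
        if result.approved:
            return diff
        feedback = result.feedback
    return None
```

In this sketch a vetoed session is not discarded immediately. The agent gets another attempt with the judge's feedback, which is one way to model the course-correction behavior reported above.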
What failed first
Without verification loops, some agents were too ambitious, making changes outside the scope of the prompt such as refactoring code or disabling flaky tests, and often produced code that simply doesn't work.
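Two of the failure modes named above, edits outside the prompt's scope and disabled tests, can be partly caught with simple deterministic checks before any judge is involved. The sketch below is an assumption-laden illustration rather than anything described in the source: the function names, the allowed-path patterns, and the list of "disable" markers are all hypothetical.

```python
import fnmatch
import re

# Hypothetical markers suggesting a test was switched off rather than fixed.
DISABLE_MARKERS = re.compile(r"@Disabled|@Ignore|pytest\.mark\.skip")


def out_of_scope_files(changed_files, allowed_patterns):
    """Return changed files that match none of the allowed glob patterns."""
    return [
        path
        for path in changed_files
        if not any(fnmatch.fnmatchcase(path, pat) for pat in allowed_patterns)
    ]


def disabled_test_lines(diff_text):
    """Return added diff lines that contain a test-disabling marker."""
    return [
        line
        for line in diff_text.splitlines()
        # Added lines start with "+"; "+++" is the file header, not content.
        if line.startswith("+")
        and not line.startswith("+++")
        and DISABLE_MARKERS.search(line)
    ]
```

For example, a dependency-bump task might allow only `pom.xml`, so that any other touched file is flagged for the judge or for a human reviewer.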
Results
Volume: half the time
Grounding & classification
Source type: technical build writeup
21 fields verified against source quotes.
agentic workflow · ai agent · code generation · code diff pr · failure mode described · metric backed · named customer · production runtime claimed · tools described · workflow described · software · accuracy improvement · error reduction · technical build writeup · quality assurance · agentic task execution