quality_assurance · saas · workflow

Spotify achieves predictable background coding agent results through verification loops and LLM-as-judge (Honk, Part 3)

Background coding agents that run unsupervised across thousands of software components risk producing PRs that fail CI or are functionally incorrect, which erodes engineer trust and creates expensive manual review overhead.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack.

Stage 1 · Prompt triggers agent session

A prompt is provided to the background coding agent to perform a code change.

Tools used

Claude Code · MCP · Maven

Outcome

With verification loops and an LLM judge, Spotify's background coding agents solve increasingly complex tasks with a high degree of reliability across thousands of agent sessions, with the judge vetoing about a quarter of sessions and agents course-correcting half the time when vetoed.
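The loop described above can be sketched roughly as follows. This is a minimal illustration, not Spotify's implementation: the names `run_session`, `Verdict`, `fake_agent`, `fake_build`, and `fake_judge`, the retry limit, and the callable signatures are all assumptions made for the example. The idea is that deterministic verifiers (for example a build or test run) check the change first, the LLM judge then compares the change against the prompt, and any failure is fed back to the agent so it can course-correct.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    ok: bool
    feedback: str = ""


# Hypothetical signatures, chosen only for this sketch.
Agent = Callable[[str, Optional[str]], str]  # (prompt, feedback) -> proposed diff
Verifier = Callable[[str], Verdict]          # diff -> build/test verdict
Judge = Callable[[str, str], Verdict]        # (prompt, diff) -> judge verdict


def run_session(prompt: str, agent: Agent, verifiers: List[Verifier],
                judge: Judge, max_attempts: int = 3) -> Optional[str]:
    """Return an accepted diff, or None if every attempt was rejected."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        diff = agent(prompt, feedback)

        # Deterministic checks run first; stop at the first failure.
        failed = next((v for v in (check(diff) for check in verifiers)
                       if not v.ok), None)
        if failed is not None:
            feedback = failed.feedback
            continue

        # The judge can veto a change that builds but strays from the prompt.
        verdict = judge(prompt, diff)
        if verdict.ok:
            return diff
        feedback = verdict.feedback
    return None


# Stubs standing in for a real agent, build step, and LLM judge.
def fake_agent(prompt: str, feedback: Optional[str]) -> str:
    return "disable flaky test" if feedback is None else "bump dependency"


def fake_build(diff: str) -> Verdict:
    return Verdict(True)


def fake_judge(prompt: str, diff: str) -> Verdict:
    if "disable" in diff:
        return Verdict(False, "out of scope: do not disable tests")
    return Verdict(True)


result = run_session("Bump dependency", fake_agent, [fake_build], fake_judge)
single_try = run_session("Bump dependency", fake_agent, [fake_build],
                         fake_judge, max_attempts=1)
```

With these stubs, the first attempt is vetoed for disabling a test, the feedback is passed back, and the second attempt is accepted. With `max_attempts=1` the session ends without a PR, which corresponds to a veto the agent did not recover from.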

What failed first

Without verification loops, some agents were too ambitious: they made changes outside the scope of the prompt, such as refactoring code or disabling flaky tests, and often produced code that simply did not work.

Results

Volume: half the time (how often agents course-correct after a judge veto)
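To see how the two reported figures combine, here is a rough back-of-the-envelope calculation. It assumes "about a quarter" is exactly 0.25 and "half the time" is exactly 0.5, and that a course-corrected session ends up passing the judge. Neither exact value appears in the source, so the outputs are illustrative only.

```python
# Assumed point values for the approximate figures in the writeup.
veto_rate = 0.25      # share of sessions the judge vetoes ("about a quarter")
recovery_rate = 0.5   # share of vetoed sessions where the agent course-corrects

passed_first_time = 1 - veto_rate                    # never vetoed
recovered = veto_rate * recovery_rate                # vetoed, then corrected
unrecovered = veto_rate * (1 - recovery_rate)        # vetoed, not corrected

print(f"never vetoed: {passed_first_time:.1%}")
print(f"vetoed, then corrected: {recovered:.1%}")
print(f"vetoed, not corrected: {unrecovered:.1%}")
```

Under these assumptions, roughly one session in eight is rescued by the feedback loop and roughly one in eight is stopped before it reaches a human reviewer.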

Source

https://engineering.atspotify.com/2025/12/feedback-loops-background-coding-agents-part-3

Grounding & classification

Source type: technical build writeup

21 fields verified against source quotes.

agentic workflow · ai agent · code generation · code diff pr · failure mode described · metric backed · named customer · production runtime claimed · tools described · workflow described · software · accuracy improvement · error reduction · technical build writeup · quality assurance · agentic task execution