quality_assurance · workflow

Netflix evaluates show synopses at scale using LLM-as-a-Judge, achieving 85%+ agreement with creative writers

Netflix hosts hundreds of thousands of synopses—often with multiple variants per show—making manual quality validation impossible at scale, while poor synopses directly drive member abandonment.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack.

Stage 1 · Synopsis quality evaluation triggered

Hundreds of thousands of synopses, often with multiple variants per show, require quality evaluation at scale.

Tools used

LLM · Agents-as-a-Judge

Outcome

The final LLM-as-a-Judge system achieves 85%+ agreement with creative writers, and its scores correlate with key streaming metrics. This lets teams identify and fix quality issues weeks or months before a show debuts, and the system has been widely adopted in the Netflix synopsis authoring workflow.
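To make the agreement figure concrete, here is a minimal sketch of how an agreement rate between judge and writer labels could be computed. It assumes simple percent agreement on pass/fail labels; the source does not specify the exact metric, and the function name and sample labels are hypothetical.

```python
from typing import Sequence


def agreement_rate(judge_labels: Sequence[str], writer_labels: Sequence[str]) -> float:
    """Fraction of synopses where the judge's label equals the writer's label."""
    if len(judge_labels) != len(writer_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labeled synopsis")
    matches = sum(j == w for j, w in zip(judge_labels, writer_labels))
    return matches / len(judge_labels)


# Hypothetical labels for illustration only; not data from the source.
judge = ["pass", "fail", "pass", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
writers = ["pass", "fail", "pass", "fail", "fail", "pass", "pass", "fail", "pass", "pass"]

rate = agreement_rate(judge, writers)
print(f"agreement: {rate:.0%}")  # the two lists differ in one of ten positions
```

Raw percent agreement is easy to read but does not correct for chance; a chance-corrected statistic may be preferable when one label dominates.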

What failed first

An initial approach of using a single prompt to evaluate all quality criteria overloaded the LLM and yielded poor performance; early human calibration also showed low instance-level agreement due to the subjectivity of the task.
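A common alternative to one overloaded prompt is to issue a separate, narrowly scoped prompt per criterion and collect the results. The sketch below illustrates that pattern only. It is an assumption about the remedy, not a description of Netflix's implementation: the criteria, prompt wording, and the `call_llm` callable are all hypothetical, and a stub stands in for a real model client.

```python
from typing import Callable, Dict

# Hypothetical criteria; the source does not list the actual rubric.
CRITERIA: Dict[str, str] = {
    "clarity": "Is the synopsis easy to follow?",
    "tone": "Does the tone fit the show?",
    "spoilers": "Does the synopsis avoid giving away the plot?",
}


def build_prompt(criterion: str, question: str, synopsis: str) -> str:
    """Build a prompt that asks the judge about exactly one criterion."""
    return (
        f"You are evaluating one criterion only: {criterion}.\n"
        f"{question}\n"
        f"Synopsis: {synopsis}\n"
        "Answer with exactly 'pass' or 'fail'."
    )


def evaluate(synopsis: str, call_llm: Callable[[str], str]) -> Dict[str, bool]:
    """Run one judge call per criterion and return a pass/fail map."""
    results: Dict[str, bool] = {}
    for criterion, question in CRITERIA.items():
        answer = call_llm(build_prompt(criterion, question, synopsis))
        results[criterion] = answer.strip().lower() == "pass"
    return results


def fake_llm(prompt: str) -> str:
    """Stub judge for demonstration: fails only the 'spoilers' criterion."""
    first_line = prompt.split("\n")[0]
    return "fail" if "spoilers" in first_line else "pass"


scores = evaluate("A detective returns to her hometown.", fake_llm)
print(scores)
```

Keeping each call to a single criterion also makes it possible to calibrate and audit criteria independently against human reviewers.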

Results

Volume: 85%+

Source

https://netflixtechblog.com/evaluating-netflix-show-synopses-with-llm-as-a-judge-6269251e6f28


Grounding & classification

Source type: technical build writeup

23 fields verified against source quotes.

agentic workflow · quality inspection · summarization · product catalog · failure mode described · human review described · metric backed · named customer · production runtime claimed · tools described · workflow described · media · accuracy improvement · employee productivity · technical build writeup · marketing ops · quality assurance · human review queue