Netflix evaluates show synopses at scale using LLM-as-a-Judge, achieving 85%+ agreement with creative writers
Netflix hosts hundreds of thousands of synopses—often with multiple variants per show—making manual quality validation impossible at scale, while poor synopses directly drive member abandonment.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Synopsis quality evaluation triggered
Hundreds of thousands of synopses, usually with multiple variants per show, require quality evaluation at scale.
Tools used
LLM · Agents-as-a-Judge
Outcome
The final LLM-as-a-Judge system achieves 85%+ agreement with creative writers, and its scores correlate with key streaming metrics. This lets the team identify and fix quality issues weeks or months before a show debuts. The system is widely adopted in the Netflix synopsis authoring workflow.
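To make the agreement figure concrete, here is a minimal sketch of one way to compute judge-versus-writer agreement. It assumes agreement means the fraction of synopses where the judge and the writer assign the same label; the source does not say how Netflix measures it. The function name `agreement_rate` and the sample labels are hypothetical.

```python
from typing import Sequence


def agreement_rate(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Return the fraction of items where judge and human labels match.

    Assumption: simple percent agreement. The source does not state
    which agreement measure was actually used.
    """
    if len(judge_labels) != len(human_labels):
        raise ValueError("label sequences must have the same length")
    if not judge_labels:
        raise ValueError("need at least one labeled item")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Illustrative labels for five synopses (made-up data, not Netflix's).
judge = ["pass", "fail", "pass", "pass", "fail"]
human = ["pass", "fail", "fail", "pass", "fail"]

rate = agreement_rate(judge, human)
print(f"{rate:.0%}")  # 4 of 5 labels match, so this prints 80%
```

Percent agreement is easy to read but does not correct for agreement that would occur by chance, so a chance-corrected statistic may be a better fit when the labels are heavily imbalanced.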
What failed first
An initial approach that used a single prompt to evaluate all quality criteria overloaded the LLM and performed poorly. Early human calibration also showed low instance-level agreement because the task is subjective.
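One common alternative to a single overloaded prompt is to issue one focused prompt per criterion and collect the verdicts. The sketch below illustrates that pattern only; the source does not describe the final prompt design here. The criteria names, the prompt wording, and the `call_llm` callable are all hypothetical, and `fake_llm` is a stand-in so the example runs without a model.

```python
from typing import Callable, Dict

# Hypothetical criteria; the source does not list the actual ones.
CRITERIA: Dict[str, str] = {
    "clarity": "Is the synopsis easy to understand?",
    "tone": "Does the tone fit the show?",
    "precision": "Does it avoid vague or misleading statements?",
}


def build_prompt(criterion: str, question: str, synopsis: str) -> str:
    """Build a prompt that asks about exactly one criterion."""
    return (
        f"Criterion: {criterion}\n"
        f"Question: {question}\n"
        f"Synopsis: {synopsis}\n"
        "Answer with PASS or FAIL."
    )


def judge_synopsis(synopsis: str, call_llm: Callable[[str], str]) -> Dict[str, bool]:
    """Run one judge call per criterion and map each to a pass/fail verdict."""
    results: Dict[str, bool] = {}
    for name, question in CRITERIA.items():
        reply = call_llm(build_prompt(name, question, synopsis))
        results[name] = reply.strip().lower().startswith("pass")
    return results


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call: fails only the 'tone' criterion."""
    return "FAIL" if "Criterion: tone" in prompt else "PASS"


scores = judge_synopsis("A detective returns to her hometown.", fake_llm)
print(scores)
```

Passing the model call in as a function keeps the per-criterion loop testable without network access and lets each criterion be calibrated against human labels separately.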
Results
Volume: 85%+ (agreement with creative writers)
Grounding & classification
Source type: technical build writeup
23 fields verified against source quotes.
agentic workflow · quality inspection · summarization · product catalog · failure mode described · human review described · metric backed · named customer · production runtime claimed · tools described · workflow described · media · accuracy improvement · employee productivity · technical build writeup · marketing ops · quality assurance · human review queue