quality_assurance · workflow
Netflix builds SMAD system to detect speech and music in audio content at production scale
Netflix needs to systematically classify speech, music, and effects regions across its large audio catalog to enable production and delivery tasks. Collecting fine-resolution, frame-level labels for this is costly and labor-intensive, and it is further constrained by copyright restrictions.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Audio file received
Audio files are delivered from post-production studios in standard 5.1 surround format at a 48 kHz sampling rate.
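As an illustration of this intake stage, the sketch below checks that an incoming file has six channels (5.1 surround) and a 48 kHz sampling rate. It is a minimal example under stated assumptions, not Netflix's implementation: the function names `check_delivery_format` and `make_silent_wav` are hypothetical, and it assumes plain PCM WAV input readable by Python's standard `wave` module, which real studio deliverables may not be.

```python
import io
import wave

# Assumed delivery spec, taken from the stage description above.
EXPECTED_CHANNELS = 6          # 5.1 surround = six discrete channels
EXPECTED_SAMPLE_RATE = 48_000  # Hz


def check_delivery_format(wav_file):
    """Return a list of format problems; an empty list means the file conforms.

    `wav_file` may be a path or a binary file-like object (hypothetical helper).
    """
    problems = []
    with wave.open(wav_file, "rb") as wf:
        channels = wf.getnchannels()
        sample_rate = wf.getframerate()
    if channels != EXPECTED_CHANNELS:
        problems.append(f"expected {EXPECTED_CHANNELS} channels, got {channels}")
    if sample_rate != EXPECTED_SAMPLE_RATE:
        problems.append(f"expected {EXPECTED_SAMPLE_RATE} Hz, got {sample_rate} Hz")
    return problems


def make_silent_wav(channels, sample_rate, frames=480):
    """Build an in-memory 16-bit PCM WAV of silence, for demonstration only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wf.setframerate(sample_rate)
        wf.writeframes(b"\x00\x00" * channels * frames)
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(check_delivery_format(make_silent_wav(6, 48_000)))  # conforming file
    print(check_delivery_format(make_silent_wav(2, 44_100)))  # stereo, 44.1 kHz
```

A check like this would typically run before any classification step, so that files in an unexpected layout are flagged early rather than passed to the model.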
Tools used
CRNN · Python
Outcome
Netflix deployed SMAD using a large catalog dataset with noisy labels and a convolutional recurrent neural network (CRNN) architecture. The system enables hundreds of audio production and delivery tasks daily across global teams, with substantial productivity returns at scale.
Results
Time saved: 1608 hours
Grounding & classification
Source type: technical build writeup
14 fields verified against source quotes.
named customer · production runtime claimed · source backed · tools described · workflow described · media · employee productivity · technical build writeup · quality assurance · extract classify route