
Netflix builds SMAD system to detect speech and music in audio content at production scale

Netflix needs to systematically classify speech, music, and effects regions across its large audio catalog to enable production and delivery tasks, but collecting fine-resolution frame-level labels is costly, labor-intensive, and restricted by copyright limitations.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Audio file received
Audio files are delivered from post-production studios in standard 5.1 surround format at 48 kHz sampling rate.
Tools used
CRNN, Python
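The Stage 1 input format (5.1 surround, 48 kHz) can be checked before any classification runs. The following is a minimal sketch, not Netflix's implementation: it assumes the audio arrives as a plain PCM WAV file readable by Python's standard `wave` module, and the function name `check_audio_format` and the two constants are hypothetical names chosen for illustration.

```python
import wave

# Assumed delivery format, taken from the stage description above.
EXPECTED_CHANNELS = 6          # 5.1 surround: five full-range channels plus one LFE channel
EXPECTED_SAMPLE_RATE = 48_000  # Hz


def check_audio_format(path):
    """Return a list of mismatches between a WAV file and the expected format.

    An empty list means the file matches. `path` is a string path to a WAV file.
    (Hypothetical helper for illustration only.)
    """
    problems = []
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
    if channels != EXPECTED_CHANNELS:
        problems.append(f"expected {EXPECTED_CHANNELS} channels, found {channels}")
    if sample_rate != EXPECTED_SAMPLE_RATE:
        problems.append(f"expected {EXPECTED_SAMPLE_RATE} Hz, found {sample_rate} Hz")
    return problems
```

Calling `check_audio_format("episode.wav")` on a conforming file would return an empty list; the file name here is a placeholder. Multichannel deliverables are often stored with the WAVE_FORMAT_EXTENSIBLE header, which the `wave` module reads only in recent Python versions, so a production pipeline would likely rely on a dedicated audio library instead.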
Outcome

Netflix deployed SMAD using a large noisy-labeled catalog dataset and a CRNN architecture, enabling hundreds of audio production and delivery tasks daily across global teams with substantial productivity returns at scale.

Results
Time saved: 1608 hours
Source

https://netflixtechblog.com/detecting-speech-and-music-in-audio-content-afd64e6a5bf8


Grounding & classification
Source type: technical build writeup
14 fields verified against source quotes.
named customer, production runtime claimed, source backed, tools described, workflow described, media, employee productivity, technical build writeup, quality assurance, extract classify route