
Netflix builds SMAD system to detect speech and music in audio content at production scale

Netflix needs to systematically classify speech, music, and effects regions across its large audio catalog to support production and delivery tasks. Collecting fine-resolution, frame-level labels, however, is costly, labor-intensive, and constrained by copyright.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Audio file received

Audio files are delivered from post-production studios in standard 5.1 surround format at 48 kHz sampling rate.
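As a rough illustration of this intake stage, the sketch below checks that a delivered file has six channels (5.1 surround) and a 48 kHz sampling rate before it moves on. It is not Netflix's code. The function name `check_audio_format` is hypothetical, and the sketch assumes the delivery is a PCM WAV file that Python's standard `wave` module can read, which the source does not state.

```python
import wave

# Values taken from the stage description; 5.1 surround means six channels
# (five full-range channels plus one low-frequency effects channel).
EXPECTED_CHANNELS = 6
EXPECTED_SAMPLE_RATE = 48_000  # Hz


def check_audio_format(path):
    """Return a list of format problems; an empty list means the file matches.

    Hypothetical helper. Assumes a PCM WAV container readable by `wave`.
    """
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()

    problems = []
    if channels != EXPECTED_CHANNELS:
        problems.append(
            f"expected {EXPECTED_CHANNELS} channels, found {channels}"
        )
    if sample_rate != EXPECTED_SAMPLE_RATE:
        problems.append(
            f"expected {EXPECTED_SAMPLE_RATE} Hz, found {sample_rate} Hz"
        )
    return problems
```

A caller might reject or flag any file for which the returned list is non-empty, so that downstream stages only see audio in the expected layout.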

Tools used

CRNN, Python

Outcome

Netflix deployed SMAD using a large noisy-labeled catalog dataset and a CRNN architecture, enabling hundreds of audio production and delivery tasks daily across global teams with substantial productivity returns at scale.

Results

Time saved: 1608 hours

Source

https://netflixtechblog.com/detecting-speech-and-music-in-audio-content-afd64e6a5bf8


Grounding & classification

Source type: technical build writeup

14 fields verified against source quotes.

Tags: named customer, production runtime claimed, source backed, tools described, workflow described, media, employee productivity, technical build writeup, quality assurance, extract classify route