
Netflix builds an internal LLM post-training framework scaling from SFT to on-policy RL

At Netflix scale, post-training LLMs became an engineering problem as much as a modeling one: researchers had to manage complex data pipelines, distributed GPU clusters, and multi-stage orchestration instead of focusing on model innovation.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Configuration-driven job submission

Developers express post-training jobs as configuration files that select a recipe and plug in task-specific components.
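As an illustration of the configuration-driven pattern described above, the following is a minimal sketch rather than Netflix's actual API. The names `JobConfig`, `register_recipe`, `submit`, and the field names are all hypothetical. A config selects a recipe by name, and task-specific components are passed through as a free-form mapping.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

# Hypothetical registry: recipe name -> function that launches that recipe.
RECIPES: Dict[str, Callable[["JobConfig"], str]] = {}


def register_recipe(name: str):
    """Decorator that adds a recipe entry point to the registry."""
    def wrap(fn):
        RECIPES[name] = fn
        return fn
    return wrap


@dataclass
class JobConfig:
    recipe: str                      # which training recipe to run
    model: str                       # base model identifier
    dataset: str                     # dataset identifier
    components: Dict[str, Any] = field(default_factory=dict)  # task-specific plug-ins


@register_recipe("sft")
def run_sft(cfg: JobConfig) -> str:
    # A real recipe would launch distributed training; this only describes the job.
    return f"SFT on {cfg.model} with {cfg.dataset}"


@register_recipe("dpo")
def run_dpo(cfg: JobConfig) -> str:
    beta = cfg.components.get("beta", 0.1)  # assumed default, for illustration only
    return f"DPO on {cfg.model} with {cfg.dataset} (beta={beta})"


def submit(raw: Dict[str, Any]) -> str:
    """Validate a parsed config mapping and dispatch to the selected recipe."""
    cfg = JobConfig(**raw)
    if cfg.recipe not in RECIPES:
        raise ValueError(f"unknown recipe: {cfg.recipe!r}")
    return RECIPES[cfg.recipe](cfg)
```

Under these assumptions, switching a job from SFT to DPO is a one-field change in the config, which is the property that lets teams iterate without touching orchestration code.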

Tools used

PyTorch, Ray, vLLM, Verl, Mako, AWS, Hugging Face AutoTokenizer, FSDP, LoRA, FlexAttention

Outcome

Netflix shipped a managed post-training framework covering SFT, DPO, RL, and Knowledge Distillation, lowering the barrier for teams to iterate on advanced techniques. On-the-fly sequence packing improved effective token throughput by up to 4.7x for their most skewed dataset.
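To show why sequence packing helps on skewed length distributions, here is a minimal sketch using greedy first-fit-decreasing bin packing. This is an assumed, simplified algorithm chosen for illustration; the source does not describe which packing strategy Netflix uses, and the function names are hypothetical. Padding every sequence to `max_len` wastes most of each row when a few long sequences dominate; packing several short sequences into one row recovers that space.

```python
from typing import List


def pack_sequences(lengths: List[int], max_len: int) -> List[List[int]]:
    """Greedy first-fit-decreasing packing.

    Returns a list of rows; each row is a list of sequence indices whose
    total length does not exceed max_len.
    """
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    rows: List[List[int]] = []
    free: List[int] = []  # remaining capacity of each row
    for i in order:
        n = lengths[i]
        if n > max_len:
            raise ValueError(f"sequence {i} has length {n} > max_len {max_len}")
        for r, room in enumerate(free):
            if n <= room:          # first row with enough space
                rows[r].append(i)
                free[r] -= n
                break
        else:                      # no existing row fits: open a new one
            rows.append([i])
            free.append(max_len - n)
    return rows


def utilization(lengths: List[int], num_rows: int, max_len: int) -> float:
    """Fraction of token slots holding real tokens rather than padding."""
    return sum(lengths) / (num_rows * max_len)


lengths = [8, 1, 1, 2, 4]          # toy skewed batch
rows = pack_sequences(lengths, max_len=8)
padded_util = utilization(lengths, len(lengths), 8)   # one sequence per row
packed_util = utilization(lengths, len(rows), 8)      # after packing
```

A real implementation also has to keep packed sequences from attending to each other, for example with a block-diagonal attention mask; that part is omitted here.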

What failed first

The original SFT-centric SPMD architecture could not support on-policy RL workflows that emerged with DeepSeek-R1 and GRPO. Separately, binding to low-level tokenization libraries created a silent training-serving token skew that caused inexplicable quality regressions.
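One way to catch training-serving token skew early is a parity check that runs the same text through both tokenization paths and reports any disagreement. The sketch below is an assumption about how such a check could look, not the check Netflix built; `find_token_skew` and the toy tokenizers are hypothetical stand-ins for the real training-side and serving-side code paths.

```python
from typing import Callable, Iterable, List, Tuple

Tokenize = Callable[[str], List[int]]


def find_token_skew(
    train_tok: Tokenize,
    serve_tok: Tokenize,
    samples: Iterable[str],
) -> List[Tuple[str, List[int], List[int]]]:
    """Return (text, train_ids, serve_ids) for each sample where the paths disagree."""
    mismatches = []
    for text in samples:
        train_ids, serve_ids = train_tok(text), serve_tok(text)
        if train_ids != serve_ids:
            mismatches.append((text, train_ids, serve_ids))
    return mismatches


# Toy stand-ins: the serving path silently prepends a BOS token.
VOCAB = {"<bos>": 0, "hello": 1, "world": 2}


def train_side(text: str) -> List[int]:
    return [VOCAB[w] for w in text.split()]


def serve_side(text: str) -> List[int]:
    return [VOCAB["<bos>"]] + [VOCAB[w] for w in text.split()]


skew = find_token_skew(train_side, serve_side, ["hello world"])
```

Running a check like this over a sample of production prompts before each release turns a silent quality regression into an explicit, diffable failure.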

Results

Time saved: tripling that layer's execution time

Volume: up to 4.7x

Source

https://netflixtechblog.com/scaling-llm-post-training-at-netflix-0046f8790194


Grounding & classification

Source type: technical build writeup

26 fields verified against source quotes.

agentic workflow, personalization, recommendation system, failure mode described, metric backed, named customer, production runtime claimed, tools described, workflow described, media, employee productivity, throughput increase, technical build writeup, back office ops, agentic task execution