
Netflix builds an internal LLM post-training framework scaling from SFT to on-policy RL

At Netflix scale, post-training LLMs became as much an engineering problem as a modeling one: researchers had to manage complex data pipelines, distributed GPU clusters, and multi-stage orchestration instead of focusing on model innovation.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases, not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Configuration-driven job submission
Developers express post-training jobs as configuration files that select a recipe and plug in task-specific components.
Tools used
PyTorch, Ray, vLLM, Verl, Mako, AWS, Hugging Face AutoTokenizer, FSDP, LoRA, FlexAttention
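The configuration-driven submission described in Stage 1 can be illustrated with a minimal sketch. This is not Netflix's framework: the registry, the `JobConfig` fields, the recipe names, and the `submit` function are all hypothetical, chosen only to show how a config can select a recipe and plug in task-specific components.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict

# Hypothetical registry: recipe name -> function that runs the job.
RECIPES: Dict[str, Callable[["JobConfig"], str]] = {}


def register_recipe(name: str):
    """Decorator that adds a recipe function to the registry."""
    def decorator(fn):
        RECIPES[name] = fn
        return fn
    return decorator


@dataclass
class JobConfig:
    # All field names are illustrative assumptions, not a real schema.
    recipe: str
    model: str
    dataset: str
    components: Dict[str, Any] = field(default_factory=dict)


@register_recipe("sft")
def run_sft(cfg: JobConfig) -> str:
    # A task-specific component (here, the loss name) is read from the config.
    loss = cfg.components.get("loss", "cross_entropy")
    return f"sft: model={cfg.model} dataset={cfg.dataset} loss={loss}"


@register_recipe("dpo")
def run_dpo(cfg: JobConfig) -> str:
    beta = cfg.components.get("beta", 0.1)
    return f"dpo: model={cfg.model} dataset={cfg.dataset} beta={beta}"


def submit(raw: Dict[str, Any]) -> str:
    """Validate a parsed config and dispatch to the selected recipe."""
    cfg = JobConfig(**raw)
    if cfg.recipe not in RECIPES:
        raise ValueError(
            f"unknown recipe {cfg.recipe!r}; expected one of {sorted(RECIPES)}"
        )
    return RECIPES[cfg.recipe](cfg)
```

In this sketch, switching a job from SFT to DPO means changing the `recipe` value in the config, not the training code, which is the property the stage description emphasizes.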
Outcome

Netflix shipped a managed post-training framework covering SFT, DPO, RL, and Knowledge Distillation, lowering the barrier for teams to iterate on advanced techniques. On-the-fly sequence packing improved effective token throughput by up to 4.7x for their most skewed dataset.
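To see why sequence packing helps most on skewed length distributions, here is a minimal sketch. It assumes a greedy first-fit strategy, which is one common choice and not necessarily what Netflix uses; the function names and the toy lengths are hypothetical. It compares the fraction of useful (non-padding) tokens when each sequence is padded to the maximum length with the fraction when short sequences share a row.

```python
from typing import List


def pack_first_fit(lengths: List[int], max_len: int) -> List[List[int]]:
    """Place each sequence length into the first row that still has room."""
    rows: List[List[int]] = []
    free: List[int] = []  # remaining capacity of each row
    for n in lengths:
        if n > max_len:
            raise ValueError(f"sequence of length {n} exceeds max_len={max_len}")
        for i, room in enumerate(free):
            if n <= room:
                rows[i].append(n)
                free[i] -= n
                break
        else:
            # No existing row fits this sequence, so open a new one.
            rows.append([n])
            free.append(max_len - n)
    return rows


def padded_utilization(lengths: List[int], max_len: int) -> float:
    """Useful-token fraction with one sequence per row, padded to max_len."""
    return sum(lengths) / (len(lengths) * max_len)


def packed_utilization(lengths: List[int], max_len: int) -> float:
    """Useful-token fraction after first-fit packing."""
    rows = pack_first_fit(lengths, max_len)
    return sum(lengths) / (len(rows) * max_len)


# Toy, skewed length distribution: one long sequence and several short ones.
lengths = [8, 1, 1, 2, 2, 2]
rows = pack_first_fit(lengths, max_len=8)
```

A real training loop would also need an attention mask that keeps sequences packed into the same row from attending to each other; that part is omitted here.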

What failed first

The original SFT-centric SPMD architecture could not support on-policy RL workflows that emerged with DeepSeek-R1 and GRPO. Separately, binding to low-level tokenization libraries created a silent training-serving token skew that caused inexplicable quality regressions.
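A training-serving token skew of this kind can be caught with a simple parity check: tokenize the same prompts through both paths and compare the IDs. The sketch below is an illustration under stated assumptions, not Netflix's tooling. The two toy tokenizers and the dropped-BOS discrepancy are hypothetical stand-ins for whatever tokenizers the training and serving stacks actually use; the source does not say what caused the skew.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Tokenizer = Callable[[str], List[int]]

# Toy vocabulary used only for this illustration.
VOCAB = {"<bos>": 0, "hello": 1, "world": 2}


def serve_tokenizer(text: str) -> List[int]:
    """Hypothetical serving path: prepends a BOS token."""
    return [VOCAB["<bos>"]] + [VOCAB[w] for w in text.split()]


def train_tokenizer(text: str) -> List[int]:
    """Hypothetical training path: same vocabulary, but no BOS token."""
    return [VOCAB[w] for w in text.split()]


def find_token_skew(
    train_tok: Tokenizer, serve_tok: Tokenizer, prompts: Sequence[str]
) -> Dict[str, Tuple[List[int], List[int]]]:
    """Return prompts whose token IDs differ between the two paths."""
    mismatches: Dict[str, Tuple[List[int], List[int]]] = {}
    for text in prompts:
        train_ids, serve_ids = train_tok(text), serve_tok(text)
        if train_ids != serve_ids:
            mismatches[text] = (train_ids, serve_ids)
    return mismatches


skew = find_token_skew(train_tokenizer, serve_tokenizer, ["hello world"])
```

Running a check like this against a fixed prompt set before each training run turns a silent skew into an explicit failure.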

Results
Time saved: tripling that layer's execution time
Volume: up to 4.7x
Source

https://netflixtechblog.com/scaling-llm-post-training-at-netflix-0046f8790194


Grounding & classification
Source type: technical build writeup
26 fields verified against source quotes.
agentic workflow, personalization, recommendation system, failure mode described, metric backed, named customer, production runtime claimed, tools described, workflow described, media, employee productivity, throughput increase, technical build writeup, back office ops, agentic task execution