Netflix builds an internal LLM post-training framework scaling from SFT to on-policy RL
At Netflix scale, post-training LLMs became as much an engineering problem as a modeling one: researchers had to manage complex data pipelines, distributed GPU clusters, and multi-stage orchestration instead of focusing on model innovation.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Configuration-driven job submission
Developers express post-training jobs as configuration files that select a recipe and plug in task-specific components.
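A minimal sketch of what configuration-driven submission can look like, assuming a simple recipe registry. The names `JobConfig`, `register_recipe`, and `submit`, and all field names, are hypothetical and chosen for illustration; the source does not describe Netflix's actual config schema or API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class JobConfig:
    """Hypothetical job description; field names are illustrative only."""
    recipe: str                       # which training recipe to run, e.g. "sft"
    model: str                        # placeholder model identifier
    dataset: str                      # placeholder dataset identifier
    components: Dict[str, str] = field(default_factory=dict)  # task-specific plug-ins


# Hypothetical registry mapping recipe names to entry points.
RECIPES: Dict[str, Callable[[JobConfig], str]] = {}


def register_recipe(name: str):
    """Decorator that records a recipe function under the given name."""
    def decorator(fn):
        RECIPES[name] = fn
        return fn
    return decorator


@register_recipe("sft")
def run_sft(cfg: JobConfig) -> str:
    # A real recipe would launch distributed training; this only reports the plan.
    return f"SFT on {cfg.model} with {cfg.dataset}"


@register_recipe("dpo")
def run_dpo(cfg: JobConfig) -> str:
    return f"DPO on {cfg.model} with {cfg.dataset}"


def submit(raw: dict) -> str:
    """Validate a raw config mapping and dispatch to the selected recipe."""
    cfg = JobConfig(**raw)
    if cfg.recipe not in RECIPES:
        raise ValueError(f"unknown recipe: {cfg.recipe!r}")
    return RECIPES[cfg.recipe](cfg)


job = {
    "recipe": "sft",
    "model": "example-model",
    "dataset": "example-dataset",
    "components": {"loss": "cross_entropy"},
}
print(submit(job))
```

The design point is that switching from SFT to DPO becomes a one-field change in the config rather than a new training script.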
Tools used
PyTorch, Ray, vLLM, Verl, Mako, AWS, Hugging Face AutoTokenizer, FSDP, LoRA, FlexAttention
Outcome
Netflix shipped a managed post-training framework covering SFT, DPO, RL, and Knowledge Distillation, lowering the barrier for teams to iterate on advanced techniques. On-the-fly sequence packing improved effective token throughput by up to 4.7x for their most skewed dataset.
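To illustrate why sequence packing helps on length-skewed data, here is a minimal sketch using first-fit-decreasing bin packing over sequence lengths. This is an assumed, generic packing strategy, not Netflix's implementation; `pack_sequences`, `utilization`, and the example lengths are hypothetical. In practice packing runs inside the data loader, and packed rows typically need a document-level attention mask so sequences do not attend to each other.

```python
from typing import List


def pack_sequences(lengths: List[int], max_len: int) -> List[List[int]]:
    """Greedy first-fit-decreasing: put each sequence in the first row with room."""
    rows: List[List[int]] = []   # sequence lengths assigned to each packed row
    free: List[int] = []         # remaining token capacity of each row
    for n in sorted(lengths, reverse=True):
        if n > max_len:
            raise ValueError(f"sequence of length {n} exceeds max_len {max_len}")
        for i, room in enumerate(free):
            if n <= room:
                rows[i].append(n)
                free[i] -= n
                break
        else:
            # No existing row had room, so open a new one.
            rows.append([n])
            free.append(max_len - n)
    return rows


def utilization(num_rows: int, lengths: List[int], max_len: int) -> float:
    """Fraction of token slots holding real tokens rather than padding."""
    return sum(lengths) / (num_rows * max_len)


# Made-up, heavily skewed length distribution: a few long sequences, many short.
lengths = [4096, 512, 256, 128, 128, 64, 3000, 1000]
max_len = 4096

packed = pack_sequences(lengths, max_len)
padded_util = utilization(len(lengths), lengths, max_len)  # one sequence per row
packed_util = utilization(len(packed), lengths, max_len)   # after packing

print(f"rows: {len(lengths)} -> {len(packed)}")
print(f"utilization: {padded_util:.2f} -> {packed_util:.2f}")
```

The more skewed the length distribution, the more padding a one-sequence-per-row batch wastes, which is why the gain is largest on the most skewed dataset.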
What failed first
The original SFT-centric SPMD architecture could not support on-policy RL workflows that emerged with DeepSeek-R1 and GRPO. Separately, binding to low-level tokenization libraries created a silent training-serving token skew that caused inexplicable quality regressions.
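One generic way to catch this kind of skew is a parity check that runs the same text through both tokenization paths and reports any difference in token IDs. The sketch below is an assumption about how such a check could look, not how Netflix detected or fixed the problem; `find_token_skew` and the two toy tokenizers are hypothetical stand-ins for a real training and serving path.

```python
from typing import Callable, Iterable, List, Tuple

Tokenizer = Callable[[str], List[int]]


def find_token_skew(
    train_tok: Tokenizer,
    serve_tok: Tokenizer,
    samples: Iterable[str],
) -> List[Tuple[str, List[int], List[int]]]:
    """Return every sample whose token IDs differ between the two paths."""
    mismatches = []
    for text in samples:
        train_ids, serve_ids = train_tok(text), serve_tok(text)
        if train_ids != serve_ids:
            mismatches.append((text, train_ids, serve_ids))
    return mismatches


# Toy stand-ins: both map characters to code points, but the hypothetical
# serving path strips surrounding whitespace first.
def train_tokenize(text: str) -> List[int]:
    return [ord(c) for c in text]


def serve_tokenize(text: str) -> List[int]:
    return [ord(c) for c in text.strip()]


samples = ["hello", " hello ", "a b"]
mismatches = find_token_skew(train_tokenize, serve_tokenize, samples)
for text, train_ids, serve_ids in mismatches:
    print(f"skew on {text!r}: {len(train_ids)} vs {len(serve_ids)} tokens")
```

Because the skew is silent, a check like this is most useful when run automatically over a sample of real prompts, including edge cases such as leading whitespace and special tokens.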