
Netflix develops Advantage-Weighted Supervised Fine-Tuning (A-SFT) to post-train generative recommender systems on noisy reward signals

Generative recommenders trained purely by imitating observed user behavior can perpetuate suboptimal recommendations because user interactions are influenced by trends and external suggestions. Standard post-training techniques developed for LLMs (PPO, DPO) cannot be directly applied to recommendation systems due to the absence of counterfactual data, noisy reward models, and unknown logging policies.
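The general idea behind the name "advantage-weighted supervised fine-tuning" can be illustrated with a small sketch. This summary does not state the exact A-SFT objective, so everything below is an assumption: the function name `advantage_weighted_nll`, the exponential weighting with temperature `beta` (borrowed from the generic advantage-weighted regression pattern), and the `max_weight` clip are hypothetical choices, not Netflix's confirmed formulation. The sketch only shows how a per-example advantage (reward minus a baseline) could re-weight an otherwise ordinary behavior-cloning loss computed on logged data, without needing counterfactual samples or the logging policy.

```python
import math


def advantage_weighted_nll(log_probs, rewards, baselines, beta=1.0, max_weight=20.0):
    """Weighted negative log-likelihood over logged (user, item) examples.

    log_probs : model log-probability of each logged item
    rewards   : (possibly noisy) reward estimate for each logged item
    baselines : baseline reward for each example, e.g. an average reward
    beta      : temperature (assumed hyperparameter); larger values flatten the weights
    max_weight: clip on the weight so a few noisy rewards cannot dominate (assumed)
    """
    if not log_probs:
        raise ValueError("need at least one example")
    total = 0.0
    for log_prob, reward, baseline in zip(log_probs, rewards, baselines):
        advantage = reward - baseline
        weight = min(math.exp(advantage / beta), max_weight)
        total += -weight * log_prob
    return total / len(log_probs)
```

When every advantage is zero, all weights equal 1 and the loss reduces to the plain behavior-cloning negative log-likelihood, which is one way to see this family of methods as a re-weighted form of supervised fine-tuning.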

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear under "Tools used" below.

Stage 1 · User feedback signal collection

User feedback serves as the input to post-training. It includes explicit signals, such as ratings and reviews, and implicit signals, such as watch time and click-through rates.

Tools used

HSTU, PPO, DPO, GRPO, CQL, IPO

Outcome

A-SFT achieves better alignment between the pre-trained generative recommendation model and the reward model, outperforming baseline behavior cloning as well as reward-model-dependent algorithms (CQL, PPO, DPO, IPO) on recommendation metrics including NDCG, HR, and MRR.
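For readers unfamiliar with the ranking metrics named above, here is a minimal sketch of their standard binary-relevance definitions. The function names and the cutoff parameter `k` are illustrative choices; the source does not say which cutoffs or evaluation protocol were used.

```python
import math


def hit_rate_at_k(ranked, relevant, k):
    """HR@k: 1.0 if any relevant item appears in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant item, or 0.0 if none is ranked."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def ndcg_at_k(ranked, relevant, k):
    """NDCG@k with binary gains: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(position + 1) for position in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def mean(values):
    """Average a per-user metric; MRR is the mean of reciprocal_rank over users."""
    values = list(values)
    return sum(values) / len(values) if values else 0.0
```

Each function scores one user's ranked list against that user's set of relevant titles; the reported figure for a model is then the average over users.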

What failed first

Reward models trained for recommendation settings do not significantly outperform simple baselines (average user reward or average title reward), because users explore only a small subset of titles and their viewing choices exhibit permutation invariance that makes reward learning difficult.
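The two simple baselines mentioned above can be sketched as follows. The row layout `(user_id, title_id, reward)`, the function names, and the fallback to the global mean for unseen users or titles are assumptions made for illustration; the source does not describe how the baselines were computed or scored.

```python
from collections import defaultdict


def fit_mean_baseline(rows, key_index):
    """Average reward per key from rows of (user_id, title_id, reward).

    key_index=0 gives the average-user-reward baseline,
    key_index=1 gives the average-title-reward baseline.
    Returns (per-key means, global mean).
    """
    if not rows:
        raise ValueError("need at least one row")
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        sums[row[key_index]] += row[2]
        counts[row[key_index]] += 1
    means = {key: sums[key] / counts[key] for key in sums}
    global_mean = sum(row[2] for row in rows) / len(rows)
    return means, global_mean


def mean_squared_error(rows, means, global_mean, key_index):
    """Score a baseline on held-out rows; unseen keys fall back to the global mean."""
    if not rows:
        raise ValueError("need at least one row")
    errors = [(row[2] - means.get(row[key_index], global_mean)) ** 2 for row in rows]
    return sum(errors) / len(errors)
```

A learned reward model would be evaluated on the same held-out rows; the failure described above corresponds to its error not being meaningfully lower than either baseline's.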

Results

Volume: less than 4%

Source

https://netflixtechblog.com/post-training-generative-recommenders-with-advantage-weighted-supervised-finetuning-61a538d717a9


Grounding & classification

Source type: technical build writeup

20 fields verified against source quotes.

personalization, predictive analytics, recommendation system, failure mode described, metric backed, named customer, source backed, tools described, workflow described, media, accuracy improvement, technical build writeup