Netflix develops Advantage-Weighted Supervised Fine-Tuning (A-SFT) to post-train generative recommender systems on noisy reward signals
Generative recommenders trained purely by imitating observed user behavior can perpetuate suboptimal recommendations, because user interactions are influenced by trends and external suggestions. Standard post-training techniques developed for LLMs (PPO, DPO) cannot be applied directly to recommendation systems, where counterfactual data is absent, reward models are noisy, and logging policies are unknown.
A-SFT achieves better alignment between the pre-trained generative recommendation model and the reward model. It outperforms baseline behavior cloning as well as reward-model-dependent algorithms (CQL, PPO, DPO, IPO) on recommendation metrics including NDCG (normalized discounted cumulative gain), HR (hit rate), and MRR (mean reciprocal rank).
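For readers unfamiliar with the three metrics, the following is a minimal sketch of how NDCG, HR, and MRR are commonly computed in next-item recommendation, assuming each user has exactly one held-out target item and the model returns a ranked list. The function names, the cutoff `k`, and the single-target setup are illustrative assumptions, not the evaluation protocol used by Netflix.

```python
import math

def rank_of(ranked_items, target, k):
    """Return the 1-based rank of `target` within the top-k items, or None if absent."""
    for position, item in enumerate(ranked_items[:k], start=1):
        if item == target:
            return position
    return None

def evaluate(predictions, targets, k=10):
    """Average HR@k, NDCG@k, and MRR@k over users (one target item per user, by assumption)."""
    ranks = [rank_of(p, t, k) for p, t in zip(predictions, targets)]
    n = len(ranks)
    hits = [r for r in ranks if r is not None]
    return {
        # HR@k: fraction of users whose target appears in the top k.
        "HR": len(hits) / n,
        # NDCG@k: with a single relevant item the ideal DCG is 1,
        # so the score reduces to 1 / log2(rank + 1).
        "NDCG": sum(1.0 / math.log2(r + 1) for r in hits) / n,
        # MRR@k: mean reciprocal rank, counting misses as 0.
        "MRR": sum(1.0 / r for r in hits) / n,
    }
```

All three metrics reward placing the target near the top of the list; HR ignores position within the top k, while NDCG and MRR discount lower ranks at different rates.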
Reward models trained for recommendation settings do not significantly outperform simple baselines (average user reward or average title reward), because users explore only a small subset of titles and their viewing choices exhibit permutation invariance that makes reward learning difficult.
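The two simple baselines mentioned above can be sketched as follows. This is an illustration only: the `(user, title, reward)` log format, the scalar reward, the global-mean fallback for unseen keys, and the use of mean squared error are all assumptions made for the example, not details from the source.

```python
from collections import defaultdict

def fit_mean_baseline(logs, key_index):
    """Fit a baseline that predicts the mean reward per key.

    `logs` is a list of (user, title, reward) tuples (assumed format).
    key_index=0 gives the average-user-reward baseline,
    key_index=1 gives the average-title-reward baseline.
    Keys not seen during fitting fall back to the global mean reward.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in logs:
        sums[row[key_index]] += row[2]
        counts[row[key_index]] += 1
    global_mean = sum(row[2] for row in logs) / len(logs)
    means = {key: sums[key] / counts[key] for key in sums}

    def predict(row):
        return means.get(row[key_index], global_mean)

    return predict

def mse(predict, logs):
    """Mean squared error of a reward predictor on logged interactions."""
    return sum((predict(row) - row[2]) ** 2 for row in logs) / len(logs)

# Tiny illustrative log; values are made up for the example.
train_logs = [("u1", "t1", 1.0), ("u1", "t2", 0.0), ("u2", "t1", 1.0)]
user_baseline = fit_mean_baseline(train_logs, key_index=0)
title_baseline = fit_mean_baseline(train_logs, key_index=1)
```

A learned reward model would be scored with the same `mse` function on held-out logs; the finding above is that such a model does not beat these per-user or per-title averages by a significant margin.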