Netflix's Axion ML Fact Store eliminates training-serving skew and reduces offline feature regeneration from weeks to hours
Netflix's ML models train on weeks of historical data, so testing an updated feature encoder meant waiting weeks for feature logging to accumulate enough data. That made experimentation slow and left room for training-serving skew.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases rather than tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Production inference runs
Compute applications fetch member and video facts from gRPC services, run shared feature encoders, and score ML models to generate personalized recommendations.
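The stage above can be sketched in a few lines. This is a minimal illustration, not Netflix's code: the `Facts` shape, `encode_features`, `score`, and the toy weights are all hypothetical, and a plain object stands in for a gRPC response. It shows a single shared encoder turning raw facts into features, whether those facts arrive at request time or are replayed later from a log, so both paths produce identical features.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Facts:
    """Raw member and video facts (hypothetical shape, for illustration only)."""
    member_id: int
    video_id: int
    member_watch_hours: float
    video_runtime_minutes: float


def encode_features(facts: Facts) -> list[float]:
    """Shared feature encoder: the single place where raw facts become features."""
    runtime_hours = facts.video_runtime_minutes / 60.0
    return [facts.member_watch_hours, runtime_hours]


def score(features: list[float], weights: list[float]) -> float:
    """Toy linear model standing in for the real ML model."""
    return sum(f * w for f, w in zip(features, weights))


# Online path: facts fetched at request time (stand-in for a gRPC response).
live_facts = Facts(member_id=1, video_id=42,
                   member_watch_hours=10.0, video_runtime_minutes=90.0)
online_features = encode_features(live_facts)

# Offline path: the same facts, logged and replayed through the same encoder.
logged_facts = Facts(member_id=1, video_id=42,
                     member_watch_hours=10.0, video_runtime_minutes=90.0)
offline_features = encode_features(logged_facts)

online_score = score(online_features, [0.5, 2.0])
```

Because the facts, not the features, are what gets stored, changing `encode_features` only requires re-running it over existing facts rather than waiting for new feature logs to accumulate.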
Axion reduces offline feature regeneration from weeks to hours. EVCache queries run 3x to 50x faster than Iceberg queries, and data quality monitoring detects more than 95% of data issues early. As a result, Axion has become the de facto fact store for Netflix's Personalization ML models.
What failed first
Feature logging required weeks of waiting for data. ETL with normalized multi-table storage caused Spark shuffle issues at scale. Even a single denormalized Iceberg table was too slow for queries filtering hundreds of millions of rows to under a million, and bloom filters plus predicate pushdown were insufficient.
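The gap between a highly selective table filter and a keyed lookup can be illustrated with a toy sketch. Everything here is an assumption for illustration: an in-memory list stands in for a denormalized table and a dict stands in for a key-value cache, and no Iceberg or EVCache API is used. Real table formats prune files with statistics and bloom filters, so they rarely examine every row, but the work still grows with the data scanned rather than with the number of keys requested.

```python
# Toy fact rows; a real table would hold hundreds of millions of rows.
rows = [{"member_id": i, "fact": f"fact-{i}"} for i in range(100_000)]
wanted = {7, 512, 99_999}


def scan_filter(table, keys):
    """Scan-style query: examine every row and keep the few that match."""
    examined = 0
    hits = []
    for row in table:
        examined += 1
        if row["member_id"] in keys:
            hits.append(row)
    return hits, examined


# Key-value layout: rows indexed by the lookup key ahead of time.
index = {row["member_id"]: row for row in rows}


def keyed_lookup(store, keys):
    """Keyed query: touch only the requested keys."""
    hits = [store[k] for k in sorted(keys) if k in store]
    return hits, len(keys)


scan_hits, scan_examined = scan_filter(rows, wanted)
kv_hits, kv_examined = keyed_lookup(index, wanted)
```

Both functions return the same three rows, but the scan examines all 100,000 rows while the keyed lookup touches only the three requested keys.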