Democratizing Machine Learning at Netflix: Building the Model Lifecycle Graph
As Netflix's ML investments scaled across business domains, models became black boxes with no discovery infrastructure. Practitioners had to traverse fragmented, siloed tools to answer basic questions about lineage, ownership, and impact, and cross-domain reuse of ML assets was extraordinarily difficult.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases rather than tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Source system event ingestion
Source systems emit thin events containing an identifier and event type via Kafka and AWS SNS/SQS.
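To make the idea of a thin event concrete, here is a minimal sketch in Python. The field names (entity_id, event_type), the identifier format, and the JSON encoding are assumptions for illustration; the source only says that an event carries an identifier and an event type. Publishing to Kafka or SNS/SQS is left out so the sketch stays self-contained.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class ThinEvent:
    """A thin event: says what changed, without carrying the full record."""
    entity_id: str   # identifier of the changed entity (hypothetical format)
    event_type: str  # e.g. "created", "updated" (hypothetical values)


def encode(event: ThinEvent) -> bytes:
    """Serialize the event into the payload a producer would publish."""
    return json.dumps(asdict(event), sort_keys=True).encode("utf-8")


def decode(payload: bytes) -> ThinEvent:
    """Parse a payload back into a ThinEvent, rejecting unexpected fields."""
    data = json.loads(payload.decode("utf-8"))
    if set(data) != {"entity_id", "event_type"}:
        raise ValueError(f"not a thin event: {sorted(data)}")
    return ThinEvent(entity_id=data["entity_id"], event_type=data["event_type"])


payload = encode(ThinEvent(entity_id="model:1234", event_type="updated"))
event = decode(payload)
```

Keeping the payload this small means a consumer learns only which entity changed and how; anything else it needs would have to be looked up separately, which is a design assumption of this sketch rather than a documented detail.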
Tools used
Kafka, AWS SNS/SQS, Datomic, Elasticsearch
Outcome
Netflix built MDS (Metadata Service) with the Model Lifecycle Graph, enabling every ML practitioner to discover, understand, and reuse ML assets across all domains through the AIP Portal, replacing multi-system manual investigation with single graph queries.
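As a rough illustration of what a single graph query can look like, the sketch below builds a tiny in-memory dependency graph and asks which assets sit downstream of one model. The LifecycleGraph class, the node names, and the edge direction are all hypothetical; this is not the MDS schema or its query interface, only a generic breadth-first traversal.

```python
from collections import defaultdict, deque


class LifecycleGraph:
    """Tiny directed graph: an edge src -> dst means dst depends on src."""

    def __init__(self):
        self._out = defaultdict(set)

    def add_edge(self, src: str, dst: str) -> None:
        self._out[src].add(dst)

    def downstream(self, start: str) -> set:
        """Return every node reachable from start, excluding start itself."""
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in self._out.get(node, ()):
                if nxt not in seen and nxt != start:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen


# Hypothetical assets: a pipeline, two models, and two A/B tests.
graph = LifecycleGraph()
graph.add_edge("pipeline:train-ranker", "model:ranker-v2")
graph.add_edge("model:ranker-v2", "abtest:homepage-rows")
graph.add_edge("model:ranker-v2", "model:ranker-ensemble")
graph.add_edge("model:ranker-ensemble", "abtest:search-results")

# One traversal answers "what is affected if this model changes?"
impacted = graph.downstream("model:ranker-v2")
ab_tests = sorted(n for n in impacted if n.startswith("abtest:"))
```

Here ab_tests holds both A/B tests, including the one reached only through the ensemble model, which is the kind of indirect dependency that is hard to find when each tool sees only its own records.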
What failed first
The siloed tooling left each system unaware of the others — the model registry did not know which A/B tests used its models, the pipeline orchestrator was unaware of downstream model dependencies, and practitioners had no way to answer cross-domain impact or lineage questions.