
Netflix formulates out-of-memory kill prediction on streaming devices as a machine learning classification problem

TVs and set-top boxes running Netflix have tighter memory constraints than general-purpose compute devices, which makes out-of-memory (OOM) kills a common cause of app crashes. Netflix needed a way to predict OOM kills in advance so it could take pre-emptive, device-specific actions to avoid a crash.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack.
Stage 1 · Device and runtime data collection
Device capability data distributed across more than three schemas in the Big Data Platform is joined into a single indexable schema alongside runtime memory readings.
Tools used
Big Data Platform, ANNs, XGBoost, AdaBoost, ElasticNet
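
As a rough illustration of the Stage 1 join, the sketch below merges device capability records from several tables into one row per device and then attaches runtime memory readings. The table names, field names (`total_ram_mb`, `sdk_version`, `max_resolution`, `used_mb`), and device IDs are all hypothetical; the source does not describe the actual schemas or the tooling used for the join.

```python
# Hypothetical capability tables, each keyed by a device ID.
# In the source these live in separate schemas; plain dicts stand in for them here.
hardware = {"tv-a": {"total_ram_mb": 1024}, "stb-b": {"total_ram_mb": 512}}
software = {"tv-a": {"sdk_version": "4.1"}, "stb-b": {"sdk_version": "3.7"}}
display = {"tv-a": {"max_resolution": "2160p"}}


def merge_capabilities(*tables):
    """Combine per-device fields from every table into one record per device."""
    merged = {}
    for table in tables:
        for device_id, fields in table.items():
            merged.setdefault(device_id, {}).update(fields)
    return merged


def attach_readings(capabilities, readings):
    """Join each runtime reading with its device's capability record.

    Readings from devices with no capability record are skipped.
    """
    rows = []
    for reading in readings:
        caps = capabilities.get(reading["device_id"])
        if caps is None:
            continue
        rows.append({**caps, **reading})
    return rows


# Hypothetical runtime memory readings.
readings = [
    {"device_id": "tv-a", "ts": 0, "used_mb": 300},
    {"device_id": "stb-b", "ts": 0, "used_mb": 450},
    {"device_id": "unknown", "ts": 0, "used_mb": 100},
]

capabilities = merge_capabilities(hardware, software, display)
rows = attach_readings(capabilities, readings)
```

Each element of `rows` then carries both static capability fields and a memory reading, which is the shape a classifier would typically consume as one training example.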
Outcome

The article describes the methodology for formulating OOM kill prediction as an ML classification problem, covering dataset curation, labeling strategy, and feature engineering. Actual model results and confusion matrices were redacted for confidentiality.
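
To make the classification framing concrete, here is a minimal labeling sketch. It assumes one common approach, which this summary does not confirm as the one Netflix used: a memory reading is labeled positive when an OOM kill follows within a fixed lookahead window. The function name `label_readings` and the `horizon_s` parameter are hypothetical.

```python
def label_readings(timestamps, kill_time, horizon_s):
    """Label each reading 1 if an OOM kill occurs within horizon_s seconds after it.

    timestamps: reading times in seconds for one session.
    kill_time:  time of the OOM kill in seconds, or None if the session ended normally.
    horizon_s:  assumed lookahead window in seconds (illustrative, not from the source).
    """
    labels = []
    for ts in timestamps:
        near_kill = kill_time is not None and 0 <= kill_time - ts <= horizon_s
        labels.append(1 if near_kill else 0)
    return labels


# A session that ends in an OOM kill at t=200 s, with a 90 s lookahead window.
killed_session = label_readings([0, 60, 120, 180], kill_time=200, horizon_s=90)

# A session with no OOM kill: every reading is a negative example.
normal_session = label_readings([0, 60], kill_time=None, horizon_s=90)
```

The width of the lookahead window is a design choice: a wider window gives more positive examples and earlier warning, at the cost of labeling readings that look less like an imminent kill.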

Results
Volume: over 99.1%
Source

https://netflixtechblog.com/formulating-out-of-memory-kill-prediction-on-the-netflix-app-as-a-machine-learning-problem-989599029109


Grounding & classification
Source type: technical build writeup
15 fields verified against source quotes.
anomaly detection, predictive analytics, named customer, tools described, workflow described, media, technical build writeup, incident management, monitor detect alert