
Netflix formulates out-of-memory kill prediction on streaming devices as a machine learning classification problem

TVs and set-top boxes running Netflix have tighter memory constraints than general-purpose computing devices, which makes out-of-memory (OOM) kills a common cause of app crashes. Netflix needed a way to predict OOM kills in advance so it could take pre-emptive, device-specific actions to avoid crashes.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Device and runtime data collection

Device capability data distributed across more than three schemas in the Big Data Platform is joined into a single indexable schema alongside runtime memory readings.
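The join described in this stage can be sketched roughly as follows. This is a minimal illustration in plain Python, not the pipeline Netflix runs: the three capability tables, the `device_id` key, and every field name are hypothetical stand-ins for the schemas mentioned above.

```python
from collections import defaultdict

# Hypothetical capability tables, each keyed by a device type id.
# The real schemas and field names are not given in the source.
hardware = {"tv-a": {"total_ram_mb": 1024}, "stb-b": {"total_ram_mb": 512}}
software = {"tv-a": {"sdk_version": "5.1"}, "stb-b": {"sdk_version": "4.3"}}
display = {"tv-a": {"max_resolution": "4k"}}


def build_device_index(*tables):
    """Merge several capability tables into one record per device id."""
    index = defaultdict(dict)
    for table in tables:
        for device_id, fields in table.items():
            index[device_id].update(fields)
    return dict(index)


def join_readings(readings, device_index):
    """Attach the merged capability fields to each runtime memory reading."""
    joined = []
    for reading in readings:
        capabilities = device_index.get(reading["device_id"], {})
        joined.append({**capabilities, **reading})
    return joined


# Hypothetical runtime memory readings.
readings = [
    {"device_id": "tv-a", "ts": 0, "memory_used_mb": 610},
    {"device_id": "stb-b", "ts": 0, "memory_used_mb": 470},
]

device_index = build_device_index(hardware, software, display)
rows = join_readings(readings, device_index)
```

Each entry in `rows` then carries both the static capabilities and the runtime reading for one device, which is the shape a downstream classifier would typically consume. Devices missing from a capability table simply lack those fields in this sketch; a production pipeline would need an explicit policy for such gaps.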

Tools used

Big Data Platform, ANNs, XGBoost, AdaBoost, ElasticNet

Outcome

The article describes the methodology for formulating OOM kill prediction as an ML classification problem, covering dataset curation, labeling strategy, and feature engineering. Actual model results and confusion matrices were redacted for confidentiality.
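To make the classification framing concrete, here is one possible labeling scheme, offered only as a sketch: a memory reading is marked positive when an OOM kill on the same device follows within a lookahead window. The window length (`lookahead_s`), the field names, and the `headroom` feature are all assumptions made for illustration; the summary above does not specify the labeling rule or the features that were actually used.

```python
def label_readings(readings, kill_times, lookahead_s=120):
    """Label each reading 1 if an OOM kill follows within lookahead_s seconds.

    readings:   list of dicts with device_id, ts, memory_used_mb, total_ram_mb
    kill_times: dict mapping device_id to a list of OOM kill timestamps
    All names and the default window length are hypothetical.
    """
    labeled = []
    for r in readings:
        kills = kill_times.get(r["device_id"], [])
        # Positive only when a kill happens at or after the reading,
        # and no later than lookahead_s seconds afterwards.
        label = int(any(0 <= k - r["ts"] <= lookahead_s for k in kills))
        # Illustrative engineered feature: fraction of device memory still free.
        headroom = 1.0 - r["memory_used_mb"] / r["total_ram_mb"]
        labeled.append({**r, "headroom": headroom, "oom_kill_soon": label})
    return labeled


sample_readings = [
    {"device_id": "stb-b", "ts": 0, "memory_used_mb": 300, "total_ram_mb": 512},
    {"device_id": "stb-b", "ts": 200, "memory_used_mb": 480, "total_ram_mb": 512},
    {"device_id": "tv-a", "ts": 200, "memory_used_mb": 600, "total_ram_mb": 1024},
]
sample_kills = {"stb-b": [260]}

labeled = label_readings(sample_readings, sample_kills)
```

With this sample data only the second reading is labeled positive, since the kill at `ts=260` falls 60 seconds after it. The resulting rows could be passed to any of the classifiers listed under "Tools used"; the choice of window trades earlier warning against noisier labels.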

Results

Volume: over 99.1%

Source

https://netflixtechblog.com/formulating-out-of-memory-kill-prediction-on-the-netflix-app-as-a-machine-learning-problem-989599029109


Grounding & classification

Source type: technical build writeup

15 fields verified against source quotes.

anomaly detection, predictive analytics, named customer, tools described, workflow described, media, technical build writeup, incident management, monitor detect alert