
Netflix Auto Remediation uses ML to resolve 56% of Spark memory configuration errors without human intervention

Netflix's data platform runs hundreds of thousands of workflows and millions of jobs daily. Its rule-based error classifier could not automatically remediate memory configuration errors or handle the roughly half of job failures that went unclassified, so each incident required costly manual engineering effort across teams.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear under “Tools used” below.

Stage 1 · Job failure triggers pipeline

Upon a job failure, the Scheduler calls Pensive to get the error classification.
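The hand-off in this stage can be pictured as a failure hook that asks a classifier for a label. The following is a minimal sketch only: `JobFailure`, `Classification`, `on_job_failure`, `stub_classifier`, and the category labels are hypothetical names chosen for illustration and are not taken from Pensive or the Netflix scheduler.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class JobFailure:
    job_id: str
    error_log: str


@dataclass
class Classification:
    # Label names are assumptions made for this sketch.
    category: str


def on_job_failure(
    failure: JobFailure,
    classify: Callable[[JobFailure], Classification],
) -> Classification:
    """Hypothetical scheduler hook: on failure, request an error classification."""
    return classify(failure)


def stub_classifier(failure: JobFailure) -> Classification:
    """Stand-in for the classification service; matches one well-known JVM error string."""
    if "OutOfMemoryError" in failure.error_log:
        return Classification("memory_config")
    return Classification("unclassified")


result = on_job_failure(
    JobFailure("job-1", "java.lang.OutOfMemoryError: Java heap space"),
    stub_classifier,
)
```

Passing the classifier in as a callable keeps the hook independent of how classification is actually performed, which is the only design point this sketch is meant to show.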

Tools used

Pensive, Nightingale, ConfigService, Metaflow, Netflix Maestro, Ax library, MLP

Outcome

Auto Remediation successfully remediates about 56% of all memory configuration errors without human intervention and reduces associated monetary costs by about 50% by applying correct configurations or disabling doomed retries.

What failed first

The rule-based classifier Pensive could classify errors but not fix them: memory configuration errors still required manual expert remediation, and unclassified errors caused jobs to retry repeatedly with the default policy, incurring unnecessary compute costs.
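The limitation described above can be illustrated with a small rule-based sketch: a rule can name the error, but nothing in the rule set changes the job's configuration, and anything no rule matches falls through to the default retry policy. The patterns, labels, and action names below are assumptions for illustration and do not reproduce Pensive's actual rules.

```python
import re
from typing import Optional

# Hypothetical rule table: (pattern, label). Real rule sets would be far larger.
RULES = [
    (re.compile(r"OutOfMemoryError|exceeding memory limits"), "memory_config"),
]


def classify(log: str) -> Optional[str]:
    """Return the label of the first matching rule, or None if unclassified."""
    for pattern, label in RULES:
        if pattern.search(log):
            return label
    return None


def next_action(log: str) -> str:
    """What a classify-only system can do with the label (action names are assumed)."""
    label = classify(log)
    if label == "memory_config":
        # The error is identified, but fixing the configuration is left to a person.
        return "manual_remediation"
    # No rule matched: the job simply retries under the default policy.
    return "retry_default"
```

For example, `next_action("java.lang.OutOfMemoryError: Java heap space")` yields `"manual_remediation"`, while a log that matches no rule yields `"retry_default"`.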

Results

Time saved: 600

Volume: 56%

Cost replaced: 50%

Source

https://netflixtechblog.com/evolving-from-rule-based-classifier-machine-learning-powered-auto-remediation-in-netflix-data-039d5efd115b


Grounding & classification

Source type: technical build writeup

25 fields verified against source quotes.

predictive analytics, failure mode described, metric backed, named customer, production runtime claimed, tools described, workflow described, media, automation rate, cost reduction, technical build writeup, back office ops, incident management, autonomous resolution, extract classify route