
How Meta detects and mitigates silent data corruptions across its AI hardware fleet

Silent Data Corruptions (SDCs) — hardware errors that cause miscomputation without leaving detectable traces — significantly threaten AI training and inference reliability at Meta. Over 66% of training interruptions stem from hardware failures, and SDC rates have risen to about one fault per thousand devices as silicon density in accelerators has increased.
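To see why a rate of about one fault per thousand devices matters at training scale, the arithmetic can be sketched as below. This is an illustrative calculation only: it assumes the figure can be read as an independent per-device probability of 0.001, which the source does not state, and the job sizes and the function name `prob_any_sdc` are hypothetical.

```python
def prob_any_sdc(num_devices: int, per_device_rate: float = 1e-3) -> float:
    """Probability that at least one device in a job is affected.

    Assumption: each device is affected independently with the same
    probability (per_device_rate). Real fleets may not behave this way.
    """
    return 1.0 - (1.0 - per_device_rate) ** num_devices


if __name__ == "__main__":
    # Hypothetical job sizes, chosen only to show how the risk compounds.
    for n in (8, 1024, 16384):
        print(f"{n:>6} devices -> P(at least one SDC) = {prob_any_sdc(n):.4f}")
```

Under these assumptions, the chance that a job touches at least one affected device grows quickly with the number of devices, which is one way to read the source's concern about large training runs.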

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · SDC fault occurs silently
Hardware miscomputes without leaving detectable traces, leading applications to consume incorrect results.
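A minimal sketch of what "silent" means here, assuming a single bit flip in one operand of a dot product. It is not how Fleetscanner, Ripple, or Hardware Sentinel work; the helpers `flip_bit` and `dot` and the sample values are hypothetical, and the comparison against a known-good result is just a generic way to make the corruption visible.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return value with one bit of its IEEE 754 float64 encoding flipped."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def dot(xs, ys):
    """Plain dot product; raises no error on corrupted but finite inputs."""
    return sum(x * y for x, y in zip(xs, ys))


weights = [0.5, -1.25, 2.0]
inputs = [1.0, 2.0, 3.0]

# Known-good result computed from uncorrupted operands.
golden = dot(weights, inputs)

# Simulate a hardware fault: flip the top mantissa bit of the last weight.
corrupted_weights = [weights[0], weights[1], flip_bit(weights[2], 51)]

# The computation completes normally and returns a finite, plausible value.
silent = dot(corrupted_weights, inputs)

# Only a comparison against a trusted reference reveals the problem.
mismatch = silent != golden
print(f"golden={golden}, corrupted={silent}, mismatch={mismatch}")
```

No exception, NaN, or log entry is produced by the corrupted run, which is why an application would consume the wrong value unless something compares it against a trusted result.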
Tools used
Fleetscanner, Ripple, Hardware Sentinel, PyTorch, ServiceLab
Outcome

Meta deployed three complementary SDC detection mechanisms — Fleetscanner, Ripple, and Hardware Sentinel — fully productionized at scale across AI and non-AI infrastructure. Hardware Sentinel outperforms testing-based methods by 41% across architectures, applications, and data centers.

Results
Volume: over 66%
Running since: 2022
Source

https://engineering.fb.com/2025/07/22/data-infrastructure/how-meta-keeps-its-ai-hardware-reliable/?utm_source=substack&utm_medium=email


Grounding & classification
Source type: technical build writeup
25 fields verified against source quotes.
Tags: anomaly detection, quality inspection, failure mode described, metric backed, named customer, production runtime claimed, tools described, workflow described, software, accuracy improvement, error reduction, technical build writeup, incident management, quality assurance, monitor detect alert