
Booking.com builds Granomaly: a statistical anomaly detection service for time series business metrics

Static thresholds and naive week-over-week comparison failed to reliably catch anomalies in fluctuating business metrics like daily sales or order volume, because an anomaly in one week became the flawed baseline for the next.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack.
Stage 1 · Historical data read from Graphite
Granomaly reads 4–5 weeks of historical data for a specific metric from Graphite.
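As a rough illustration of this stage, the sketch below builds a Graphite render API request that asks for the same metric shifted back by one to five weeks, then lines the returned series up slot by slot. The source does not say how Granomaly queries Graphite, so the use of timeShift, the host graphite.example, the metric name sales.daily, and both function names are assumptions made for this example.

```python
from urllib.parse import urlencode


def build_render_url(base_url, metric, weeks=5, window="-1d"):
    """Build a render API URL with one timeShift target per historical week.

    Assumption: week k of history is fetched as timeShift(metric, "<7k>d").
    """
    targets = [f'timeShift({metric}, "{7 * k}d")' for k in range(1, weeks + 1)]
    params = [("target", t) for t in targets]
    params += [("from", window), ("format", "json")]
    return f"{base_url}/render?{urlencode(params)}"


def align_by_offset(series_list):
    """Convert Graphite JSON series into one row of weekly values per time slot.

    Each series is a dict with a "datapoints" list of [value, timestamp] pairs,
    which is the shape the render API returns for format=json.
    """
    columns = [[value for value, _ts in s["datapoints"]] for s in series_list]
    return [list(row) for row in zip(*columns)]
```

Each row produced by align_by_offset holds the values the metric took at the same time of day and day of week in each earlier week, which is the input a range calculation needs.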
Tools used
Graphite, Grafana, Granomaly
Outcome

Granomaly produces a smoothed prediction range that Grafana uses to detect both sudden outages and slow, gradual declines. It handles overlapping historical anomalies and supports event-specific corrections. A simulation feature cut the parameter-tuning feedback loop from days to seconds.
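To make the idea of a smoothed prediction range concrete, here is a minimal sketch. It assumes the range for each time slot is the minimum and maximum of the values seen in earlier weeks, smoothed with a centered moving average. The source does not give Granomaly's actual formula, so the min/max choice, the window size, and all function names are hypothetical.

```python
def moving_average(values, window=3):
    """Centered moving average; edge slots average over the points available."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


def prediction_range(history_rows, window=3):
    """Return (lower, upper) bounds per time slot.

    history_rows[i] holds the values seen at slot i in each earlier week.
    Assumption: raw bounds are the weekly min and max, then smoothed.
    """
    lower = [min(row) for row in history_rows]
    upper = [max(row) for row in history_rows]
    return moving_average(lower, window), moving_average(upper, window)


def is_anomalous(value, low, high):
    """A point is flagged when it falls outside the predicted range."""
    return value < low or value > high
```

Because the bounds are in the metric's own units, they can be drawn as a band around the live series on a dashboard, which keeps the alert condition readable.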

What failed first

Several approaches were tried and abandoned before arriving at the final design. Z-score alerting caused a spike in false alarms at night, when user activity is low. Its output was also not human-readable, and Graphite lacked usable sliding-window support. A percentile-based range was distorted by overlapping past outages. An approach that excluded the most deviant historical week always removed a data point, even when no true outlier existed, which produced an unstable range.
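Two of these failure modes can be reproduced with toy numbers. The sketch below is an illustration only: the sample values are invented, the function names are hypothetical, and the explanation of the night-time false alarms (a small absolute change over a low, tight baseline yields a large z-score) is an assumed mechanism rather than one the source spells out.

```python
from statistics import mean, median, pstdev


def z_score(value, history):
    """Standard score of value against a window of historical points."""
    spread = pstdev(history)
    return (value - mean(history)) / spread if spread else 0.0


def range_dropping_most_deviant(week_values):
    """Drop the week farthest from the median, then take min and max.

    This always removes one point, whether or not a real outlier exists.
    """
    mid = median(week_values)
    most_deviant = max(week_values, key=lambda v: abs(v - mid))
    kept = list(week_values)
    kept.remove(most_deviant)
    return min(kept), max(kept)


# Night traffic: a jump from about 2 to 6 orders is tiny in absolute terms,
# yet the z-score is far above a typical alert threshold of 3.
night_z = z_score(6, [2, 3, 2, 3, 2])

# Healthy weeks: no outlier, but the upper bound still shrinks from 103 to 101.
healthy_range = range_dropping_most_deviant([98, 99, 100, 101, 103])

# One past outage (40): here dropping the most deviant week does help.
outage_range = range_dropping_most_deviant([100, 101, 40, 99, 102])
```

With these inputs the healthy weeks lose a legitimate data point, so the range narrows for no reason, while the same rule behaves as intended only when exactly one week was truly abnormal.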

Source

https://medium.com/booking-com-development/anomaly-detection-in-time-series-using-statistical-analysis-cc587b21d008


Grounding & classification
Source type: technical build writeup
15 fields verified against source quotes.
anomaly detection, failure mode described, named customer, production runtime claimed, source backed, tools described, workflow described, travel, time saved, technical build writeup, incident management, quality assurance, monitor detect alert