incident_management · ecommerce · workflow

Zalando builds AI-powered multi-stage LLM pipeline to transform two years of postmortems into actionable infrastructure insights

Zalando accumulated thousands of postmortem documents but could not extract strategic patterns at scale. Each postmortem takes 15–20 minutes to read, making company-wide retrospective analysis of years of incidents cognitively and practically impossible.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Postmortem corpus ingested

Thousands of postmortem documents are fed into the pipeline as input.
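The ingestion stage described above can be sketched as follows. This is a minimal illustration, not the pipeline from the source: the directory layout, the `*.md` file pattern, and the names `Postmortem` and `ingest_corpus` are all assumptions made for the example.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator


@dataclass(frozen=True)
class Postmortem:
    """One ingested document. Field names are hypothetical."""
    doc_id: str
    text: str


def ingest_corpus(root: Path, pattern: str = "*.md") -> Iterator[Postmortem]:
    """Yield one record per non-empty document found under `root`.

    Assumes each postmortem is a UTF-8 text file; the source does not
    say how the documents are actually stored.
    """
    for path in sorted(root.rglob(pattern)):
        text = path.read_text(encoding="utf-8").strip()
        if not text:
            continue  # skip empty files so later stages never see blank input
        yield Postmortem(doc_id=path.stem, text=text)
```

Yielding records lazily keeps memory use flat when the corpus runs to thousands of documents, and sorting the paths makes repeated runs process the documents in the same order.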

Tools used

Claude Sonnet 4, AWS Bedrock, NotebookLM, LM Studio

Outcome

The multi-stage LLM pipeline cut postmortem analysis time from days to hours and tripled productivity. It surfaced hidden patterns, including the finding that automated change validation could guard against 25% of subsequent datastore incidents. Even with the latest model, surface attribution error remains at approximately 10%, while hallucinations became negligible.

What failed first

An initial attempt using Google's NotebookLM produced severe hallucinations and lost incident context when generating summaries, reducing effective productivity rather than improving it. Small open-source models showed up to 40% hallucination probability, and a no-code agentic approach was ruled out due to performance limitations and inaccuracies.

Results

Time saved: analysis time reduced from days to hours

Volume: three times (productivity gain)

Cost replaced: approximately 10% (the reported residual surface attribution error)

Running since: two years of data (span of the postmortem corpus)

Source

https://engineering.zalando.com/posts/2025/09/dead-ends-or-data-goldmines-ai-powered-postmortem-analysis.html


Grounding & classification

Source type: technical build writeup

36 fields verified against source quotes.

data extraction, document classification, summarization, knowledge base, failure mode described, human review described, metric backed, named customer, production runtime claimed, source backed, tools described, workflow described, ecommerce, cycle time reduction, employee productivity, error reduction, technical build writeup, incident management, quality assurance, extract classify route, human review queue