
Zalando rebuilds ML pipeline for payment-default fraud detection on Amazon SageMaker

Zalando's second-generation Scala/Spark ML pipeline for detecting payment defaults had four main problems. It was tightly coupled to a single framework, which made modern Python libraries difficult to adopt. It relied on custom code that added maintenance burden. It suffered from memory issues, latency spikes, and slow instance startup. Its monolithic design fused feature preprocessing and model training into a single cluster.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools used” below.

Stage 1 · Training data preprocessing

Training data is preprocessed using a Databricks cluster and a scikit-learn batch transform job on SageMaker.
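The separation between fitting a preprocessing step and applying it as a batch transform can be sketched as follows. This is a minimal, standard-library-only illustration of the fit/transform contract that scikit-learn transformers follow. It is not Zalando's code: the class name `StandardScalerSketch`, the feature columns, and the sample rows are all hypothetical, and a real job would use an actual scikit-learn transformer.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class StandardScalerSketch:
    """Hypothetical stand-in for a fitted scikit-learn style transformer."""

    means: list
    stds: list

    @classmethod
    def fit(cls, rows):
        """Learn per-column mean and standard deviation from training rows."""
        columns = list(zip(*rows))
        means = [mean(col) for col in columns]
        # Replace a zero standard deviation with 1.0 so transform never divides by zero.
        stds = [pstdev(col) or 1.0 for col in columns]
        return cls(means, stds)

    def transform(self, rows):
        """Apply the learned statistics to a batch of rows."""
        return [
            [(x - m) / s for x, m, s in zip(row, self.means, self.stds)]
            for row in rows
        ]


# Hypothetical training rows: [order_amount, items_in_basket]
train_rows = [[10.0, 1.0], [20.0, 3.0], [30.0, 5.0]]

# Fit once on the training data, then reuse the fitted object for any batch.
scaler = StandardScalerSketch.fit(train_rows)
features = scaler.transform(train_rows)
```

Because the fitted object holds only the learned statistics, the same transform can run as its own batch step, and the training step receives ready-made features from whichever framework is in use.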

Tools used

Amazon SageMaker, zflow, AWS Step Functions, AWS Lambda, Databricks, scikit-learn, XGBoost, PyTorch, TensorFlow

Outcome

The new SageMaker-based pipeline is framework-independent, separates preprocessing clearly from training, and cuts scale-up time by 50%. Load tests confirm a single ml.m5.large instance handles 200 requests/second with p99 latency under 80 ms.
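To make the p99 figure concrete, here is a minimal sketch of how a tail-latency percentile can be computed from load-test samples and checked against a budget. The source does not say how its percentile was calculated; the nearest-rank method, the `percentile` helper, and the sample latencies below are assumptions chosen for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical load-test latencies in milliseconds: 99 fast requests and one slow outlier.
latencies_ms = [40.0] * 99 + [120.0]

LATENCY_BUDGET_MS = 80.0  # threshold taken from the outcome statement above

p99 = percentile(latencies_ms, 99)
meets_budget = p99 < LATENCY_BUDGET_MS
```

In this made-up sample the single slow request does not move p99, which is why tail percentiles are usually reported alongside the maximum rather than in place of it.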

What failed first

An original Python/scikit-learn ML setup was replaced in 2015 by a Scala/Spark system to scale better, but this second-generation system accumulated its own technical pain points that necessitated a third migration.

Results

Time saved: 50%

Volume: 99.9%

Cost replaced: up to 200%

Source

https://engineering.zalando.com/posts/2021/02/machine-learning-pipeline-with-real-time-inference.html


Grounding & classification

Source type: technical build writeup

28 fields verified against source quotes.

Tags: fraud detection, predictive analytics, failure mode described, metric backed, named customer, production runtime claimed, tools described, workflow described, ecommerce, cycle time reduction, throughput increase, technical build writeup, ecommerce ops, finance ops, extract classify route