
Zalando's Machine Learning Platform: from experimentation notebooks to production pipelines at scale

The core challenge was bridging the gap between notebook-based ML experimentation and production-grade pipelines. Jupyter notebooks do not meet production requirements (security, reproducibility, observability, performance), and writing CloudFormation templates for pipelines by hand was verbose and error-prone.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear under "Tools used" below.
Stage 1 · Explore data in Datalab
ML practitioners use Datalab (hosted JupyterHub) to query data sources, visualize results, and validate hypotheses before building production pipelines.
Tools used
JupyterHub, Databricks, Apache Spark, Amazon SageMaker, AWS Step Functions, AWS CDK, CloudFormation, AWS Lambda, S3, BigQuery, MicroStrategy, Backstage, zflow, HPC
Outcome

zflow has been used to create hundreds of ML pipelines at Zalando, and the tooling abstracts away infrastructure complexity so ML practitioners can focus on their domain rather than the infrastructure.

What failed first

CloudFormation templates for Step Functions pipelines became too verbose and tedious to edit manually at scale, requiring an internal abstraction layer (zflow) to remain maintainable.
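
To make the idea of an abstraction layer concrete, here is a minimal sketch of how a few lines of Python can generate the repetitive parts of a Step Functions state machine definition (Amazon States Language). This is not zflow's actual API, which the source does not show. The `Step` class, the `build_state_machine` helper, and the Lambda ARNs are hypothetical placeholders chosen for illustration, and a real definition would also carry parameters, retries, and error handling.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """One pipeline stage (hypothetical helper, not part of zflow)."""
    name: str
    resource: str  # placeholder ARN of the task that runs this stage


def build_state_machine(steps: List[Step]) -> dict:
    """Chain steps sequentially into an Amazon States Language definition."""
    if not steps:
        raise ValueError("a pipeline needs at least one step")
    states = {}
    for i, step in enumerate(steps):
        state = {"Type": "Task", "Resource": step.resource}
        if i + 1 < len(steps):
            # Every state except the last points at its successor.
            state["Next"] = steps[i + 1].name
        else:
            # The final state terminates the execution.
            state["End"] = True
        states[step.name] = state
    return {"StartAt": steps[0].name, "States": states}


# Placeholder ARNs: REGION and ACCOUNT_ID are intentionally left unfilled.
pipeline = [
    Step("Preprocess", "arn:aws:lambda:REGION:ACCOUNT_ID:function:preprocess"),
    Step("Train", "arn:aws:lambda:REGION:ACCOUNT_ID:function:train"),
    Step("Evaluate", "arn:aws:lambda:REGION:ACCOUNT_ID:function:evaluate"),
]

definition = build_state_machine(pipeline)
print(json.dumps(definition, indent=2))
```

The pipeline author only lists the stages in order. The wiring between states, which is the part that grows verbose when edited by hand, is derived from that list.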

Results
Time saved: less than a minute
Running since: early 2019
Source

https://engineering.zalando.com/posts/2022/04/zalando-machine-learning-platform.html


Grounding & classification
Source type: technical build writeup
31 fields verified against source quotes.
forecasting, personalization, recommendation system, product catalog, metric backed, named customer, production runtime claimed, tools described, workflow described, ecommerce, employee productivity, throughput increase, technical build writeup, back office ops, ecommerce ops, data sync enrichment