
Trendyol builds an AI multi-agent oncall system that diagnoses production alerts in minutes

Trendyol's oncall engineers spent 30–60 minutes on manual investigation per production alert — checking logs, metrics dashboards, code, and infrastructure across multiple microservices — while the actual fix was often trivial.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Alert received in Slack
An alert arrives in a Slack channel and the AI system detects it and begins investigation.
Tools used
Elasticsearch, PostgreSQL, Kafka, Couchbase
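As a rough illustration of Stage 1, the detection step might be sketched as below. This is a minimal sketch under stated assumptions, not Trendyol's implementation: the channel ID, the keyword list, the `Investigation` record, and the in-memory queue are all hypothetical, and the event is a plain dictionary loosely modeled on a Slack message event (`type`, `channel`, `text`, `ts`). Only the four investigation areas come from the source's description of what engineers used to check by hand.

```python
from dataclasses import dataclass

# Hypothetical placeholders: neither the channel ID nor the keywords come from the source.
ALERT_CHANNEL = "C0ALERTS"
ALERT_KEYWORDS = ("firing", "alert", "critical")


@dataclass
class Investigation:
    """A unit of work handed to the investigating agents (hypothetical shape)."""
    channel: str
    thread_ts: str
    alert_text: str
    # Areas the source says engineers previously checked manually.
    areas: tuple = ("logs", "metrics", "code", "infrastructure")


def is_alert(event: dict) -> bool:
    """Return True if a Slack-style message event looks like a production alert."""
    if event.get("type") != "message":
        return False
    if event.get("channel") != ALERT_CHANNEL:
        return False
    text = event.get("text", "").lower()
    return any(keyword in text for keyword in ALERT_KEYWORDS)


def handle_event(event: dict, queue: list) -> bool:
    """Enqueue an investigation for alert messages; ignore everything else."""
    if not is_alert(event):
        return False
    queue.append(Investigation(event["channel"], event["ts"], event["text"]))
    return True
```

Keyword matching is used here only to keep the sketch short. A production system would more likely key on the alerting tool's bot identity or a structured payload, which the source summary does not specify.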
Outcome

Since deploying the Oncall Support Workspace, investigation time dropped from 30–60 minutes to minutes, oncall engineers report less stress and faster context acquisition, and known false positives are resolved instantly without waking anyone up.

Results
Time saved: from 30–60 minutes of manual investigation to structured root cause analysis in minutes
Source

https://medium.com/trendyol-tech/how-we-built-an-ai-powered-oncall-system-that-diagnoses-production-alerts-in-minutes-86386be0d4b8


Grounding & classification
Source type: technical build writeup
29 fields verified against source quotes, 1 dropped as unverifiable.
agentic workflow, ai agent, anomaly detection, multi agent workflow, rag, knowledge base, builder submitted, failure mode described, metric backed, named customer, production runtime claimed, tools described, workflow described, ecommerce, cycle time reduction, employee productivity, resolution time reduction, technical build writeup, incident management, it support, autonomous resolution, escalation workflow, extract classify route