
Autonomous observability at Pinterest: AI agents bridge fragmented logs, metrics, and traces via MCP

Pinterest's observability infrastructure predated OpenTelemetry, leaving logs, metrics, and traces in disconnected silos with no shared context or correlation. On-call engineers had to jump across multiple interfaces to root-cause incidents, and a steep per-tool learning curve compounded the time loss, especially for newer engineers.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Engineer submits alert link
An on-call engineer gives the Tricorder Agent an alert link or alert number to begin the investigation.
Tools used
MCP, Agent2Agent, LLMs, Tricorder Agent
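As a rough illustration of the Stage 1 input step, the sketch below accepts either a bare alert number or an alert link and returns the alert ID. The source does not describe Pinterest's alert URL format, so the function name `extract_alert_id` and the assumption that the ID is the last all-digit segment of the URL path are hypothetical.

```python
from urllib.parse import urlparse


def extract_alert_id(reference: str) -> str:
    """Return the alert ID from a bare number or an alert link.

    Assumption (not from the source): alert IDs are numeric, and in a link
    the ID is the last all-digit segment of the URL path.
    """
    text = reference.strip()

    # Case 1: the engineer pasted just the alert number.
    if text.isdigit():
        return text

    # Case 2: the engineer pasted a link; scan path segments from the end.
    parsed = urlparse(text)
    if parsed.scheme in ("http", "https"):
        for segment in reversed(parsed.path.split("/")):
            if segment.isdigit():
                return segment

    raise ValueError(f"no alert ID found in {reference!r}")
```

Normalizing both input forms to one ID up front means every later step of the investigation can work from a single identifier, whichever form the engineer supplied.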
Outcome

The Tricorder Agent, built on Pinterest's centralized MCP server, lets engineers submit an alert link and receive filtered dashboard links, root-cause hypotheses, and suggested next steps without switching between interfaces. The goal is to reduce MTTR and free engineers to focus on resolving incidents.

What failed first

When first building the agent on the MCP server, Pinterest found that letting the agent query data organically, with no limit on scope, caused it to exceed its context window and crash. Fixing this required new strategies to constrain query scope.
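The source does not say which strategies Pinterest adopted. One common way to keep tool output inside a context window is to clamp the query time range and cap how much data goes back to the agent. The sketch below shows that idea. All names and limits (`MAX_WINDOW`, `MAX_ROWS`, `MAX_CHARS`, `clamp_window`, `bound_rows`) are illustrative assumptions, and the character count is only a crude stand-in for a token budget.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical limits; real values would depend on the model's context size.
MAX_WINDOW = timedelta(hours=1)  # longest time range a single query may cover
MAX_ROWS = 50                    # most rows returned to the agent per query
MAX_CHARS = 4000                 # rough proxy for a token budget


@dataclass
class BoundedResult:
    rows: list[str]
    truncated: bool  # True when some rows were dropped to stay within limits


def clamp_window(start: datetime, end: datetime) -> tuple[datetime, datetime]:
    """Shrink an oversized time range to the most recent MAX_WINDOW."""
    if end < start:
        raise ValueError("end must not precede start")
    if end - start > MAX_WINDOW:
        start = end - MAX_WINDOW
    return start, end


def bound_rows(rows: list[str]) -> BoundedResult:
    """Keep leading rows until either the row cap or the size cap is hit."""
    kept: list[str] = []
    used = 0
    for row in rows:
        if len(kept) >= MAX_ROWS or used + len(row) > MAX_CHARS:
            return BoundedResult(kept, truncated=True)
        kept.append(row)
        used += len(row)
    return BoundedResult(kept, truncated=False)
```

Returning a `truncated` flag rather than cutting silently lets the agent know the result is partial, so it can narrow the query instead of reasoning over incomplete data.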

Results
Time saved: reducing mean time to resolution (MTTR)
Volume: 3 billion data points per minute
Source

https://medium.com/pinterest-engineering/autonomous-observability-at-pinterest-part-1-of-2-eb0adae830ba


Grounding & classification
Source type: technical build writeup
32 fields verified against source quotes.
Tags: agentic workflow, ai agent, anomaly detection, multi agent workflow, rag, knowledge base, builder submitted, failure mode described, named customer, production runtime claimed, source backed, tools described, workflow described, software, cycle time reduction, employee productivity, time saved, technical build writeup, incident management, it support, agentic task execution, monitor detect alert, rag answering