Autonomous observability at Pinterest: AI agents bridge fragmented logs, metrics, and traces via MCP
Pinterest's observability infrastructure predated OpenTelemetry, leaving logs, metrics, and traces in disconnected silos with no shared context or correlation. On-call engineers had to jump across multiple interfaces to root-cause incidents, and a steep per-tool learning curve compounded the time loss, especially for newer engineers.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Engineer submits alert link
An on-call engineer gives the Tricorder Agent an alert link or alert number to begin the investigation.
Tools used
MCP, Agent2Agent, LLMs, Tricorder Agent
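Because this stage accepts either a link or a bare number, an intake step has to reduce both forms to a single identifier before anything else runs. The following is a minimal sketch of that normalization, not Pinterest's implementation: the function name `parse_alert_reference`, the URL shape, and the rule "the alert number is the last all-digit path segment" are all assumptions made for illustration.

```python
from urllib.parse import urlparse


def parse_alert_reference(ref: str) -> str:
    """Return the alert number from either a bare number or an alert link.

    Hypothetical helper: the source does not describe the real link format,
    so this assumes the number is the last all-digit segment of the URL path.
    """
    ref = ref.strip()

    # Case 1: the engineer typed the alert number directly.
    if ref.isdigit():
        return ref

    # Case 2: the engineer pasted a link; look for a numeric path segment.
    parsed = urlparse(ref)
    if parsed.scheme in ("http", "https"):
        for segment in reversed(parsed.path.split("/")):
            if segment.isdigit():
                return segment

    raise ValueError(f"could not find an alert number in {ref!r}")
```

Rejecting unrecognized input with an error, rather than guessing, keeps a malformed reference from starting an investigation against the wrong alert.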
Outcome
The Tricorder Agent, built on Pinterest's centralized MCP server, lets engineers submit an alert link and receive filtered dashboard links, root-cause hypotheses, and suggested next steps without switching between interfaces. The goal is to reduce MTTR and free engineers to focus on resolving incidents.
What failed first
When first building the agent on the MCP server, Pinterest found that letting the agent query data organically, with no bound on query scope, caused it to exceed its context window and crash. The team had to adopt new strategies to constrain query scope.
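The source does not say which strategies Pinterest adopted. As one generic illustration of constraining query scope, the sketch below limits a query to a time window around the alert, caps the row count, and stops adding rows once an estimated token budget is reached. All names (`QueryScope`, `scope_for_alert`, `fit_to_budget`), the default limits, and the four-characters-per-token estimate are assumptions, not details from Pinterest's system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class QueryScope:
    """Hypothetical bounds applied to every telemetry query the agent issues."""
    start: datetime
    end: datetime
    max_rows: int


def scope_for_alert(fired_at: datetime,
                    half_window_minutes: int = 15,
                    max_rows: int = 200) -> QueryScope:
    """Restrict queries to a window centered on the time the alert fired."""
    half = timedelta(minutes=half_window_minutes)
    return QueryScope(fired_at - half, fired_at + half, max_rows)


def estimate_tokens(text: str) -> int:
    """Rough size estimate (assumed ~4 characters per token), not a tokenizer."""
    return max(1, len(text) // 4)


def fit_to_budget(rows: list[str],
                  scope: QueryScope,
                  token_budget: int) -> tuple[list[str], bool]:
    """Keep rows until the row cap or token budget is hit.

    Returns the kept rows and a flag saying whether anything was dropped,
    so the agent can be told the result is partial.
    """
    kept: list[str] = []
    used = 0
    for row in rows[: scope.max_rows]:
        cost = estimate_tokens(row)
        if used + cost > token_budget:
            return kept, True  # budget exhausted before all rows fit
        kept.append(row)
        used += cost
    return kept, len(rows) > scope.max_rows
```

Returning the truncation flag alongside the rows lets the caller pass "results were truncated" into the prompt, so the agent can narrow its next query instead of treating a partial result as complete.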