Pinterest engineers build a test harness to optimize AI agent skill invocation rates

Pinterest engineers found that AI agents inconsistently invoked a custom iOS architecture skill (rx-mvvm), with baseline overall accuracy of only 73% for Codex and 62% for Claude Code—deemed unacceptable for critical engineering workflows.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack.

Stage 1 · Engineer submits prompt

Engineers issue prompts—often terse or ambiguous—to the agent, initiating skill invocation.

Tools used

Claude Code, Pin-agent, GPT 5.2-codex, Bash

Outcome

By applying optimized frontmatter descriptions, aggressive directive language, and AGENTS.md skill tables, the team dramatically improved skill invocation rates on both agents, with gains much greater for Codex than for Claude Code.
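As an illustration of the AGENTS.md skill table mentioned above, the following is a minimal sketch that renders such a table as Markdown. The column names, the render_skill_table helper, and the directive wording are all assumptions made for this example; the source does not show the table Pinterest actually used. Only the skill name rx-mvvm comes from the writeup.

```python
def render_skill_table(skills):
    """Render (skill name, trigger text) pairs as a Markdown table.

    The two-column layout is a hypothetical format, not Pinterest's.
    """
    lines = ["| Skill | When to invoke |", "| --- | --- |"]
    for name, trigger in skills:
        lines.append(f"| `{name}` | {trigger} |")
    return "\n".join(lines)


# Hypothetical directive wording, written in the "aggressive" style
# the writeup describes; the exact phrasing is an assumption.
table = render_skill_table(
    [("rx-mvvm", "ALWAYS invoke before writing or changing iOS architecture code.")]
)
print(table)
```

The resulting text could be pasted into an AGENTS.md file so the agent sees the skill and its trigger condition on every run.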

What failed first

Initial 'vanilla' testing showed neither agent could guarantee 100% skill invocation, especially with terse or ambiguous prompts.
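The kind of measurement behind this finding can be sketched as a small tally of invocation rates per agent and prompt style. This is a minimal sketch under stated assumptions, not Pinterest's harness: the TrialResult type, the prompt-style labels, and the sample data are hypothetical, and the step that actually runs an agent and detects whether it invoked the skill is left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TrialResult:
    agent: str           # e.g. "codex" or "claude-code"
    prompt_style: str    # hypothetical label such as "terse" or "explicit"
    invoked_skill: bool  # True if the agent invoked the target skill


def invocation_rates(results):
    """Return {(agent, prompt_style): fraction of trials that invoked the skill}."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in results:
        key = (r.agent, r.prompt_style)
        totals[key] += 1
        hits[key] += int(r.invoked_skill)
    return {key: hits[key] / totals[key] for key in totals}


# Illustrative data only; these are not results from the source.
results = [
    TrialResult("codex", "terse", True),
    TrialResult("codex", "terse", False),
    TrialResult("codex", "explicit", True),
    TrialResult("claude-code", "terse", False),
]
rates = invocation_rates(results)
for key in sorted(rates):
    print(key, rates[key])
```

Grouping by prompt style makes it easy to see whether terse or ambiguous prompts are where invocation breaks down, which is the pattern the initial testing reported.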

Results

Baseline overall accuracy, Codex: 73%

Baseline overall accuracy, Claude Code: 62%

Source

https://medium.com/pinterest-engineering/an-engineers-guide-to-better-ai-skills-implementing-a-testing-process-to-optimize-agent-a000c9c9abcd

Grounding & classification

Source type: technical build writeup

21 fields verified against source quotes.

Tags: agentic workflow, code generation, code diff pr, failure mode described, metric backed, named customer, production runtime claimed, tools described, workflow described, software, accuracy improvement, employee productivity, technical build writeup, quality assurance, agentic task execution