Shopify builds production-ready agentic systems for Sidekick with JIT instructions, LLM evaluation, and GRPO training
As Sidekick's tool inventory grew beyond 50 specialized capabilities, the system prompt became an unwieldy collection of special cases that was nearly impossible to maintain. At the same time, traditional software testing approaches fell short for evaluating the probabilistic, multi-step behavior of LLM-based agents.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Merchant natural language request
A merchant provides natural language input that initiates the agentic loop.
Tools used
Sidekick, GRPO
Outcome
After implementing JIT instructions, improved LLM judges, and reward-hacking fixes, syntax validation accuracy improved from ~93% to ~99%, LLM judge correlation improved from 0.66 to 0.75 on average, and end-to-end conversation quality matched the supervised fine-tuning baseline.
What failed first
Vibe-testing with simple 0-to-10 LLM judges yielded near-random evaluation quality (Cohen's Kappa of 0.02), and GRPO training produced significant reward hacking (opt-out behavior, tag misuse, and schema violations) that undermined model improvements.
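To make the Cohen's Kappa figure concrete, here is a minimal sketch of how agreement between an LLM judge and human labels can be measured. The function name `cohens_kappa` and the sample labels are hypothetical, chosen only for illustration; they are not Shopify's code or data. Kappa compares observed agreement with the agreement two raters would reach by chance, so a value near 0 (such as the 0.02 reported above) means the judge agrees with humans about as often as random labelling would.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")

    n = len(rater_a)

    # Observed agreement: fraction of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal frequencies, summed over all labels.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label for everything; kappa is
        # undefined by the formula, so treat it as perfect agreement.
        return 1.0

    return (observed - expected) / (1 - expected)


# Hypothetical labels: the judge matches the human on only half of the items,
# which is exactly what chance would produce for two balanced raters.
human_labels = ["good", "bad", "good", "bad"]
judge_labels = ["good", "good", "bad", "bad"]

print(cohens_kappa(human_labels, judge_labels))  # chance-level agreement
print(cohens_kappa(human_labels, human_labels))  # perfect agreement
```

In this toy case the judge's raw agreement rate of 50% looks tolerable in isolation, but kappa shows it is no better than chance, which is why a chance-corrected statistic is a more useful check on a judge than a plain match rate.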