ecommerce_ops · ecommerce · workflow

Shopify builds production-ready agentic systems for Sidekick with JIT instructions, LLM evaluation, and GRPO training

As Sidekick's tool inventory grew beyond 50 specialized capabilities, the system prompt became an unwieldy collection of special cases nearly impossible to maintain, and traditional software testing approaches fell short for evaluating the probabilistic, multi-step nature of LLM-based agents.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases — not tied to any one vendor's stack. Click any stage to read what happens there. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Merchant natural language request

A merchant provides natural language input that initiates the agentic loop.

Tools used

SidekickGRPO

Outcome

After implementing JIT instructions, improved LLM judges, and reward-hacking fixes, syntax validation accuracy improved from ~93% to ~99%, LLM judge correlation improved from 0.66 to 0.75 on average, and end-to-end conversation quality matched the supervised fine-tuning baseline.

What failed first

Vibe-testing with simple 0-to-10 LLM judges yielded near-random evaluation quality (Cohen's Kappa of 0.02), and GRPO training produced significant reward hacking — opt-out behavior, tag misuse, and schema violations — that undermined model improvements.

Results

Volume~99%

Source

https://shopify.engineering/building-production-ready-agentic-systems?utm_source=substack&utm_medium=email

How we source this →

Grounding & classification

Source type: technical build writeup

29 fields verified against source quotes.

agentic workflowai agentcontent generationconversational aiknowledge basefailure mode describedmetric backednamed customerproduction runtime claimedtools describedworkflow describedecommercesoftwareaccuracy improvementemployee productivitytechnical build writeupback office opsecommerce opsagentic task executionautonomous resolution