One Line of Code, 41% Better Memory: When One AI Agent Optimizes Another
Coding agents lose all context between sessions. Lerim's memory extraction and deduplication quality was uncertain: there was room to improve, but it was unclear which parts of the system needed it.
How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases, not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.
Stage 1 · Setup optimization harness
The author pointed Claude Code at Lerim's codebase with an eval harness and golden dataset and told it to optimize.
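As a rough illustration of what "an eval harness and golden dataset" can look like, here is a minimal sketch. It is not Lerim's actual harness: the names `GoldenCase`, `run_harness`, `jaccard`, and `toy_extract` are hypothetical, and the scoring rule (set overlap averaged over cases) is an assumption chosen for brevity.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class GoldenCase:
    """One hand-labelled example (hypothetical shape): a session
    transcript and the memories a good extractor should produce."""
    transcript: str
    expected: frozenset[str]


def jaccard(predicted: set[str], expected: set[str]) -> float:
    """Overlap score in [0, 1]; two empty sets count as a perfect match."""
    union = predicted | expected
    return len(predicted & expected) / len(union) if union else 1.0


def run_harness(
    extract: Callable[[str], Iterable[str]],
    cases: list[GoldenCase],
    score: Callable[[set[str], set[str]], float] = jaccard,
) -> float:
    """Run the system under test on every golden case and average the scores."""
    if not cases:
        raise ValueError("golden dataset is empty")
    total = 0.0
    for case in cases:
        predicted = set(extract(case.transcript))
        total += score(predicted, set(case.expected))
    return total / len(cases)


def toy_extract(transcript: str) -> list[str]:
    """Stand-in for the real extractor: one 'memory' per non-empty line."""
    return [line.strip() for line in transcript.splitlines() if line.strip()]
```

An optimizing agent would then edit the system behind `extract`, rerun `run_harness`, and keep only the changes that raise the averaged score. That loop is only as good as the `score` function it is given.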
Tools used
Claude Code, Lerim, DSPy, Pydantic, MiniMax M2.5, AutoResearch
Outcome
Round 1 achieved a 41% improvement in composite quality score, with dedup accuracy rising from 0.28 to 0.72 and maintain improving by 29% as a cascade effect. Round 2 added a further 3.4% extraction quality improvement by teaching the LLM explicit quality criteria.
What failed first
The initial evaluation harness measured the wrong thing: it rewarded recall without penalizing over-extraction, so the memory store accumulated low-value entries despite high eval scores.
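The failure mode is easy to reproduce with a toy metric. The sketch below is an assumption about the general shape of the problem, not the author's actual scoring code or fix: it compares a recall-only score with F1, a standard metric that also accounts for precision. All function names are illustrative.

```python
def recall(predicted: set[str], expected: set[str]) -> float:
    """Fraction of expected memories that were extracted."""
    if not expected:
        return 1.0
    return len(predicted & expected) / len(expected)


def precision(predicted: set[str], expected: set[str]) -> float:
    """Fraction of extracted memories that were actually expected."""
    if not predicted:
        return 1.0 if not expected else 0.0
    return len(predicted & expected) / len(predicted)


def f1(predicted: set[str], expected: set[str]) -> float:
    """Harmonic mean of precision and recall; drops when either is low."""
    p = precision(predicted, expected)
    r = recall(predicted, expected)
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)


expected = {"a", "b"}
lean = {"a", "b"}                                  # extracts exactly what matters
noisy = {"a", "b", "c", "d", "e", "f", "g", "h"}   # over-extracts six extra entries

print(recall(lean, expected), recall(noisy, expected))  # both 1.0
print(f1(lean, expected), f1(noisy, expected))          # 1.0 vs 0.4
```

Under recall alone, the noisy extractor is indistinguishable from the lean one, so an optimizer has no reason to stop adding entries. A score that includes precision makes the extra entries costly.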
Results
Volume: 41%
Grounding & classification
Source type: technical build writeup
31 fields verified against source quotes.
agentic workflow, ai agent, data extraction, multi agent workflow, knowledge base, builder submitted, failure mode described, metric backed, production runtime claimed, tools described, workflow described, software, accuracy improvement, error reduction, technical build writeup, agentic task execution