
Factory's anchored iterative summarization outperforms OpenAI and Anthropic context compression strategies for long-running AI agent sessions

Long-running AI agent sessions generate millions of tokens, far more than any model's context window can hold. Naive, aggressive compression causes agents to forget critical details such as file paths, error messages, and past decisions, so they waste tokens re-reading files and re-exploring dead ends.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools used” below.

Stage 1 · Context limit reached

When a long-running agent session generates millions of tokens exceeding the model's context window, compression is triggered.
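The trigger described above can be sketched as a simple budget check. This is a minimal illustration, not any vendor's implementation: the `Session` class, the `trigger_ratio` threshold, and the four-characters-per-token estimate are all assumptions introduced here.

```python
from dataclasses import dataclass, field


def estimate_tokens(text: str) -> int:
    """Rough heuristic of about 4 characters per token (an assumption, not a real tokenizer)."""
    return max(1, len(text) // 4)


@dataclass
class Session:
    """Hypothetical agent session that tracks how full the context window is."""
    context_limit: int              # model context window, in tokens
    trigger_ratio: float = 0.8      # hypothetical: compress at 80% of the window
    messages: list[str] = field(default_factory=list)

    def used_tokens(self) -> int:
        # Sum the estimated token cost of every message kept in context.
        return sum(estimate_tokens(m) for m in self.messages)

    def needs_compression(self) -> bool:
        # Compression is triggered once usage reaches the threshold.
        return self.used_tokens() >= self.context_limit * self.trigger_ratio


session = Session(context_limit=100)
session.messages.append("x" * 200)      # about 50 tokens, below the threshold
below = session.needs_compression()
session.messages.append("x" * 200)      # about 100 tokens, above the threshold
above = session.needs_compression()
```

Triggering somewhat below the hard limit leaves room for the compression request itself. A real system would count tokens with the model's own tokenizer.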

Tools used

GPT-5.2, Claude SDK, /responses/compact

Outcome

Factory's structured summarization scores 0.35 points higher than OpenAI and 0.26 higher than Anthropic overall, with accuracy showing the largest gap (Factory 4.04), while maintaining comparable compression efficiency (98.6% vs OpenAI's 99.3%).
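To make the compression figures concrete, the arithmetic below shows how many tokens each rate leaves behind. It assumes that "compression efficiency" means the percentage of tokens removed, which this summary does not define explicitly. The one-million-token session size is hypothetical.

```python
def tokens_retained(session_tokens: int, compression_pct: float) -> int:
    """Tokens left after removing `compression_pct` percent of the session."""
    return round(session_tokens * (100.0 - compression_pct) / 100.0)


session_tokens = 1_000_000                               # hypothetical session size
factory_kept = tokens_retained(session_tokens, 98.6)     # 14000
openai_kept = tokens_retained(session_tokens, 99.3)      # 7000
```

Under that assumption, a 0.7-point difference in compression rate means the Factory summary keeps about twice as many tokens as the OpenAI one, while both remain a small fraction of the original session.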

What failed first

Generic summarization treats all content as equally compressible and silently drops file paths and decisions. Traditional metrics such as ROUGE or embedding similarity fail to capture whether an agent can actually continue working after compression.
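One way to see the gap that overlap metrics miss is to check directly whether task-critical details survive compression. The sketch below measures file-path retention only. It is not the evaluation method used in the source, and the regular expression and function names are hypothetical.

```python
import re

# Hypothetical pattern: slash-separated paths that end in a file extension.
PATH_RE = re.compile(r"(?:[\w.-]+/)+[\w.-]+\.\w+")


def extract_paths(text: str) -> set[str]:
    """Collect the distinct file paths mentioned in a piece of text."""
    return set(PATH_RE.findall(text))


def path_retention(transcript: str, summary: str) -> float:
    """Fraction of file paths in the transcript that also appear in the summary."""
    original = extract_paths(transcript)
    if not original:
        return 1.0  # nothing to lose
    kept = {p for p in original if p in summary}
    return len(kept) / len(original)


transcript = (
    "Edited src/auth/login.py and tests/test_login.py; "
    "error in src/db/pool.py."
)
generic = "Fixed a login bug and found a database error."
structured = "Fixed login bug in src/auth/login.py; pool error remains in src/db/pool.py."

generic_score = path_retention(transcript, generic)        # 0.0
structured_score = path_retention(transcript, structured)  # 2 of 3 paths kept
```

Both summaries read as plausible descriptions of the session, but only the second tells the agent where to resume work without re-reading files.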

Results

Volume: 3.70

Source

https://factory.ai/news/evaluating-compression


Grounding & classification

Source type: technical build writeup

35 fields verified against source quotes.
