Chroma context rot study: LLM performance degrades non-uniformly with increasing context length

The Needle in a Haystack (NIAH) benchmark is widely used to assert that LLMs handle long contexts reliably, but it only tests narrow lexical retrieval and does not reflect real-world tasks requiring semantic understanding, ambiguity resolution, or distractor handling. Models are consequently assumed to perform uniformly across long-context scenarios when they do not.

How it works

Common implementation structure

How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack. Specific products that implement these stages appear in “Tools commonly seen” below.

Stage 1 · Controlled experiment design

Experiments are designed to hold task complexity constant while varying only input length, directly isolating context length as the variable of interest.
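A minimal sketch of that design, assuming a needle-in-a-haystack style task: the needle and question stay fixed while only the amount of filler text changes. The names `build_haystack`, `make_length_sweep`, `FILLER`, `NEEDLE`, and `QUESTION` are hypothetical and the filler text is invented for illustration; none of this comes from the study's own code.

```python
import random

# Hypothetical filler, needle, and question, for illustration only.
FILLER = [
    "The committee met on Tuesday to review the quarterly budget.",
    "Rainfall in the valley was lower than expected this spring.",
    "The library extended its opening hours during exam season.",
    "A new bus route now connects the harbor to the old town.",
]
NEEDLE = "The best writing advice I received was to write every week."
QUESTION = "What was the best writing advice the author received?"


def build_haystack(needle, filler_sentences, target_words, depth=0.5, seed=0):
    """Build roughly target_words words of filler and insert the needle at a relative depth."""
    rng = random.Random(seed)
    sentences = []
    word_count = 0
    while word_count < target_words:
        sentence = rng.choice(filler_sentences)
        sentences.append(sentence)
        word_count += len(sentence.split())
    # depth=0.0 puts the needle first, depth=1.0 puts it last.
    sentences.insert(int(len(sentences) * depth), needle)
    return " ".join(sentences)


def make_length_sweep(lengths=(50, 200, 800)):
    """Same needle and question at every length; only the filler volume varies."""
    return [
        {
            "target_words": n,
            "prompt": build_haystack(NEEDLE, FILLER, n) + "\n\nQuestion: " + QUESTION,
        }
        for n in lengths
    ]
```

Each prompt from `make_length_sweep()` would then go to the model under test and be scored against the same expected answer, so any change in accuracy can be attributed to input length and not to a harder question.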

Tools used

text-embedding-3-small, text-embedding-3-large, jina-embeddings-v3, voyage-3-large, all-MiniLM-L6-v2, LongMemEval, vector database, GPT-4.1, Llama 4, Claude Opus 4, Claude Sonnet 4, GPT-3.5 turbo

Outcome

Across all 18 LLMs and experiments, model performance consistently degrades with increasing input length in non-uniform ways. Distractors amplify degradation as context grows; shuffled haystacks outperform structurally coherent ones; and models perform better when the needle is semantically distinct from the haystack.
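The haystack conditions mentioned above (coherent versus shuffled, with or without distractors) can be sketched as variants built from one shared set of sentences. This is an illustrative assumption about how such conditions might be constructed, not the study's implementation; `insert_at_depth`, `make_conditions`, and the sample sentences are hypothetical.

```python
import random

# Hypothetical sample data, for illustration only.
SENTENCES = [
    "The essay opens with a memory of a childhood classroom.",
    "It then describes the first draft the author ever finished.",
    "A teacher's comments on that draft are quoted at length.",
    "The closing section reflects on habits formed since then.",
]
NEEDLE = "The best writing advice I received was to write every week."
DISTRACTORS = [
    "The best cooking advice I received was to taste as I go.",
    "The worst writing advice I received was to wait for inspiration.",
]


def insert_at_depth(sentences, item, depth):
    """Return a copy of sentences with item inserted at a relative depth."""
    out = list(sentences)
    out.insert(int(len(out) * depth), item)
    return out


def make_conditions(sentences, needle, distractors, depth=0.5, seed=0):
    """Build coherent, shuffled, and distractor variants around the same needle."""
    rng = random.Random(seed)
    coherent = insert_at_depth(sentences, needle, depth)

    # Shuffle only the haystack so the needle keeps the same position.
    shuffled_body = list(sentences)
    rng.shuffle(shuffled_body)
    shuffled = insert_at_depth(shuffled_body, needle, depth)

    # Distractors resemble the needle but do not answer the question.
    with_distractors = list(coherent)
    for distractor in distractors:
        with_distractors.insert(rng.randrange(len(with_distractors) + 1), distractor)

    return {
        "coherent": coherent,
        "shuffled": shuffled,
        "with_distractors": with_distractors,
    }
```

Keeping the needle at the same depth in the coherent and shuffled variants means a score difference between them reflects haystack structure and not needle position.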

What failed first

NIAH produced consistently high scores across all major models, which led to the widely held perception that long-context handling was largely a solved problem. In fact, the benchmark measured only a narrow lexical retrieval capability.

Results

Volume: 18

Source

https://research.trychroma.com/context-rot

Grounding & classification

Source type: technical build writeup

27 fields verified against source quotes, 4 dropped as unverifiable.

rag, knowledge base, human review described, metric backed, source backed, tools described, workflow described, software, technical build writeup, quality assurance