
Chroma context rot study: LLM performance degrades non-uniformly with increasing context length

The Needle in a Haystack (NIAH) benchmark is widely used to assert that LLMs handle long contexts reliably, but it only tests narrow lexical retrieval and does not reflect real-world tasks requiring semantic understanding, ambiguity resolution, or distractor handling. Models are consequently assumed to perform uniformly across long-context scenarios when they do not.

How it works
Common implementation structure
How this type of workflow is generally built, generalized across documented cases and not tied to any one vendor's stack.
Stage 1 · Controlled experiment design
Experiments are designed to hold task complexity constant while varying only input length, directly isolating context length as the variable of interest.
Tools used
text-embedding-3-small, text-embedding-3-large, jina-embeddings-v3, voyage-3-large, all-MiniLM-L6-v2, LongMemEval, vector database, GPT-4.1, Llama 4, Claude Opus 4, Claude Sonnet 4, GPT-3.5 turbo
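
As an illustration of the Stage 1 design, here is a minimal sketch of building needle-in-a-haystack prompts in which the needle, question, and needle depth stay fixed while only the haystack size varies. This is not the study's code: `build_prompt`, the filler sentences, the needle, and the use of word count as a stand-in for token count are all assumptions made for the example.

```python
import random


def build_prompt(needle, filler_sentences, target_words, depth, question, seed=0):
    """Build one prompt whose haystack holds roughly target_words words.

    depth is the relative needle position: 0.0 = start, 1.0 = end.
    Word count is used as a rough proxy for token count (an assumption).
    """
    rng = random.Random(seed)
    haystack = []
    word_count = 0
    # Sample filler sentences until the haystack reaches the target size.
    while word_count < target_words:
        sentence = rng.choice(filler_sentences)
        haystack.append(sentence)
        word_count += len(sentence.split())
    # Place the needle at the requested relative depth.
    haystack.insert(int(depth * len(haystack)), needle)
    return " ".join(haystack) + "\n\nQuestion: " + question


# Hypothetical filler, needle, and question, invented for this sketch.
FILLER = [
    "The committee met on Tuesday to review the budget.",
    "Rainfall was above average in the northern region.",
    "The library extended its opening hours for the summer.",
]
NEEDLE = "The access code for the archive room is 7421."
QUESTION = "What is the access code for the archive room?"

# Same needle, question, and depth at every size: only input length changes.
prompts = {
    n: build_prompt(NEEDLE, FILLER, n, depth=0.5, question=QUESTION)
    for n in (100, 1000, 10000)
}
```

Each prompt in `prompts` would then be sent to the model under test and scored against the known answer, so that any change in accuracy can be attributed to input length rather than to the task itself.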
Outcome

Across all 18 LLMs and experiments, model performance consistently degrades with increasing input length in non-uniform ways. Distractors amplify degradation as context grows; shuffled haystacks outperform structurally coherent ones; and models perform better when the needle is semantically distinct from the haystack.
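
The conditions named above (distractors, shuffled haystacks) and a per-length accuracy summary could be set up roughly as follows. This is a hedged sketch, not the study's implementation: `with_distractors`, `shuffled`, and `accuracy_by_length` are hypothetical helpers, and the `records` format (pairs of input length and a correctness flag) is an assumption.

```python
import random


def with_distractors(sentences, distractors, seed=0):
    """Return a copy of the haystack with each distractor at a random position."""
    rng = random.Random(seed)
    result = list(sentences)
    for distractor in distractors:
        result.insert(rng.randrange(len(result) + 1), distractor)
    return result


def shuffled(sentences, seed=0):
    """Return a copy of the haystack with sentence order randomized."""
    rng = random.Random(seed)
    result = list(sentences)
    rng.shuffle(result)
    return result


def accuracy_by_length(records):
    """Aggregate (input_length, is_correct) pairs into {input_length: accuracy}."""
    totals = {}
    for length, is_correct in records:
        hits, count = totals.get(length, (0, 0))
        totals[length] = (hits + int(is_correct), count + 1)
    return {length: hits / count for length, (hits, count) in sorted(totals.items())}
```

Comparing the output of `accuracy_by_length` across the coherent, shuffled, and distractor conditions is one way to see whether degradation is uniform across lengths or not.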

What failed first

NIAH produced consistently high scores across all major models, which led to the widely held perception that long-context handling was largely a solved problem. In fact, the benchmark measured only a narrow lexical retrieval capability.

Results
Volume: 18
Source

https://research.trychroma.com/context-rot


Grounding & classification
Source type: technical build writeup
27 fields verified against source quotes, 4 dropped as unverifiable.
rag, knowledge base, human review described, metric backed, source backed, tools described, workflow described, software, technical build writeup, quality assurance