Tech blog
March 07, 2025

Long-Context Embedding Models are Blind Beyond 4K Tokens

We investigate embedding models on new "needle-in-haystack" tasks and find that beyond 4K tokens, they're just rolling dice - even with exact lexical matches or query expansion, they can't tell signal from noise in long context.
Saahil Ognawala, Alex C-G • 14 minutes read

In February 2025, a team of AI researchers published the NoLiMa paper, which introduces a novel benchmark for evaluating large language models' ability to handle long contexts.

NoLiMa: Long-Context Evaluation Beyond Literal Matching
Recent large language models (LLMs) support long contexts ranging from 128K to 1M tokens. A popular method for evaluating these capabilities is the needle-in-a-haystack (NIAH) test, which involves retrieving a “needle” (relevant information) from a “haystack” (long irrelevant context). Extensions of this approach include increasing distractors, fact chaining, and in-context reasoning. However, in these benchmarks, models can exploit existing literal matches between the needle and haystack to simplify the task. To address this, we introduce NoLiMa, a benchmark extending NIAH with a carefully designed needle set, where questions and needles have minimal lexical overlap, requiring models to infer latent associations to locate the needle within the haystack. We evaluate 12 popular LLMs that claim to support contexts of at least 128K tokens. While they perform well in short contexts (<1K), performance degrades significantly as context length increases. At 32K, for instance, 10 models drop below 50% of their strong short-length baselines. Even GPT-4o, one of the top-performing exceptions, experiences a reduction from an almost-perfect baseline of 99.3% to 69.7%. Our analysis suggests these declines stem from the increased difficulty the attention mechanism faces in longer contexts when literal matches are absent, making it harder to retrieve relevant information.
arXiv.org · Ali Modarressi et al.

This paper introduces a significant change to the traditional Needle-in-a-Haystack (NIAH) benchmark by removing literal matches between questions and the needle (relevant information) hidden in the haystack (irrelevant text).

Figure: Traditional NIAH vs. NoLiMa approaches to question answering.
For example, in traditional NIAH, if the question is "What year did John visit Paris?", the needle might directly contain "John visited Paris in 2019." In NoLiMa, the question might be "Which character has been to Dresden?" while the needle contains "Actually, Yuki lives next to the Semper Opera House" - requiring the model to know that the Semper Opera House is in Dresden in order to make the connection.

The paper highlights a critical limitation of current LLMs: they rely heavily on surface-level pattern matching, and their ability to perform deep associative reasoning deteriorates rapidly as context length increases.

Building on these insights, we aim to investigate whether similar performance patterns occur in embedding models, specifically focusing on jina-embeddings-v3. Since the effectiveness of RAG systems depends critically on the quality of retrieval models, we seek to extend NoLiMa's research through controlled experiments addressing two core questions:

  • How do embedding models handle needle-in-a-haystack retrieval across different context lengths when forced to make semantic leaps beyond literal keyword matches?
  • Can strategic query augmentation with semantically similar content mitigate this performance gap?

The stark contrast observed in LLMs—robust with lexical matching but vulnerable with semantic variations—suggests embedding-based retrieval systems might face similar challenges when moving beyond surface-level term matching, potentially revealing fundamental limitations in current semantic search technologies.

Needles and Haystacks Construction

Needles Construction

Traditional needle-in-haystack tests use needles that reflect the wording of the question being searched for. For example:

  • Question: “Which character has been to Dresden?”
  • Needle: “Yuki lives in Dresden.”

But like NoLiMa, we want to test for semantic understanding rather than mere keyword matching, so we create one-hop variations (using words deliberately absent from the documents) with two different word orderings:

  • Question: “Which character has been to Dresden?”
  • Needle (default): “Actually, Yuki lives next to the Semper Opera House.”
  • Needle (inverted): “The Semper Opera House is next to where Yuki lives.”
💡
The Semper Opera House is in Dresden, providing the context for this one-hop needle.

Following the paper’s methodology, we generate these needle-question groups (consisting of one question, one one-hop needle, and one inverted one-hop needle) across several categories, like the examples below:

| Category | Question | Original needle (for reference) | One-hop needle | Inverted one-hop needle |
|---|---|---|---|---|
| Dietary restrictions | Which character cannot eat fish-based meals? | Alice cannot eat fish-based meals. | Then, Alice mentioned being vegan for years. | Being vegan was important to Alice for years. |
| Medical conditions | Which character cannot drink milk? | Bob can't drink milk. | Bob explained he was lactose intolerant. | Being lactose intolerant affected Bob daily. |
| Language proficiency | Which character speaks French? | Charlie speaks French. | Actually, Charlie studied at the Sorbonne. | At the Sorbonne, Charlie completed his degree. |
| Professional background | Which character is a musician? | Diane is a musician. | In 2013, Diane conducted at the Sydney Opera House. | The Sydney Opera House performance was conducted by Diane. |
💡
The names above are just for reference. In the actual needles they are randomly pulled from a list of culturally-diverse names.

Note that the original needles (literal keyword matches) are provided for reference, and not used in our experiments.
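To make the structure concrete, here is one way such a needle-question group could be represented in code; the field names and the name placeholder are illustrative, not the actual dataset schema:

```python
# Illustrative representation of one needle-question group; field names are
# ours, not the actual dataset schema. Character names are filled in later
# from a list of culturally diverse names.
needle_group = {
    "category": "Language proficiency",
    "question": "Which character speaks French?",
    "one_hop_needle": "Actually, {name} studied at the Sorbonne.",
    "inverted_one_hop_needle": "At the Sorbonne, {name} completed his degree.",
}

needle = needle_group["one_hop_needle"].format(name="Charlie")
# -> "Actually, Charlie studied at the Sorbonne."
```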

Haystacks Construction

We started with ten public domain books, each containing at least 50,000 tokens, randomly concatenating short snippets (under 250 tokens) from them into haystacks of varying lengths, namely 128, 256, 512, 1024, 2048, 4096, and 8192 tokens. We then embedded one needle into each haystack:

Figure 1: Haystack construction from short snippets of books and a single needle per haystack.
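A rough sketch of this construction might look like the following; the helper below is a simplification (with a placeholder whitespace tokenizer rather than the model's actual tokenizer), not our exact pipeline code:

```python
import random

def build_haystack(snippets, needle, context_length, needle_position, tokenize, detokenize):
    """Concatenate random snippets up to `context_length` tokens and splice
    the needle in at token offset `needle_position`.

    `tokenize`/`detokenize` are placeholders for whatever tokenizer the
    embedding model uses; this sketch only illustrates the procedure.
    """
    tokens = []
    while len(tokens) < context_length:
        snippet = random.choice(snippets)          # each snippet is < 250 tokens
        tokens.extend(tokenize(snippet))
    tokens = tokens[:context_length]

    needle_tokens = tokenize(needle)
    position = min(needle_position, len(tokens))
    tokens = tokens[:position] + needle_tokens + tokens[position:]
    return detokenize(tokens)

# Example usage with a trivial whitespace "tokenizer" (illustration only):
snippets = ["Some short passage from a public domain book."] * 100
haystack = build_haystack(
    snippets,
    needle="Actually, Yuki lives next to the Semper Opera House.",
    context_length=128,
    needle_position=50,
    tokenize=str.split,
    detokenize=" ".join,
)
```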

For a more concrete example, we’ll take the needle “Actually, Yuki lives next to the Semper Opera House” and put it into a 128-token haystack at position 50:

Figure 2: A needle in a haystack example.

Using jina-embeddings-v3 to embed the texts, the similarity score between the question text and the haystack text is:

Question-Haystack similarity = 0.2391

We then normalize the score by dividing this number by the similarity score of the question and default needle (no haystack creation, just direct comparison):

Question-Needle similarity = 0.3598
Normalized Question-Haystack similarity = 0.2391 / 0.3598 = 0.6644

This normalization is necessary because different models produce different absolute similarity scores for the same pair of texts, and jina-embeddings-v3 tends to underestimate the similarity between two texts.
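In code, the normalization is just a ratio of two cosine similarities. Below is a minimal sketch with a hypothetical embed() function standing in for the embedding model; the numbers in the comments correspond to the example above:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def normalized_similarity(question, haystack, needle, embed):
    """Divide question-haystack similarity by question-needle similarity.

    `embed` is a placeholder for the embedding model (e.g. jina-embeddings-v3);
    it should map a string to a 1-D vector.
    """
    q, h, n = embed(question), embed(haystack), embed(needle)
    question_haystack = cosine(q, h)  # e.g. 0.2391 in the example above
    question_needle = cosine(q, n)    # e.g. 0.3598 in the example above
    return question_haystack / question_needle  # e.g. 0.6644
```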

For each needle (both the default and inverted variants) we generated ten haystacks per context length, embedding one needle per haystack at a different location. For a given needle and context length, the haystacks would look something like this:

Figure 3: Needles placed at regular intervals throughout ten haystacks.

As a control, we also generated one haystack for each test condition without any needle, for a total of 3,234 haystacks. We encoded each haystack with jina-embeddings-v3 (using the default text-matching LoRA), truncating any haystack that exceeded 8,192 tokens (the model's limit), and then encoded its corresponding question.
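For reference, a minimal sketch of the encoding step, assuming the jinaai/jina-embeddings-v3 checkpoint's remote code exposes an encode(..., task=...) helper (check the model card for the exact interface):

```python
from transformers import AutoModel

# Assumes the model's remote code provides `encode(texts, task=...)` for
# selecting the LoRA adapter; verify against the model card.
model = AutoModel.from_pretrained("jinaai/jina-embeddings-v3", trust_remote_code=True)

# Placeholder inputs; real haystacks are built as described above and
# truncated to 8,192 tokens (the model's limit) beforehand.
haystacks = ["...one haystack text per test condition..."]
questions = ["Which character has been to Dresden?"]

haystack_embeddings = model.encode(haystacks, task="text-matching")
question_embeddings = model.encode(questions, task="text-matching")
```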

Evaluation Metrics

Our evaluation framework uses several metrics to assess embedding model performance across different context lengths:

Primary Metrics

Normalized Similarity Score
The core metric is a normalized similarity score that accounts for both semantic similarity between the question and the entire context (question-haystack similarity), and baseline similarity between the question and its corresponding default needle (question-needle similarity). This normalization ensures that the model's performance is evaluated relative to a meaningful reference point rather than absolute similarity scores alone. The normalization process involves computing the direct cosine similarity score between questions and their corresponding needles (our baseline), and dividing the question-haystack similarity by this baseline score:

$$\text{Normalized Similarity} = \frac{\cos(q, h)}{\cos(q, n)}$$

Comparative Ratio to Random Chance
For any embedding model, cosine similarity scores between different query-document pairs are only directly comparable when the query remains the same. Therefore, beyond using normalized similarity scores, we also measure how often the question is more similar to the entire haystack than to a random passage of the same length without a needle.
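A sketch of how this comparative ratio can be computed (our reading of the metric, with a placeholder similarity() function standing in for embedding-based cosine similarity):

```python
def comparative_ratio(pairs, similarity):
    """`pairs` is a list of (question, haystack_with_needle, control_haystack)
    tuples, where the control is a needle-free passage of the same length.
    `similarity` is a placeholder scoring function, e.g. cosine similarity of
    embeddings. Returns the fraction of cases where the needle-bearing
    haystack scores higher than the control; a value near 0.5 means the model
    is effectively guessing.
    """
    wins = sum(
        similarity(question, with_needle) > similarity(question, control)
        for question, with_needle, control in pairs
    )
    return wins / len(pairs)
```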

Secondary Metrics

Separation Analysis
This metric evaluates how well the model distinguishes between relevant and irrelevant content. It includes the mean separation, which represents the difference between positive examples (passages containing the answer) and negative examples (passages not containing the answer), and the AUC (Area Under the Curve) score, which measures discrimination ability based on the area under the ROC (Receiver Operating Characteristic) curve.
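Both quantities can be computed directly from the per-haystack similarity scores; here is a minimal sketch using scikit-learn for the AUC (variable names are ours):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def separation_metrics(positive_scores, negative_scores):
    """`positive_scores`: similarities for haystacks containing the needle;
    `negative_scores`: similarities for needle-free haystacks of the same
    length. Returns the mean separation and the ROC AUC of using the score
    to distinguish positives from negatives.
    """
    mean_separation = np.mean(positive_scores) - np.mean(negative_scores)
    labels = np.concatenate([np.ones(len(positive_scores)), np.zeros(len(negative_scores))])
    scores = np.concatenate([positive_scores, negative_scores])
    auc = roc_auc_score(labels, scores)
    return mean_separation, auc
```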

Position Effects
We analyze how needle placement affects performance through the correlation coefficient between position and similarity score, regression slope showing performance change across positions, and position-binned performance analysis.
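These are standard correlation and regression quantities; a brief sketch with SciPy (variable names are ours):

```python
import numpy as np
from scipy import stats

def position_effects(positions, scores, n_bins=5):
    """`positions`: relative needle positions in [0, 1]; `scores`: the
    corresponding similarity scores. Returns the Pearson correlation, the
    regression slope of score vs. position, and the mean score per position bin.
    """
    correlation, _ = stats.pearsonr(positions, scores)
    slope, intercept, *_ = stats.linregress(positions, scores)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.clip(np.digitize(positions, bins) - 1, 0, n_bins - 1)
    binned_means = [np.mean(np.array(scores)[bin_ids == b]) for b in range(n_bins)]
    return correlation, slope, binned_means
```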

Findings

Degradation of Similarity Score and Correctness

Our results clearly show performance degrades as context length increases, with the mean similarity score dropping from 0.37 at 128 tokens to 0.10 at 8K tokens, following a non-linear trend with a sharp decline between 128 and 1K tokens.

Figure 4: Normalized performance vs context length.

In the figure below, we show that inverting the needle makes little difference to the normalized similarity score. Both the default needle (e.g. "Actually, Yuki lives next to the Semper Opera House") and the inverted needle (e.g. "The Semper Opera House is next to where Yuki lives") show almost identical performance:

Graph comparing "Default Order" vs "Inverted Order" performance using context length (tokens) and normalized similarity score
Figure 5: Default vs inverted order performance.

The dataset’s different semantic connections exhibit varying performance, with location-landmark pairs maintaining the strongest results, while dietary and medical condition connections degrade more quickly:

Figure 6: Normalized group performance vs context length.

Comparing the results to random chance backs up our findings: the bigger the haystack, the closer the results get to randomness, i.e. for a given question we are almost as likely to select a random needle-free passage as the haystack that actually contains the needle (the correct answer):

Graph titled "Model Performance vs Random Chance" on a black background, depicting model performance with a blue line and a 5
Figure 7: Model performance vs random chance (0.5).

Again, we see varying performance based on different semantic connections, with some (like dietary restrictions) dropping well below random chance even at relatively short contexts, while others (like locations and landmarks) display much better performance regardless of context length:

Figure 8: Group performance vs random chance.

Inverting the needle has little effect on performance. In the graph below, we show the comparative ratio of preferring the correct haystack over random chance, split by whether the placed needle stated the answer in default or inverted order:

Figure 9: Default vs inverted order - performance vs random chance.

Since the results for default- and inverted-order needles follow the same trend, we won't split further analyses along this criterion.

Can We Separate Positive from Negative Results?

One of our most important findings comes from analyzing how well embedding models can distinguish relevant from irrelevant content across different context lengths. This "separation analysis" reveals that retrieval correctness falls rapidly between context lengths of 128 and 1,000 tokens, and then continues to drop, albeit at a slower rate:

Graph titled "Separation Analysis vs Context Length" with X-axis "Separation Gap" and "AUC Score" and Y-axis "Score"; include
Figure 10: Separation analysis vs context length.

For short contexts (128 tokens), the model shows strong separation with a mean difference of 0.1 and clear discrimination, achieving an AUC of 0.81 (meaning that 81% of the time, the model ranks a relevant passage higher than an irrelevant one). This indicates that in shorter contexts, the model can reliably distinguish passages that contain the answer from those that don’t.

However, this rapidly deteriorates as context length increases. By 1,000 tokens, separation drops by 60% to 0.040, and AUC decreases to 0.66, signaling a notable drop in performance. At 8,000 tokens, there’s minimal separation (0.001) and near-random discrimination, with an AUC of just 0.50. This pattern reveals a crucial insight: even when models can compute reasonable similarity scores in longer contexts, they can barely use these scores to tell relevant from irrelevant information. By 8,000 tokens, the model’s ability to differentiate relevant content is essentially random chance.

The speed of this degradation as context grows is striking. Raw similarity scores drop by about 75% from 128 to 8,000 tokens, but separation metrics decline by nearly 99% over the same span. Even more concerning, the effect size shows an even steeper decline, falling by 98.6%. This suggests that embedding models' struggles with long contexts go beyond just reduced similarity scores—their fundamental ability to identify relevant information breaks down far more severely than previously understood.

How Does the Needle Position Affect the Core Metrics?

While core performance metrics are usually best when the needle is at the beginning of the haystack, the performance degradation doesn't correlate neatly with placement in the middle of the context:

Graph illustrating "Performance by Relative Position Across Context Lengths," with lines for context lengths 128 to 8192, sho
Figure 11: Performance by relative position across context lengths.

Performance is best when the needle is at the start of a given context, and in short contexts there is a small bump in performance when the needle is placed towards the end. Across all context lengths, however, performance drops when the needle sits in middle positions:

Figure 12: Position-wise comparative ratios.

What Effect Does Query Expansion Have on the Results?

We recently released a blog post on query expansion, a technique used in search systems to improve search performance by adding relevant terms to queries.

Query Expansion with LLMs: Searching Better by Saying More
Search has changed a lot since embedding models were introduced. Is there still a role for lexical techniques like query expansion in AI? We think so.
Jina AI · Michael Günther, Scott Martens

In the post, we used an LLM to generate expansion terms, which were then added to query embeddings for improved retrieval performance. The results showed significant improvements. Now, we want to examine how (or if) the technique will improve results for needle-in-a-haystack search. For example, given a query:

Which character has been to Dresden?

We use an LLM (Gemini 2.0) to expand it and add 100 additional terms that look like this:

Which character has been to Dresden? Character: fictional character literary character protagonist antagonist figure persona role dramatis personae

Dresden: Dresden Germany; bombing of Dresden World War II historical fiction Kurt Vonnegut Slaughterhouse-Five city in Saxony Elbe River cultural landmark

Has been to: visited traveled to journeyed to presence in appears in features in set in takes place in location setting
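As in the original post, the expansion terms are simply appended to the query text before embedding it. Below is a minimal sketch, where the prompt wording and the generate() placeholder stand in for the actual Gemini 2.0 call:

```python
def expand_query(query, n_terms, generate):
    """`generate` is a placeholder for an LLM call (e.g. Gemini 2.0) that
    returns expansion terms for the query; the prompt is illustrative only.
    """
    prompt = (
        f"Give {n_terms} search terms closely related to the query below, "
        f"grouped by the query's key concepts.\n\nQuery: {query}"
    )
    expansion = generate(prompt)
    # The expanded query is the original query followed by the extra terms.
    return f"{query} {expansion}"

# The expanded query is then embedded in place of the original one, e.g.:
# expanded = expand_query("Which character has been to Dresden?", 100, call_llm)
# query_embedding = embed(expanded)
```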

How Much Does Query Expansion Help Match the Needle to the Haystack?

For our experiment, we generated three sets of expanded query terms (as described in the original post) with 100, 150, and 250 terms respectively. We then ran the same set of experiments as before, repeated three times, once with each set of expanded query terms.

Results with all expansion sets showed clear degradation as context length increased, with a similar effect to not using query expansion at all (Figures 4 & 7):

Figure 13: Combined normalized performance: all expansion sizes.

Compared to unexpanded queries, all query expansion conditions showed the same pattern of degraded performance as context grew. The degradation trend is also still non-linear with a sharp decline between 128 and 1K tokens:

Figure 14: Combined comparative ratio: all expansion sizes.

However, examining the comparative ratio shows that query expansion has clear benefits: The model is much more likely to select the haystack with the needle over the one without. In contrast, without query expansion the probability of selecting the correct passage dropped so much that, at a haystack size of 8K tokens, it was nearly the same as picking a random passage.

How Do We Explain Needle Matching Results with Query Expansion?

These results align with findings from both the NoLiMa paper and the query expansion research, and can be explained as follows:

  1. Quality vs. quantity trade-off: The better performance of 100-term expansion, compared to 150 and 250 terms, suggests there's an optimal point where additional terms start adding more noise than signal. The 250-term expansion likely introduces terms with weaker semantic relationships to the original query, which become counterproductive at longer contexts.
  2. Context length remains the primary challenge: Despite the benefits of query expansion, performance still degrades significantly with increasing context length. This suggests that even with expansion, the fundamental architectural limitation of attention-based models in long contexts persists.
  3. Practical threshold identification: The comparative ratio staying above 0.5 indicates that expansion maintains above-random-chance performance even at 8K tokens, providing a practical way to extend the effective context window for embedding models. Comparison to random chance shows that, even when presented with long context documents, expanding the query makes it more likely to find the correct answer (i.e. the needle) than an incorrect one. This is an improvement compared to non-expanded queries, where the chance of finding the correct answer approaches random as the context length increases.

What Role Does Lexical Matching Play in Embeddings?

In the experiments above, we measured the effectiveness of embedding models at making semantic "one-hop" inferences in long-context passages by removing all possibility of literal matching. We found that, even with query expansion, the embedding model's ability to find relevant passages deteriorates as the context length grows. This effect is significant, and the finding is noteworthy because we would normally expect an embedding model to make the relevant inferences without additional assistance. When replacing literal matches with one-hop variations (e.g. "Dresden" → "Semper Opera House"), all we're doing is replacing one concept with a closely related one.

Let's now take the bull by the horns and ask the question directly: does literal matching really play a significant role in semantic matching, or does the effect of context length overwhelm it? To answer this question, we redid our tests with needles containing literal matches, e.g.

  • Question: “Which character has been to Dresden?”
  • Needle (default): “Actually, Yuki lives in Dresden.”
  • Needle (inverted): “Dresden is where Yuki lives.”

Notice that, instead of requiring the one-hop inference that the Semper Opera House is in Dresden (and hence that a character living next to it must be the one who has been to Dresden), these needles directly name the character who lives in Dresden.

Having reformulated all 22 question-needle pairs in this way, we re-ran our experiments across all context lengths and needle placements, using the same embedding model, jina-embeddings-v3.

Figure 15: Normalized performance vs context length.
Figure 16: Model performance vs random chance (0.5).
Figure 17: Position-wise comparative ratios.

The results are striking. Even with literal matches in the context, the model's ability to distinguish the correct answer from a random one rapidly deteriorates as the context length grows, albeit maintaining a slight advantage over conditions with no literal match at all.

This demonstrates that an embedding model's ability to find a needle in a haystack is affected far more by the size of the haystack (and the placement of the needle within it) than by the semantic formulation of the needle.

Conclusion

Our findings with embedding models align with the NoLiMa paper's findings on LLMs: context size strongly determines the correctness of matching and retrieval. We show that this holds even when there is an exact, letter-for-letter lexical match.

The problem is not the ability of an embedding model to perform semantic matching. Embedding models like jina-embeddings-v3 handle short contexts quite well, but their effectiveness declines as context length increases. Query expansion can reduce this effect to some degree, but retrieval quality still degrades over longer contexts. Furthermore, query expansion poses additional problems, since it is crucially important to identify expansion terms that improve retrieval without adding semantic noise. We are investigating ways to directly address needle-in-a-haystack retrieval and improve the performance of the future jina-embeddings-v4.
