Context Window Saturation in Reasoning Agents
Context Engineering is necessary when attention is all that matters.
The recent “failure” of Apple’s Intelligent Reasoning paper got me thinking. In my previous explorations of the Prisoner’s Dilemma and 4x4 Tic Tac Toe, I noticed that even though my agents could reason their way through complete games, and occasionally, given the right observations, even reach the right conclusions, they still failed spectacularly in most cases. The Towers of Hanoi is another game that was used to evaluate whether an agent can actually reason through a problem.
There is one big difference, though: Towers of Hanoi has a deterministic recursive algorithm that always terminates after the minimum (2^n)-1 moves (e.g., 3 disks = 7 moves, 9 disks = 511 moves).
# Recursive Towers of Hanoi: move n disks from source to target via auxiliary.
pegs = {"A": [3, 2, 1], "B": [], "C": []}  # example: three disks, largest at the bottom

def draw_towers():
    print(pegs)  # minimal visualization of the current peg state

def move(n, source, target, auxiliary):
    if n == 1:
        pegs[target].append(pegs[source].pop())  # move the top disk
        draw_towers()
    else:
        move(n - 1, source, auxiliary, target)   # clear n-1 disks out of the way
        move(1, source, target, auxiliary)       # move the largest remaining disk
        move(n - 1, auxiliary, target, source)   # stack the n-1 disks back on top

move(3, "A", "C", "B")  # 2^3 - 1 = 7 moves
Towers of Hanoi Algorithm
That is why I did not select Towers of Hanoi for my Game Theory series. What is the reasoning challenge here? If there is an optimal way to solve the puzzle algorithmically, why ask a model to reason its way through it instead?
Apple, in all its wisdom, of course, thinks otherwise, so here we are.
Reasoning and Large Context Windows
The point of reasoning is to draw justified conclusions from available information to make better decisions and, by extension, understand the world more accurately. In Tic Tac Toe, you play against an opponent who changes the state of the world. In Towers of Hanoi, that is never the case. So, in my opinion, it is not a good benchmark for evaluating the effectiveness of reasoning. As I have mentioned before, recent language models can take longer and longer contexts as input, but relatively little is known about how well they use these longer contexts. I’d argue that, beyond that uncertainty, longer contexts are actually detrimental to their performance.
In theory, LRMs (Large Reasoning Models) are specifically designed or adapted to perform multi-step logical reasoning, structured problem-solving, and decision-making, while LLMs are primarily optimized for understanding and generating natural language based on statistical patterns in text. These models conduct explicit reasoning between the special tokens <think> and </think> before producing their “final answer”. On top of that, recent studies have shown that LRMs can adaptively allocate reasoning strength (i.e., the number of reasoning tokens) based on problem difficulty; they tend to allocate more reasoning strength to harder questions to improve accuracy. But is more complexity in Towers of Hanoi really that type of “harder” question?
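To make the mechanics concrete, here is a minimal sketch of how the reasoning block can be separated from the final answer in such a model’s raw output. The example string and the whitespace-based token estimate are my own simplifications, not any vendor’s API.

import re

raw_output = (
    "<think>Disk 1 to C, disk 2 to B, disk 1 to B, ...</think>"
    "Move the smallest disk first, then proceed recursively."
)

# Split the explicit reasoning trace from the final answer.
match = re.search(r"<think>(.*?)</think>(.*)", raw_output, re.DOTALL)
reasoning, final_answer = match.group(1), match.group(2).strip()

# Rough proxy for "reasoning strength": how many tokens were spent thinking.
print(len(reasoning.split()), "reasoning tokens (approx.)")
print("Final answer:", final_answer)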
What makes a question hard?
One dimension of hardness is algorithmic complexity and the size of the search space. How many people can brute-force all possible chess moves? Yet we expect a reasoning agent to do exactly that? I think this is a misconception. LLMs struggle to extract relevant information when they have to work off very large contexts, even though these tasks require them to operate successfully over long sequences. Existing language models are generally implemented as Transformers, where memory and compute needs grow quadratically with sequence length.
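To give a feel for that quadratic growth, here is a back-of-the-envelope sketch of how the attention score matrix alone scales with sequence length; the float16 assumption and the single-head, single-layer simplification are mine.

# Rough size of one n x n attention score matrix (per head, per layer) in float16.
BYTES_PER_SCORE = 2  # float16 assumption

for n in [4_000, 32_000, 128_000, 1_000_000]:
    gib = n * n * BYTES_PER_SCORE / 1024**3
    print(f"{n:>9} tokens -> {gib:,.1f} GiB of attention scores")

Real implementations avoid materializing this full matrix (FlashAttention, for example, computes it blockwise), but the compute still scales with the square of the sequence length.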
You can see in these charts that storing more memory helps with long-term dependencies but increases computation and memory usage. Modern Transformer language models were often trained with relatively moderate context windows: GPT-4.5 is 128,000 tokens, and Claude 3 is about 200,000 tokens. There are now models available with far larger context windows, but as I mentioned before, are these 1-million+ context windows truly practical?
The observation is that LRMs have enabled complex, step-by-step reasoning but often introduce significant overthinking, infinite thinking loops, and incorrect tangents, resulting in excessively verbose and redundant outputs.
Here are some of the problems I found:
LLM retrieval accuracy declines log-linearly toward zero as interference accumulates, and errors arise from retrieving previously overwritten values (see the sketch after this list).
Costs increase linearly with larger contexts. Processing larger contexts requires more computation, and LLM providers charge per token, so a longer context (i.e., more tokens) makes each query more expensive.
The memory length is fixed, so only a finite window of past context is available.
The memory is not differentiable across segments, so training on long context is limited to recurrence tricks, not true end-to-end optimization.
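As a concrete illustration of the interference point above, here is a minimal sketch in the spirit of the PI-LLM setup: the same keys are overwritten repeatedly and only the final values matter. The prompt wording and the ask_model helper are placeholders, not any specific API.

import random

# Build a stream of key-value updates where each key is overwritten many times.
keys = ["alpha", "beta", "gamma", "delta"]
updates = [(k, random.randint(0, 999)) for _ in range(50) for k in keys]
random.shuffle(updates)

expected = {}  # ground truth: only the *last* value per key counts
lines = []
for key, value in updates:
    expected[key] = value
    lines.append(f"{key} = {value}")

prompt = "\n".join(lines) + "\n\nWhat is the current value of each key?"

# answer = ask_model(prompt)  # placeholder for whatever LLM client you use;
# compare the answer against `expected` to measure interference errors.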
Most agent frameworks implement a ReAct (Reason -> Act) pattern in the form of a loop that "reasons" until a "final_answer" has been found. Most frameworks also implement some form of session cache (scratchpad), tool use, and thought logging/tracing. Overall, it works, but it is really inefficient: essentially a synchronous loop that blocks everything else until it terminates.
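For reference, here is a minimal sketch of that loop. The llm, parse_action, and run_tool helpers are placeholders standing in for whatever model client and tool registry a framework actually uses.

def react_loop(task, llm, parse_action, run_tool, max_steps=20):
    scratchpad = [f"Task: {task}"]            # session cache / short-term memory
    for _ in range(max_steps):
        thought = llm("\n".join(scratchpad))  # the context grows every iteration
        scratchpad.append(thought)
        action, argument = parse_action(thought)
        if action == "final_answer":
            return argument
        observation = run_tool(action, argument)  # e.g. a search or calculator call
        scratchpad.append(f"Observation: {observation}")
    return None  # gave up: the loop blocked the whole time and found no answer

Note how the scratchpad, and therefore the prompt, only ever grows; that is exactly where context window saturation creeps in.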
If you now consider this chart from the Apple paper, specifically the first column, you will notice that the models seem to drop off at almost the same point of complexity.
source
I believe, in a nutshell, Apple may have rediscovered (maybe even proven) context window saturation (brain freeze).
So I dug deeper into the topic:
Evaluating Language Model Context Windows (arXiv)
Less is More: Why Use Retrieval Instead of Larger Context Windows (Pinecone Blog)
Lost in the Middle: How Language Models Use Long Contexts (arXiv)
Working Memory (ScienceDirect, Wikipedia)
Unable to Forget (arXiv)
MRCR (Hugging Face)
Wait, We Don’t Need to “Wait”! (arXiv)
On Reasoning Strength Planning in Large Reasoning Models (arXiv)
From these, I aggregated the problems that might have an effect on the observed reasoning effectiveness.
I am a strong believer in engineering context windows efficiently.
And that includes:
Managing the session cache/scratchpad/short-term memory buffer (see the sketch after this list).
Making sure the data that goes into context is well understood.
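Here is a minimal sketch of the first point: trimming a scratchpad to a fixed token budget before each model call. The budget value and the four-characters-per-token heuristic are assumptions, not universal constants.

def estimate_tokens(text):
    return len(text) // 4  # rough heuristic: ~4 characters per token in English

def trim_scratchpad(entries, budget=8_000):
    """Keep the most recent scratchpad entries that fit in the token budget."""
    kept, used = [], 0
    for entry in reversed(entries):       # newest entries are the most relevant here
        cost = estimate_tokens(entry)
        if used + cost > budget:
            break
        kept.append(entry)
        used += cost
    return list(reversed(kept))           # restore chronological order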
If you don’t do that, you will likely run into these issues.
Problems
1. Context Window Saturation
As the agent “thinks”, it populates the context window with more and more information. Once the context (including session information like a PDF upload or a script) exceeds the model’s limit, the model may lose context or silently truncate part of the prompt, leading to hallucinated, irrelevant, and/or false responses.
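A cheap way to catch this before it happens is to count tokens client-side and compare against the window you are targeting. This sketch uses tiktoken’s cl100k_base encoding and a 128,000-token limit purely as examples; swap in your model’s tokenizer and actual window.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
CONTEXT_LIMIT = 128_000  # example limit; use your model's actual window

def context_usage(messages):
    """Return the fraction of the context window the messages already occupy."""
    used = sum(len(enc.encode(m)) for m in messages)
    return used / CONTEXT_LIMIT

if context_usage(["system prompt...", "uploaded PDF text...", "user question"]) > 0.8:
    print("Warning: context window close to saturation, consider trimming.")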
2. Increased Ambiguity
Long contexts might contain repetitive and ambiguous information, forcing the model to struggle to find a clear direction and to iterate over incorrect thought paths inefficiently. This also leads the model to drift by incorporating irrelevant aspects that might not align with the original prompt. Real-world scenarios in particular deviate from research settings: financial reports, legal documents, or customer feedback are far more intricate and nuanced than the data typically available during research.
3. Lost in the middle
Accuracy is high when relevant information appears at the start or end of the input; this is often referred to as primacy and recency bias. When placed in the middle, relevant content is often ignored, exposing a core limitation in how today’s language models handle long contexts.
source - page 1
The LitM paper expands on this, noting that even extended-context models don’t show meaningful gains over their shorter-context counterparts. Even when retrieval systems surface more documents (e.g., 50 vs. 20 from Wikipedia for NaturalQuestions-Open), performance plateaus well before retriever recall does. More context isn’t the issue; effective use of it is.
4. Needle in the Haystack
The needle-in-a-haystack problem refers to the challenge language models face when relevant information (the "needle") is buried within long input contexts (the "haystack"). During research, the needle is often a synthetic toy, so the problem is only amplified once models go into production. As context windows grow, models often struggle to retrieve or attend to critical details that are not positioned near the beginning or end. This exposes a weakness in their attention mechanisms and memory efficiency. Even with larger context capacity, models tend to overlook mid-positioned information, leading to performance drops. It's a key limitation in scaling context length without improving retrieval or prioritization within the window.
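If you want to probe this yourself, here is a minimal sketch of the standard setup: bury one synthetic fact at different depths of a long filler document and check whether the model can recall it. The filler text and the ask_model helper are placeholders for your own corpus and client.

FILLER = "The quick brown fox jumps over the lazy dog. " * 5_000  # the haystack
NEEDLE = "The secret passphrase is 'violet-42'."
QUESTION = "What is the secret passphrase?"

def build_prompt(depth):
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the haystack."""
    cut = int(len(FILLER) * depth)
    return FILLER[:cut] + " " + NEEDLE + " " + FILLER[cut:] + "\n\n" + QUESTION

for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    prompt = build_prompt(depth)
    # answer = ask_model(prompt)  # placeholder for your LLM client
    # print(depth, "violet-42" in answer)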
Solutions
So, what can be done to improve or even alleviate the problems?
Optimize Prompts – Refactor inputs to prioritize concise, high-signal information early, reducing verbosity and improving retrieval by the model.
Chunking – Break long contexts into smaller, semantically coherent chunks to help the model process and recall relevant sections more effectively (a minimal sketch follows this list).
Medoid Voting – Use representative samples (medoids) from multiple retrieved passages and aggregate model outputs to improve robustness and consistency.
Intracontext Interference – Identify and mitigate conflicts or distractions between multiple pieces of information within the same context window.
Longformer – Applies a sliding window attention mechanism to reduce computational load and focus on local token relationships, improving efficiency in long contexts.
PI-LLM – Streams semantically related key–value pairs incrementally and queries only the final result, testing how well models retain and update relevant information over time.
Overthink Detection with the Predictor – Detects when a model's later layers override earlier correct reasoning, helping flag when it's second-guessing accurate answers.
Efficient Reasoning with Activation Steering – Guides model behavior by manipulating internal activations, steering it toward more accurate or efficient reasoning paths.
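As a minimal sketch of the chunking idea above, here is a naive splitter that cuts a document into overlapping, fixed-size character windows; real implementations usually split on sentence or semantic boundaries, and the sizes here are arbitrary assumptions.

def chunk_text(text, max_chars=2_000, overlap_chars=200):
    """Split text into overlapping, roughly fixed-size chunks."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap_chars  # overlap so facts on a boundary appear in both chunks
    return chunks

The chunks can then be embedded and retrieved individually, so only the relevant slices end up in the context window instead of the whole document.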
In closing
This post has already gotten quite long, and Substack tells me it is close to the email length limit. I thought that was interesting, since there also seems to be an effective limit on how much information can be sent in an email.
Throughout this post, I aimed to explain the problem Apple ran into when running their reasoning research. Personally, if there is a reliable deterministic way to solve a problem, I would always prefer to use it, as it is normally significantly more computationally efficient. Most of the research quoted above indicates that current language models do not robustly make use of information in long input contexts. For years, I have held the same opinion. I believe that mirrors the naive assumption about how the human mind works as well, but it also underlines that just dumping all documents into a massive context window does not necessarily lead to efficient outcomes.
In a nutshell, make sure you engineer your context. Not only your prompts.
More is to come on this topic.
Pro Tips
1 token ≈ ¾ of a word in English (on average).
Watch for special tokens (e.g., \n, indentation, or emojis).
Prompts count too – system + user + assistant messages all use tokens.
Have a great day