Memory and Context
Understanding why memory matters and how we can design an efficient router for it.
Context window size is growing, and as a result, memory management is becoming increasingly important. So much so that even mighty OpenAI is now rolling out memory as a new feature.
But is having access to a large context necessarily a good thing?
The answer lies in managing that context, both effectively and efficiently.
In general, a larger context window allows the model to hold more tokens in working memory, helping it to keep track across longer sequences of conversation or larger bodies of text.
Here is a table of current models with large context windows.
[Table: current LLMs with large context windows]
But is having a large context window actually an advantage?
In short: I am not a fan.
The Cons of Large Context Windows
In my opinion, large context windows introduce several real disadvantages, especially for autonomous agents.
First, cognitive overload becomes a problem: agents may struggle to focus their attention accurately when they are buried under irrelevant content. And, as this paper shows, cognitive overload also opens the door for prompt injections, a security risk.
Second, larger token counts increase inference time and compute cost, since processing huge contexts takes longer and demands more resources. This is especially problematic when our agents operate in real-time loops or cost-sensitive deployments, as might be the case in robotics, autonomous driving, or rendezvous and proximity operations.
Third, current LLMs lack structured memory control—treating the entire context as flat rather than hierarchical—so agents can't easily segment or organize information like humans using short-term, long-term, or task-specific memory.
Fourth, compositional reasoning and planning can suffer; rather than abstracting or building modular steps, agents may default to "dumping" everything into context and hoping the model connects the dots, reducing clarity and efficiency.
And finally, large contexts might increase security and leakage risks, as agents could inadvertently pull in sensitive or outdated information, leading to data privacy issues or context confusion.
In sum, while expanding context windows is commonly seen as a powerful tool, it also introduces nuanced problems that autonomous systems must carefully manage.
Four different concepts for managing memory
Memory is an important topic for agents, and there has been some recent progress, such as the launch of the Redis Agent Memory Server.
But let's first explore how memory in agents differs from memory for LLMs or chat apps. While ChatGPT can now reason, use search as a tool, and access memory, its capabilities in those domains are still limited. So how is agent memory different?
Let’s first nail down some definitions:
Scratchpad: a temporary space where the agent can write down its thoughts, intermediate steps, or reasoning as it works through a task.
Context: the information the agent uses to understand and respond accurately (like the current conversation or task).
Context window: the limited amount of information the model can "see" and process at once.
Data: usually kept in persistent storage; data, structured or unstructured, can be passed into the context window for further processing. Common formats are JSON, CSV, or plain text files, because they tokenize easily. Reading from a graph or SQL database should normally happen through a tool.
Most of these types are examples of short-term memory.
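If you like to think in code, here is a minimal sketch of these short-term memory types as plain Python dataclasses. All names and fields are illustrative, not a prescribed schema.

```python
# A minimal sketch of the short-term memory types defined above, as
# plain dataclasses. Names and fields are illustrative, not a schema.
from dataclasses import dataclass, field
from typing import Any

@dataclass
class Scratchpad:
    """Temporary space for thoughts, intermediate steps, and reasoning."""
    notes: list[str] = field(default_factory=list)

    def jot(self, thought: str) -> None:
        self.notes.append(thought)

@dataclass
class ContextWindow:
    """The bounded slice of information the model can see at once."""
    max_tokens: int
    segments: list[str] = field(default_factory=list)

@dataclass
class DataRecord:
    """Persistent data (JSON, CSV, text) passed in, usually via a tool."""
    source: str      # e.g. "sql", "graph", "file"
    payload: Any
```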
But why does memory matter so much?
Well, short-term memory is crucial as it serves as the brain’s temporary workspace, allowing it to hold and manipulate information needed in the moment.
[Image source: Cognitive Aspects of Structured Process Modeling]
For us meatbags, memory also plays a vital role in handling daily tasks like following conversations, remembering directions, or doing mental math. Without it, even simple actions would become difficult, as we'd quickly forget what we were just doing or thinking.
For us humans, short-term memory is also an essential component of learning, acting as the gateway to long-term memory by holding information long enough for it to be encoded. It supports problem-solving, decision-making, and language comprehension by keeping relevant pieces of information active in our minds. The parallel to what agents need becomes immediately apparent.
Short-term memory helps us, and our agents, stay focused and on track. Since for most healthy humans memory works "out of the box", how would we build something like this for agents? And just to be crystal clear: even though there are fantastic tools for memory management out there, I don't think any of them is fully optimized for the needs of autonomous agents.
So what are the challenges?
Core Challenges to Address
In my opinion, it comes down to these:
Limited Context Windows: Most LLMs have fixed token limits that restrict how much information can be processed at once, and even where the nominal limit is large, performance tends to degrade well before it is reached.
Memory Management: Balancing what to keep in active context vs what to store for later retrieval.
Relevance Determination: Identifying which pieces of information are most relevant for the current task.
Information Organization: Structuring context in a way that's optimally useful for the agent.
Conceptual Design - Key Components
The system I have in mind therefore looks like this:
The Context Store serves as the central repository for all information the agent has processed, enabling organized and efficient access to past interactions. It is structured hierarchically—spanning sessions, conversations, and topics—to maintain contextual relevance over time. The metadata tagging system enhances retrieval capabilities by allowing fast and targeted access to specific content. This is usually a simple TF-IDF keyword dump. Additionally, the store incorporates vector embedding storage to support semantic search, enabling the agent to understand and retrieve information based on meaning rather than just keywords.
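To make this concrete, here is a minimal sketch of such a Context Store. It assumes an in-memory dict and a pluggable embed() callable; a real system would back this with a vector database.

```python
# A minimal sketch of the Context Store: hierarchical keys (session,
# conversation, topic), keyword tags, and embeddings for semantic search.
import uuid
from dataclasses import dataclass, field

@dataclass
class ContextEntry:
    text: str
    session_id: str
    conversation_id: str
    topic: str
    tags: set[str] = field(default_factory=set)   # TF-IDF-style keywords
    embedding: list[float] | None = None          # filled in on insert

class ContextStore:
    def __init__(self, embed):
        self._embed = embed                       # callable: text -> list[float]
        self._entries: dict[str, ContextEntry] = {}

    def add(self, entry: ContextEntry) -> str:
        entry.embedding = self._embed(entry.text)
        key = str(uuid.uuid4())
        self._entries[key] = entry
        return key

    def by_tag(self, tag: str) -> list[ContextEntry]:
        # Fast, targeted access via metadata tags.
        return [e for e in self._entries.values() if tag in e.tags]

    def by_topic(self, session_id: str, topic: str) -> list[ContextEntry]:
        # Hierarchical access: scope to a session, then to a topic.
        return [e for e in self._entries.values()
                if e.session_id == session_id and e.topic == topic]
```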
The Context Manager then orchestrates the flow of information into and out of the active context window, ensuring that the most relevant data is available for real-time processing. It employs prioritization algorithms to assess the importance of different inputs, selecting only the most critical pieces for immediate use. To maximize efficiency, the CM uses compression techniques to condense information without losing essential meaning, and implements caching strategies to store frequently accessed data, reducing latency and improving responsiveness. This is probably the most complex to implement.
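Here is how the Context Manager's prioritize-compress-cache loop might look, reusing the ContextEntry shape from the store sketch above. The scoring and compression heuristics are placeholders, not production logic.

```python
# A sketch of the Context Manager: prioritize, compress, cache, and pack
# entries into the budgeted context. All heuristics are placeholders.

class ContextManager:
    def __init__(self, store, budget_tokens: int):
        self.store = store
        self.budget = budget_tokens
        self._cache: dict[str, str] = {}   # caching strategy: memoize compressions

    def score(self, entry, query_terms: set[str]) -> float:
        # Naive priority: keyword overlap with the entry's tags. A real
        # system would blend recency, semantic similarity, and task state.
        return float(len(query_terms & entry.tags))

    def compress(self, text: str, max_chars: int = 280) -> str:
        # Placeholder "compression": truncate. Swap in an SLM summarizer here.
        if text not in self._cache:
            self._cache[text] = text if len(text) <= max_chars else text[:max_chars] + "..."
        return self._cache[text]

    def build_context(self, candidates, query_terms: set[str]) -> list[str]:
        # Highest-priority entries first, packed until the budget is hit.
        ranked = sorted(candidates, key=lambda e: self.score(e, query_terms),
                        reverse=True)
        selected: list[str] = []
        used = 0
        for entry in ranked:
            chunk = self.compress(entry.text)
            cost = len(chunk) // 4          # rough chars-per-token estimate
            if used + cost > self.budget:
                break
            selected.append(chunk)
            used += cost
        return selected
```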
The Retrieval System efficiently locates relevant information when needed, playing a key role in supporting the agent's responsiveness and accuracy. It begins with query understanding to accurately interpret the user's intent and retrieval requirements; I'd think most LLMs can already do this quite effectively. Leveraging a multi-modal search approach that combines keyword, semantic, and temporal dimensions, it ensures comprehensive coverage across different types of queries. I have written about BM25 and Neo4j in the past, so I won't repeat that here; for paying readers, it can be sourced from the archive. Thank you for your kind support.
Anyways, to refine results, the system applies relevance scoring, ranking retrieved content to prioritize the most useful and contextually appropriate information.
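As a sketch, the multi-modal scoring could blend the three signals with tunable weights. The weights below are illustrative assumptions, not recommended values.

```python
# A sketch of multi-signal relevance scoring: keyword, semantic, and
# temporal scores blended with tunable (assumed) weights.
import math
import time

def keyword_score(query_terms: set[str], doc_terms: set[str]) -> float:
    # Fraction of query terms the document covers (BM25 would be the
    # grown-up version of this).
    return len(query_terms & doc_terms) / len(query_terms) if query_terms else 0.0

def semantic_score(q: list[float], d: list[float]) -> float:
    # Cosine similarity between query and document embeddings.
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0

def temporal_score(created_at: float, half_life_s: float = 86_400.0) -> float:
    # Exponential decay: a day-old entry is worth half of a fresh one.
    return 0.5 ** ((time.time() - created_at) / half_life_s)

def relevance(query_terms, doc_terms, q_vec, d_vec, created_at,
              w_kw: float = 0.3, w_sem: float = 0.5, w_time: float = 0.2) -> float:
    return (w_kw * keyword_score(query_terms, doc_terms)
            + w_sem * semantic_score(q_vec, d_vec)
            + w_time * temporal_score(created_at))
```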
The Context Window Optimizer maximizes the utility of the agent's limited context space by intelligently managing what information is included at any given time. It employs dynamic summarization techniques to condense historical information into concise, relevant summaries that preserve meaning; this can be done with a light SLM. A template manager then ensures consistent prompt structures, helping maintain clarity and coherence across interactions. I had good experiences with Jinja templates in the past, and LlamaIndex recently launched exactly that feature with RichPromptTemplate.
Additionally, the optimizer performs token counting and optimization to make the most efficient use of available space within the context window, balancing detail with brevity.
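A minimal sketch of the template manager plus token counting, assuming the jinja2 and tiktoken libraries are installed; the prompt layout and budget are illustrative.

```python
# Template management with Jinja2 plus token counting with tiktoken
# (pip install jinja2 tiktoken). The prompt structure is an assumption.
import tiktoken
from jinja2 import Template

PROMPT = Template(
    "You are a helpful agent.\n"
    "## Summary of history\n{{ summary }}\n"
    "## Relevant context\n"
    "{% for c in chunks %}- {{ c }}\n{% endfor %}"
    "## User input\n{{ user_input }}\n"
)

ENC = tiktoken.get_encoding("cl100k_base")

def fit_to_budget(summary: str, chunks: list[str], user_input: str,
                  budget: int = 4096) -> str:
    """Drop the lowest-ranked chunks (list assumed sorted best-first)
    until the rendered prompt fits the token budget."""
    kept = list(chunks)
    while True:
        prompt = PROMPT.render(summary=summary, chunks=kept,
                               user_input=user_input)
        if len(ENC.encode(prompt)) <= budget or not kept:
            return prompt
        kept.pop()   # shed the least relevant chunk first
```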
Technical Approach, Architecture, And Data Flow
When the agent receives an input that requires context, we need to ensure that it can respond accurately. The Context Manager identifies what past information, such as previous interactions, topics, or metadata, is needed to interpret the input. Next, the Retrieval System locates the most relevant information using query understanding and multi-modal search across keywords, semantics, and time, and ranks the results by relevance to ensure usefulness.

The Context Window Optimizer then condenses and structures the data to fit within the agent's limited context window. This involves summarizing historical content, applying consistent prompt templates, and counting tokens to make efficient use of space. The result is an optimized, information-rich context package that equips the agent to generate clear and informed responses.

Operating seamlessly in the background, this system maintains continuity across conversations and adapts to evolving needs, ensuring the agent stays both efficient and contextually aware throughout complex interactions.
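Here is a toy, end-to-end run of that flow with stub components, just to make the hand-offs visible. Every heuristic in it is a placeholder, and the class names echo this post, not any library.

```python
# A self-contained toy run of the data flow: search -> rank -> fit.

class StubRetriever:
    def search(self, store, terms):
        # Multi-modal search stands in as plain keyword overlap here.
        return [e for e in store if terms & e["tags"]]

    def rank(self, hits, terms):
        # Relevance scoring: more overlapping terms = higher rank.
        return sorted(hits, key=lambda e: len(terms & e["tags"]), reverse=True)

class StubOptimizer:
    def fit(self, hits, budget_chars=200):
        # Pack ranked snippets until the (character) budget is exhausted.
        out, used = [], 0
        for e in hits:
            if used + len(e["text"]) > budget_chars:
                break
            out.append(e["text"])
            used += len(e["text"])
        return out

store = [
    {"text": "User prefers metric units.", "tags": {"units", "preference"}},
    {"text": "Last task: route planning.", "tags": {"route", "task"}},
]
terms = {"units"}

retriever, optimizer = StubRetriever(), StubOptimizer()
context = optimizer.fit(retriever.rank(retriever.search(store, terms), terms))
print(context)   # ['User prefers metric units.']
```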
Implementation Approach
In Phase 1, we focus on establishing the core framework for effective context management within the agent system. The first step is to define standardized interfaces that allow consistent handling of context across various components, ensuring smooth communication and modular integration. Following this, a basic context storage layer is implemented, using vector embeddings to represent stored information in a way that supports future semantic search capabilities. I used MySQL with Elasticsearch in the past and liked the performance, but LanceDB would also do a formidable job.
Simple retrieval mechanisms are introduced at this stage, relying on basic retrieval strategies such as recency and keyword matching to fetch relevant information quickly and reliably. To support efficient use of the agent’s limited context window, token counting utilities are developed, enabling the system to track and manage how much content can be included in each response. This foundational phase lays the groundwork for more advanced features by prioritizing modular design, lightweight retrieval, and essential optimization tools that will scale as the system grows in complexity.
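For the standardized interfaces, typing.Protocol works nicely because it keeps storage backends (MySQL plus Elasticsearch, LanceDB, and so on) swappable. The sketch below also includes the Phase 1 recency-and-keyword retrieval baseline; all names are mine, not a fixed API.

```python
# A sketch of Phase 1: a storage Protocol plus a simple retrieval baseline.
from typing import Protocol

class ContextStorage(Protocol):
    def put(self, key: str, text: str, embedding: list[float]) -> None: ...
    def get(self, key: str) -> str: ...
    def scan(self) -> list[tuple[str, str, float]]: ...  # (key, text, timestamp)

def recency_keyword_retrieve(rows: list[tuple[str, str, float]],
                             query: str, k: int = 5) -> list[str]:
    """Phase 1 baseline: score by keyword overlap, break ties by recency."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(text.lower().split())), ts, text)
              for _key, text, ts in rows]
    scored.sort(key=lambda t: (t[0], t[1]), reverse=True)
    return [text for overlap, _ts, text in scored[:k] if overlap > 0]
```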
Phase 2 focuses on adding advanced features to enhance the context management system. The first step is implementing hierarchical memory structures, which organize context into levels such as sessions, topics, and conversations, allowing for more efficient and nuanced data retrieval. To optimize the use of context space, compression and summarization techniques are introduced, reducing large volumes of information into concise, meaningful representations. Relevance scoring algorithms are also developed to rank the retrieved context, ensuring that the most relevant information is prioritized for decision-making. Relevance is usually either cosine similarity, Euclidean distance or a probabilistic retrieval function based on a bag of words.
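Here is a minimal sketch of what such a hierarchical structure could look like, with sessions holding conversations and conversations holding topic-tagged entries. Names and shapes are illustrative; the cosine-similarity side of relevance scoring appears in the retrieval sketch earlier.

```python
# A sketch of hierarchical memory: session -> conversation -> entry.
from dataclasses import dataclass, field

@dataclass
class Entry:
    text: str
    topic: str

@dataclass
class Conversation:
    entries: list[Entry] = field(default_factory=list)

@dataclass
class Session:
    conversations: dict[str, Conversation] = field(default_factory=dict)

class HierarchicalMemory:
    def __init__(self) -> None:
        self.sessions: dict[str, Session] = {}

    def remember(self, session_id: str, conv_id: str,
                 text: str, topic: str) -> None:
        sess = self.sessions.setdefault(session_id, Session())
        conv = sess.conversations.setdefault(conv_id, Conversation())
        conv.entries.append(Entry(text, topic))

    def recall_topic(self, session_id: str, topic: str) -> list[str]:
        # Retrieval scoped to one level of the hierarchy.
        sess = self.sessions.get(session_id)
        if sess is None:
            return []
        return [e.text for c in sess.conversations.values()
                for e in c.entries if e.topic == topic]
```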
Phase 3 focuses on optimization and integration to enhance system performance and adaptability. The first step involves adding performance benchmarks and optimization techniques to ensure the system operates efficiently under various loads and contexts. Governance is a key deliverable of any such system, and this is where we implement it. Configurable policies are then implemented, allowing the system to be tailored for different use cases, such as conversational agents, personal assistants, or other agent tasks. Lastly, visualization tools for context debugging are developed, providing insights into how context is managed and enabling easier troubleshooting and fine-tuning. These enhancements not only boost the system's performance but also make it more versatile and user-friendly, paving the way for deployment in diverse environments and applications.
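The configurable policies could be as simple as a dataclass that different deployments instantiate differently. All fields and defaults below are assumptions for illustration.

```python
# A sketch of Phase 3's configurable policies as a frozen dataclass.
from dataclasses import dataclass

@dataclass(frozen=True)
class ContextPolicy:
    max_context_tokens: int = 4096
    temporal_half_life_s: float = 86_400.0            # how fast old context decays
    persist_across_sessions: bool = False
    redact_tags: tuple[str, ...] = ("pii", "secret")  # governance hook

# Tailored instances for different use cases:
CHAT_POLICY = ContextPolicy(max_context_tokens=8192)
ASSISTANT_POLICY = ContextPolicy(persist_across_sessions=True,
                                 temporal_half_life_s=7 * 86_400.0)
```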
Differentiating Features
Relevance-aware summarization dynamically adjusts the level of detail in summaries based on the importance of the information. Context persistence strategies are implemented to configure how and which data is retained across sessions, ensuring relevant context is available when needed. Task-specific context templates optimize the structure of context to suit different agent tasks, improving efficiency.
Memory consolidation periodically refines long-term memory (!), ensuring that outdated or irrelevant information is summarized or pruned rather than left to accumulate. Additionally, explainable decisions provide transparency by clearly outlining why certain context is included or excluded, fostering trust and understanding in the system. These considerations ensure that the system remains efficient, adaptive, and transparent.
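A sketch of what periodic consolidation might look like: entries past a certain age get folded into a single summary entry instead of being kept verbatim. summarize() is a stub where a small summarization model would go; the cutoff is an assumption.

```python
# A sketch of periodic memory consolidation: fold stale entries into
# one summary entry. Entry shape ({"text": ..., "ts": ...}) is assumed.
import time

def consolidate(entries: list[dict], max_age_s: float = 7 * 86_400,
                summarize=lambda texts: " | ".join(texts)) -> list[dict]:
    now = time.time()
    fresh = [e for e in entries if now - e["ts"] <= max_age_s]
    stale = [e for e in entries if now - e["ts"] > max_age_s]
    if not stale:
        return fresh
    # Condense all stale entries into a single, freshly timestamped summary.
    summary = {"text": summarize([e["text"] for e in stale]), "ts": now}
    return fresh + [summary]
```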
Open Questions — In Closing
Several open questions remain in designing an effective context management system. First, how can we objectively determine the "importance" of information to ensure it aligns with the agent’s needs?
Second, what evaluation metrics should be used to assess context quality, especially when the context is dynamic and evolving? Another challenge is balancing generic applicability with specialized optimizations—how do we ensure the system remains flexible without compromising on task-specific performance? Additionally, identifying a default summarization approach that strikes the right balance between brevity and completeness is crucial.
Anyways. That’s what I got for you today. I hope you enjoyed this post on memory.
Please like, share, subscribe.