Applying the Tolman-Eichenbaum Machine to Generalization Tasks in Autonomous Driving
Components To A WorldMap Model For Autonomous Driving Agents
Day 138 at Chez Doritos.
One of the most remarkable abilities of the mammalian brain is its capacity to generate flexible behavior that adapts across different contexts.
For example, let's imagine you spent months learning how to drive a car in your hometown. You study the rules, learn the dimensions of the car, and familiarize yourself with the streets — you know the layout of intersections, traffic patterns, and local driving customs.
Note to self. A phatic finger is a public gesture carrying shared meaning on the highways of the Australian Outback.
You travel to a new city you've never visited before. Even though you've never driven on these particular streets, you are able to navigate and drive effectively.
But how is that possible?
If you think about it, you have never been in that particular situation, so this must be a totally new problem, right? Most traditional robot trajectory learning methods using video-based training would suggest so. In reality though, for us meat bags, the answer lies in our brain's ability to generalize. We are able to strip away the particular sensory context of our hometown streets and extract the abstract notion of general driving procedures.
At the same time, we also know the general principles of how roads and road rules work, so one quick assessment of the traffic patterns and signs is enough to relate that particular layout to our inner model of driving.
Unless you are driving in India, I suppose. That's not a dunk on India. I sat on the board of an Indian RBI-regulated NBFC for a while and got to experience this frequently.
Although I was not allowed to drive in India (for obvious reasons), I am able to drive in many parts of the world. How?
The Cognitive Map
Our ability to generalize across driving patterns requires information about the world to be organized into a coherent, multi-purpose framework, also known as a cognitive map.
His name was Edward C. Tolman
Let's run an experiment. You are working on an autonomous car. You drive the car to a test ground and train it. Every time the car finds its way to the goal, the driving agent gets a reward. After a few successful trials, you drive the car to a new test ground. Normally, the car would try to follow the familiar path, but what would happen if it finds it blocked? Now the model is faced with a decision.
Which alternative path to pursue instead?
Let's think about how to approach this decision-making task.
One way is having the car learn by associations, which, similar to real life, means that it would take the path most similar to the one that originally led to the reward. If, however, our car had some sort of internal map of the spatial layout, it would choose a path that traverses in the direction of the reward, even though this route had never been experienced before and is therefore not directly associated with the reward.
Experiments like these, but with live animals, were conducted in the 1930s by Edward Tolman, an American psychologist. He also coined the term "cognitive map", relating to the idea that animals have something like a mental map of their surrounding space. While his research was impactful, it took 40 years until neuroscientists were able to see how such a map manifests in the neural activity of the brain.
Source: Trajectory of a rat through a square environment is shown in black. Red dots indicate locations at which a particular entorhinal grid cell fired.
Neurons in Hippocampal Formation
Although we know now that cognitive maps serve as general-purpose representations of the world, historically they were usually studied only in the context of spatial behavior. Spatial behavior refers to how humans and animals navigate, interact with, and use space in their environment. The predominant view is that the workhorse of cognitive mapping in the mammalian brain is the hippocampal formation, which includes the hippocampus and the entorhinal cortex; the latter serves as a gateway through which information flows in and out of the hippocampus.
The hippocampus itself contains several types of specialized, spatially selective neurons, and their activity reveals how knowledge is organized into a cognitive map at the level of single cells.
Components of a Tolman-Eichenbaum Machine
Let's briefly review the major wet-ware components of the hippocampus and entorhinal cortex:
Place cells: The “base layer” of the hippocampus. Neurons that become active when the subject (i.e., our car) is in a specific location or place in an environment. They code for the current location in a context-dependent way, since the response of a single cell is totally different in different surroundings.
Grid cells: Located in the entorhinal cortex. In a biological subject, grid cells fire in regular periodic patterns arranged on a hexagonal grid as the subject moves in the environment. This regular pattern allows the subject to understand its position in space by storing and integrating information about location, distance, and direction. Maybe think about it like the ping of a radar or Bluetooth Beacon.
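The hexagonal firing pattern can be sketched with a standard idealized model from the modeling literature: a sum of three plane waves whose wave vectors sit 60° apart, which produces firing peaks on a hexagonal lattice. This is a textbook approximation rather than anything specific to the studies discussed here, and the spacing and phase parameters are illustrative.

```python
import numpy as np

def grid_cell_rate(pos, spacing=0.5, phase=(0.0, 0.0)):
    """Idealized grid-cell firing rate at a 2D position.

    Sum of three plane waves whose wave vectors are 60 degrees apart;
    the peaks of the resulting pattern tile space as a hexagonal lattice.
    """
    # Wave number chosen so that neighboring firing peaks sit `spacing` apart.
    k = 4 * np.pi / (np.sqrt(3) * spacing)
    x = np.asarray(pos, dtype=float) - np.asarray(phase, dtype=float)
    rate = 0.0
    for angle in (0.0, np.pi / 3, 2 * np.pi / 3):
        wave_vec = k * np.array([np.cos(angle), np.sin(angle)])
        rate += np.cos(wave_vec @ x)
    # The raw sum lies in [-1.5, 3]; rescale to a firing rate in [0, 1].
    return (rate + 1.5) / 4.5

print(grid_cell_rate((0.0, 0.0)))  # → 1.0 (a firing-field peak at the lattice origin)
```

Moving the animal (or car) through space and sampling this function is what produces the regular red-dot pattern in the rat-trajectory figure above.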
Boundary-vector cells: Found in the entorhinal cortex, they activate whenever the subject is at a certain distance and a certain direction away from any object in the environment. I think it’s very similar to Euclidean distance in machine learning.
Landmark cells: These neurons are similar to boundary-vector cells but respond selectively to a specific object and not others. The relationship to named-entity recognition tasks is obvious.
The entorhinal cortex provides a kind of general coordinate system allowing the brain to perform vector computations and estimate distances.
The hippocampus, in turn, forms a more specific code, providing the brain with information about particular locations and landmarks in this coordinate system.
A multi-modal world model
Although all of these neurons historically have been discovered during experiments with spatial behavior, what's crucial is that this selectivity is not restricted to physical space. For example, if you train a rat to press a lever adjusting the frequency of a sound, you'll see that certain neurons in the hippocampus become selective to a particular frequency range, like a conventional place cell but in the one-dimensional space of sound frequencies. Neurons in the entorhinal cortex also develop a frequency-dependent pattern of activity, but a periodic one, resembling grid cells limited to a one-dimensional frequency space.
In another study on Grid Cells for Conceptual Spaces, human subjects were trained to navigate in a highly abstract two-dimensional space of bird silhouettes characterized by leg and neck lengths. Participants could independently vary the length of legs and neck with the controller and their task was to morph a bird into a particular configuration, while activity of their brains was monitored through fMRI. Remarkably, the activity of their entorhinal cortex showed signs of a hexagonal symmetry as people mentally moved in this conceptual space of birds, which is incredibly consistent with the grid cell code.
All of this suggested that the hippocampus and entorhinal cortex together construct a multi-dimensional and multi-modal representation of the world. I.e., the world model as it is known in the AI world.
Can it be that the same software is also used to solve computational tasks in non-spatial domains?
Graph Formalism
Luckily, there might be a simple and elegant mathematical formalism that connects physical and abstract spaces known as graph theory. Graphs and Graph Reasoning are areas I have been covering already in earlier posts (1,2,3).
Let's think about it like this. The elements in all the tasks we have seen so far are connected through a defined relationship.
For instance, neighboring locations in a room (also check out “Memory and Knowledge Management“) are physically connected to each other in the real world. You can move along them in horizontal and vertical directions. Although you might have fond memories of your breakfast, we usually can't move physically in temporal directions (yet).
Essentially — to quickly recap — a graph is a mathematical structure that consists of a set of points called vertices or nodes and a set of lines called edges that connect pairs of vertices. Now the vertices can represent any kind of object or entity and the edges can correspond to any kind of connection or relation between the vertices. For instance, we can construct a graph of a two-dimensional space by connecting each location node to its four neighbors in a square grid-like manner.
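The square-grid construction in that last sentence can be sketched directly. A minimal version, with `build_grid_graph` as a hypothetical helper name:

```python
def build_grid_graph(width, height):
    """Build a graph of a 2D space: each location node is connected
    to its four neighbors in a square grid-like manner."""
    graph = {}  # node (x, y) -> list of neighboring nodes
    for x in range(width):
        for y in range(height):
            neighbors = []
            for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]:
                nx, ny = x + dx, y + dy
                if 0 <= nx < width and 0 <= ny < height:
                    neighbors.append((nx, ny))
            graph[(x, y)] = neighbors
    return graph

g = build_grid_graph(3, 3)
print(len(g[(1, 1)]))  # → 4 (an interior node has four neighbors)
print(len(g[(0, 0)]))  # → 2 (a corner node has only two)
```

The same dictionary-of-neighbors structure works unchanged for non-spatial graphs; only what the nodes mean differs.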
To effectively traverse through the graph, however, you must know where you are located on this graph at every point in time. Otherwise, there is no point in having organized knowledge in this way in the first place.
To keep track of where you are, you can use what's called path integration. In the physical space, path integration refers to using self-motion cues such as movement speed and direction to accumulate movement vectors and update your position.
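Physical path integration is essentially dead reckoning: accumulate movement vectors from self-motion cues. A minimal sketch, where the function name and the (speed, heading) encoding of the cues are my assumptions:

```python
import math

def path_integrate(start, moves):
    """Accumulate self-motion cues (speed, heading in radians)
    into an updated position estimate — dead reckoning."""
    x, y = start
    for speed, heading in moves:
        x += speed * math.cos(heading)
        y += speed * math.sin(heading)
    return (x, y)

# North, west, south, east at equal length closes a loop back to the start:
pos = path_integrate(
    (0.0, 0.0),
    [(1, math.pi / 2), (1, math.pi), (1, -math.pi / 2), (1, 0)],
)
print(abs(pos[0]) < 1e-9 and abs(pos[1]) < 1e-9)  # → True
```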
For arbitrary graphs, however, you'll need a more abstract but similar notion of path integration: a finite set of rules for how to compose different types of relationships. For example, taking the route sibling → parent is equivalent to parent, and so forth. And the very same graph structure can be used for generalization, as the underlying structure of connections is fundamentally the same; what's different is the type of incoming sensory cues.
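The family-tree idea (a step to a sibling followed by a step to the parent collapses to a single parent step) can be sketched as a finite composition table. The rule set below is a simplified illustration, not a complete relational algebra:

```python
# Abstract path integration on a family-tree graph: instead of adding
# movement vectors, we compose relations via a finite lookup table.
COMPOSE = {
    ("sibling", "sibling"): "sibling",  # a sibling of my sibling is my sibling
    ("sibling", "parent"): "parent",    # my sibling's parent is my parent
    ("parent", "sibling"): "parent",    # simplification: ignores aunts/uncles
}

def integrate_path(relations):
    """Collapse a sequence of relation steps into a single relation."""
    current = relations[0]
    for rel in relations[1:]:
        current = COMPOSE.get((current, rel), "unknown")
    return current

print(integrate_path(["sibling", "parent"]))             # → parent
print(integrate_path(["sibling", "sibling", "parent"]))  # → parent
```

This plays the same role as adding displacement vectors in physical space: it keeps track of where you are on the graph without retracing every step.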
So in conclusion, if the hippocampal-entorhinal system can construct these relational graphs and carry out path integration and route finding on them, then it makes perfect sense how this system could be reused by the brain across different modalities.
But how do we know which graphs to build in the first place? How can the brain come up with such a structured representation? To understand this, we need to first address another super important concept and that is the idea of a latent space.
Latent Spaces
As the name suggests, a latent space is something that is not directly observable from external cues.
Let's go back to our car example and view the world through the eyes of this subject, which is performing an alternation task. The car is driving through Los Angeles and each time the car reaches an intersection, the car has to decide whether to turn left or right.
Now, the task is such that the reward is always alternating between the sides. So the car has to learn that if on the previous trial, it turned left and received a reward that means that now it should go to the right and vice versa.
Let's try to think about the cognitive map that must be built.
What are the relevant behavioral features and
What is the structure required to perform this particular task?
First of all, we already know there is a spatial component since you need to know where you are in this city and update your relative and absolute location. So we can reasonably expect to see conventional place cells when looking inside the hippocampus. However, information about physical location alone is not enough to solve the problem of obtaining the maximum reward.
Since the car has learned that it needs to alternate turns at every trial, it has to remember the direction of the previous turn. In other words, we need to have access to short-term memory.
To completely capture all the relevant information about this task, you need a configuration of a cognitive map that keeps track of both the location in the physical 2D space of the maps and a binary location in this abstract space of left and right trials.
So, at the beginning of the training, we can expect that the car hasn't learned the nature of this task and is just exploring Los Angeles; its cognitive map only has spatial component encodings. But over the course of several runs, the car builds up experience and learns: "Aha! I need to alternate the directions because the reward seems to be always located on the side opposite of the previous trial." When this happens, the cognitive map is expanded with a new dimension, and the mental representation of the city kind of splits into two cloned versions of itself: one for left-turn trials and another for right turns. Now, all of a sudden, you need to update your position in this expanded space, which has an additional axis.
Remarkably, we find cells whose firing is modulated by both the physical location and the direction of the future turn.
Such neurons are termed "Splitter cells". These cells uniquely encode the location in the fully expanded version of the cognitive map, which is remarkably consistent with the idea that the hippocampus keeps track of all task-relevant variables no matter how abstract they are.
This split dimension into left versus right trials is an example of a latent space since it is not directly observable from the sensory cues. There is no light switch that would signal you where to turn. Instead, you need to infer your location in the latent space based on previous observations.
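A toy sketch of this latent dimension, assuming the alternation rule described above; the "splitter cell" predicate and its preferred values are purely illustrative:

```python
def infer_trial_type(prev_rewarded_turn):
    """The latent trial type cannot be read off any single observation;
    it must be inferred from the previous trial's outcome."""
    # Alternation rule: the reward is on the side opposite the last rewarded turn.
    return "right" if prev_rewarded_turn == "left" else "left"

def splitter_cell(location, trial_type, pref_loc=(2, 3), pref_trial="left"):
    """A hypothetical splitter cell: fires only when BOTH the physical
    location and the latent trial type match its preferences."""
    return location == pref_loc and trial_type == pref_trial

print(infer_trial_type("left"))        # → right
print(splitter_cell((2, 3), "left"))   # → True
print(splitter_cell((2, 3), "right"))  # → False (same place, different clone of the map)
```

The full state the hippocampus tracks is then the pair (physical location, latent trial type), i.e. a position in the expanded, cloned map.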
Another example involves training our car to perform a run based on landmarks, known as a tower accumulation task. Essentially, as our car progresses, the model is presented with visual landmark cues indicating it needs to choose whether to turn left or right. The direction of the reward is indicated by which side had the higher number of towers. So, for example, if you encounter nine towers in total on your left and only seven on your right, you need to turn left to get the reward, since that side has the higher number of cues. This number of towers, or better yet the difference between the two sides, forms a latent evidence space, and it turns out that there are hippocampal neurons that form place fields in this latent space.
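The decision rule, plus a hypothetical neuron with a place field along the latent evidence axis rather than in physical space, can be sketched as:

```python
def choose_turn(left_towers, right_towers):
    """Decision rule for the tower accumulation task: the relevant
    latent variable is the evidence difference between the two sides."""
    evidence = left_towers - right_towers  # position along the latent evidence axis
    return "left" if evidence > 0 else "right"

def evidence_place_cell(evidence, preferred=2, width=1):
    """A hypothetical hippocampal neuron tuned not to a physical location
    but to a particular value of accumulated evidence."""
    return abs(evidence - preferred) <= width

print(choose_turn(9, 7))           # → left
print(evidence_place_cell(9 - 7))  # → True  (evidence of +2 is in its "place field")
print(evidence_place_cell(-3))     # → False
```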
As you can see, the location in latent spaces, and even their very existence in the cognitive map, is extremely important. The problem is that it is not directly observable from individual sensory cues.
Instead, latent spaces are built from sequences of sensory observations. E.g., remembering that your previous choice will affect a future one.
Why bother with building such relational graphs in the first place?
Factorized Representations
At the beginning of this post, we saw an example of how humans can generalize quickly to other contexts. We have learned now that this lateral move is possible because, once the structured representation is built, it can be abstracted away from particular sensory observations. You know that 'north-west-south-east' at equal lengths will close a loop in any environment, be it your hometown or a new city you're visiting. Such generalization effectively requires the existence of a factorized representation.
In our mammalian brain, we factorize every experience into two components:
the structural component, i.e., the position on this relational graph, and
the sensory component, i.e., the particular setting of the outside world.
The hippocampus then forms a conjunctive representation, unifying the two streams of information and embedding the particular sensory information into the structural backbone. This difference between factorized and conjunctive representations can be demonstrated by a famous phenomenon called hippocampal remapping.
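A minimal numerical sketch of factorized versus conjunctive codes, in the spirit of models like the Tolman-Eichenbaum Machine; the one-hot vectors and the outer-product binding below are illustrative assumptions, not the exact biological mechanism:

```python
import numpy as np

# Factorized codes: kept as two separate streams upstream of the hippocampus.
structural = np.array([0, 1, 0, 0])  # one-hot: position on the relational graph
sensory = np.array([0, 0, 1])        # one-hot: current sensory observation

# Conjunctive code: the hippocampus binds the two streams together,
# embedding the sensory particulars into the structural backbone.
conjunctive = np.outer(structural, sensory)

print(conjunctive.shape)           # → (4, 3)
print(int(conjunctive[1, 2]))      # → 1 (active only for this place-AND-observation pair)

# The same structural code can be re-bound to entirely new sensory input —
# that separation is what lets you drive in a city you have never seen.
new_city_sensory = np.array([1, 0, 0])
print(int(np.outer(structural, new_city_sensory)[1, 0]))  # → 1
```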
But let's leave it at that for now. What really piques my curiosity is how generalization for autonomous agents could work by copying the functionality of the hippocampus into an artificial one.
Summary
Research on mammalian brains has provided evidence, both at the behavioral and cellular levels, that our brain must have an internal model of the world known as a cognitive map. Despite the word 'map' in the name, it is not restricted to representing physical space. Rather, cognitive maps, like graphs, are a systematic way to organize knowledge into some kind of structure. The main purpose of these representations is to effectively exploit the inherent regularities and repetitions of the outside world in order to minimize computational effort and generalize knowledge.
I also talked about how such structural backbones can be viewed as organizing knowledge as a relational graph which needs to take into account latent spaces that are not directly observable but rather need to be inferred from sequences of sensory observations. The incorporation of latent spaces allows the hippocampus to keep track of abstract variables, such as the amount of sensory evidence in the tower accumulation task and the position in the space of left versus right trials in the alternation task.
Finally, I discussed why factorizing knowledge into structural and sensory components is useful to the brain since it allows it to generalize and make problems more computationally feasible. Evidence for such factorized representations can be found in the entorhinal cortex whose medial and lateral parts provide the hippocampus with two separate streams of information while the hippocampus then generates a unified, conjoined representation, embedding the structure into sensory context in order to solve particular behavioral tasks like driving a car in a new city.
I need more coffee.