Searchformer: Transformers Outperform A* Algorithm in Complex Planning Tasks
When will we hear more about Q*? Sam, Ilya... anyone?
As discussed here previously, I too have noticed that Transformer-based models are not that great at reasoning and planning. Reasoning and planning are core capabilities that autonomous agents need in order to solve tasks and reach defined goals effectively.
Planning as a capability has important applications in mobility/logistics, traversing network topologies, robotics, and gaming, all industries where even minor optimizations can have multi-billion-dollar impacts. A research team at Meta recently published a paper that introduces a new method for training Transformers to solve complex planning tasks.
The result is Searchformer, a model that predicts the search dynamics of a planning algorithm called A*. In this context, search dynamics are the sequences of actions and states that A* explores to find the optimal solution. By predicting these dynamics, Searchformer has been shown to find the optimal path in fewer search steps.
Goal: Can a Transformer model find optimal paths more efficiently than traditional methods like the A* algorithm?
Problem: Despite the success of Transformer-based architectures in LLMs, they still struggle with planning and reasoning tasks. Specifically, they perform poorly at multi-step planning, complex decision-making, and high-order reasoning. Traditional methods, by contrast, offer formal guarantees for their solutions based on pre-defined, rule-based search procedures, guarantees that Transformers lack and that matter for the governance of such models.
Solution: Searchformer is an encoder-decoder Transformer model trained to predict the search dynamics of A*. The team uses search dynamics bootstrapping, a method that first trains a Transformer model to imitate the search procedure of A* and then fine-tunes the model to find an optimal plan in fewer search steps. The model is trained on synthetically generated execution traces of deterministic and non-deterministic variants of the A* algorithm.
Opinion: I selected this paper for review because the problem area is relevant to my own research, and the solution's use of gaming data (maze running, Sokoban) struck me as novel.
According to the research team's claims, their Transformer model optimally solves previously unseen Sokoban puzzles 93.7% of the time while using up to 26.8% fewer search steps than standard A* search. Their solution also robustly follows the execution trace of a symbolic planner and improves (in terms of trace length) beyond the human-crafted, rule-based planning strategy it was initially trained on.
Initially, I found the idea quite novel: tokenize the path-finding algorithm for two tasks, maze running and Sokoban (a game where crates have to be pushed to goal positions), and then train a model on the execution traces of these algorithms. In the end, however, this approach doesn't scale well, especially as maze complexity grows exponentially.
Links: Paper, Github, Score 5/10
Let’s dive in
As mentioned, the goal of this Substack remains building effective autonomous agents, so high-order reasoning is high on the list of problems to solve.
High-order reasoning
This refers to the cognitive ability to analyze, synthesize, and evaluate information at a complex and abstract level. It involves critical thinking, problem-solving, and decision-making beyond simple recall or basic understanding. In high-order reasoning, individuals engage in deeper levels of cognitive processing, drawing connections between concepts, recognizing patterns, and making inferences.
For example, in a business context, high-order reasoning can involve analyzing market trends, evaluating potential risks, and developing strategic plans based on a nuanced understanding of various factors, like the investment bot I am still working on. Reasoning is a crucial skill for navigating complex situations and addressing challenges that require more than surface-level thinking. It includes developing a plan toward a goal, which can in many cases be represented as a traversal through a graph.
Image source: https://www.mdpi.com/2224-2708/11/4/78
The A* algorithm
This algorithm is a popular choice for pathfinding and graph traversal. A* is particularly effective at finding the shortest path in a graph from a starting point to a goal where each edge has an associated cost (like food delivery in a city).
A* combines elements of two other algorithms: Dijkstra's algorithm and Greedy Best-First Search. It uses a heuristic to estimate the cost of reaching the goal from the current node, along with the actual cost incurred from the start node to the current node. A good heuristic function is a key component of the A* algorithm: it provides an estimate of the remaining cost to reach the goal from a given node and ensures that the algorithm explores the most promising paths first, making it more efficient than pure Dijkstra's algorithm.
Here is a rough overview of how A* operates; a minimal code sketch follows the steps.
Initialization: Start with the initial node (start point) and add it to the open list. Set the cost from the start node to itself as 0.
Evaluate Nodes: While the open list is not empty, select the node with the lowest total cost (sum of the actual cost from the start node and the heuristic estimate to the goal) and move it to the closed list.
Goal Check: If the selected node is the goal, the algorithm terminates, and the path is reconstructed.
Expand Nodes: Expand the selected node by considering its neighbors. Calculate the total cost for each neighbor by summing the cost to reach the current node and the estimated cost from the neighbor to the goal. Add each neighbor to the open list if it is not already in the closed list; if it is already in the open list, keep the entry with the lower cost.
Loop (GOTO 2, for all the fans of old-school BASIC): Repeat steps 2-4 until the goal is reached or the open list is empty.
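To make these steps concrete, here is a minimal sketch of A* on a 4-connected grid with unit edge costs and a Manhattan-distance heuristic. The grid setup (`walls`, `width`, `height`) is my own illustration, not the paper's setup.

```python
import heapq

def manhattan(a, b):
    # Heuristic: estimated remaining cost from node a to the goal b.
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def a_star(start, goal, walls, width, height):
    """Minimal A* on a 4-connected grid with unit edge costs."""
    open_list = [(manhattan(start, goal), 0, start)]  # entries are (f = g + h, g, node)
    came_from = {}
    g_cost = {start: 0}
    closed = set()
    while open_list:                                   # step 5: loop
        _, g, node = heapq.heappop(open_list)          # step 2: lowest total cost
        if node in closed:
            continue                                   # stale duplicate entry
        if node == goal:                               # step 3: goal check
            path = [node]
            while node in came_from:
                node = came_from[node]
                path.append(node)
            return path[::-1]
        closed.add(node)
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):  # step 4: expand
            if nxt in walls or not (0 <= nxt[0] < width and 0 <= nxt[1] < height):
                continue
            if nxt not in closed and g + 1 < g_cost.get(nxt, float("inf")):
                g_cost[nxt] = g + 1
                came_from[nxt] = node
                heapq.heappush(open_list, (g + 1 + manhattan(nxt, goal), g + 1, nxt))
    return None  # open list exhausted: no path exists

print(a_star((0, 0), (3, 3), walls={(1, 1), (2, 1)}, width=4, height=4))
```

Swapping in a different heuristic only means replacing `manhattan`; as long as it never overestimates the remaining cost, the returned path stays optimal.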
Execution traces
While A* operates, it generates execution traces, i.e., step-by-step records of the algorithm's actions during the process of finding a path from a start to a goal.
These traces illustrate how the algorithm expands nodes, computes costs, and makes decisions. Consider a grid with nodes representing different positions, and the goal is to find the shortest path from the start node (S) to the goal node (G). Numbers on each node represent the heuristic values (H) indicating the estimated cost to reach the goal.
The components of each state can be expressed as tokens, and the path the algorithm actually took as a token sequence. All token sequences are synthetically generated and use a special vocabulary. These sequences are used to train models to predict only sequences that outline optimal plans for a set of different planning tasks. After training, the models' outputs are parsed and evaluated to check whether they contain a correct and optimal plan.
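To make that concrete, here is a toy illustration of what logging search dynamics as tokens could look like. The vocabulary below is my own simplified stand-in; the paper defines its own token set, which may differ.

```python
# Illustrative only: a simplified token vocabulary for logging A* search
# dynamics as a flat token sequence (not the paper's exact scheme).
trace = []

def log_event(kind, node, g, h):
    # kind is "create" (node added to the open list) or "close" (node expanded)
    trace.extend([kind, str(node[0]), str(node[1]), f"c{g}", f"h{h}"])

# Calling log_event at the matching points inside the A* loop above would
# yield a sequence like:
#   create 0 0 c0 h6   close 0 0 c0 h6   create 1 0 c1 h5 ...
# followed by tokens spelling out the final plan:
#   plan 0 0   plan 1 0   ...
```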
These execution traces can be symbolically represented to be used in symbolic planning.
What are Symbolic Planners?
With Transformer models, the traditional approaches to path planning are Chain of Thought and Tree of Thought, i.e., applying a branching strategy and critiquing it to generate different thought paths before picking the best one. However, neither of these seems to work well.
Symbolic planners, in the context of the A* algorithm, are systems that represent state through symbols in order to plan and make decisions. Common symbolic representations used in planning are the world state, the actions, and the goal conditions, with the ultimate goal of finding a sequence of actions that transforms the initial state into a goal state.
Probably the most well-known symbolic planning framework is the Stanford Research Institute Problem Solver (STRIPS).
A STRIPS instance is composed of:
An initial state;
The specification of the goal states – situations that the planner is trying to reach;
A set of actions. For each action, the following are included:
preconditions (what must be established before the action is performed);
postconditions (what is established after the action is performed).
A plan for such a planning instance is a sequence of operators that can be executed from the initial state and that leads to the desired goal state.
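As a minimal sketch, here is how such an instance could be written down in Python, with postconditions split into the add and delete lists STRIPS traditionally uses. The scenario (a robot pushing a box between rooms) is hypothetical; real planners use a dedicated language such as PDDL.

```python
# A minimal, hypothetical STRIPS-style instance in plain Python dictionaries.
instance = {
    "initial_state": {"at(robot, roomA)", "in(box, roomA)"},
    "goal": {"in(box, roomB)"},
    "actions": [
        {
            "name": "push(box, roomA, roomB)",
            "preconditions": {"at(robot, roomA)", "in(box, roomA)"},
            "add": {"at(robot, roomB)", "in(box, roomB)"},
            "delete": {"at(robot, roomA)", "in(box, roomA)"},
        },
    ],
}

def applicable(state, action):
    # An action can fire when all its preconditions hold in the current state.
    return action["preconditions"] <= state

def apply_action(state, action):
    # STRIPS semantics: remove the delete list, then add the add list.
    return (state - action["delete"]) | action["add"]

state = instance["initial_state"]
for action in instance["actions"]:
    if applicable(state, action):
        state = apply_action(state, action)
print(instance["goal"] <= state)  # True: the one-step plan reaches the goal
```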
Model development
First, the research team developed a Transformer model tasked with imitating A* search (a sketch of the resulting training-example format follows these steps):
Run A* on a problem
Log the performed computations and the optimal plan as a tokenized sequence of words.
This yields a training dataset containing execution traces of A* together with encoded information about the search dynamics of A* itself.
The Transformer model is then trained to generate token sequences along with an optimal plan for any given planning task.
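Under my reading of the paper, a single training example would pair a tokenized task description with the logged search trace and plan, roughly as follows; the prompt layout and the delimiter tokens are assumptions on my part.

```python
# Hypothetical layout of one <task, trace + plan> training example.
prompt = ["start", "0", "0", "goal", "3", "3", "wall", "1", "1", "wall", "2", "1"]
trace_and_plan = ["create", "0", "0", "c0", "h6", "close", "0", "0", "c0", "h6",
                  "plan", "0", "0", "plan", "1", "0"]  # truncated for brevity
example = prompt + ["bos"] + trace_and_plan + ["eos"]  # delimiter tokens assumed
```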
Then, the research team applied expert iteration to this dataset. Expert iteration results in a neural planning algorithm that is implicitly encoded in the Transformer's network weights.
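For intuition, here is a rough sketch of how I read the search dynamics bootstrapping loop: sample several traces per task, keep the shortest ones that still end in an optimal plan, and fine-tune on that filtered set. The function names (`sample_trace`, `is_optimal`, `fine_tune`) are placeholders, not the authors' API.

```python
# Sketch of search dynamics bootstrapping (my reading; names are placeholders).
def bootstrap(model, tasks, n_samples=32, rounds=3):
    for _ in range(rounds):
        improved = []
        for task in tasks:
            # Sample several candidate traces per task from the current model.
            candidates = [model.sample_trace(task) for _ in range(n_samples)]
            # Keep only candidates whose final plan is correct and optimal...
            optimal = [c for c in candidates if is_optimal(task, c.plan)]
            if optimal:
                # ...and of those, the one with the shortest search trace.
                improved.append(min(optimal, key=lambda c: len(c.trace)))
        # Fine-tune on the shorter-but-still-optimal traces and iterate.
        model = fine_tune(model, improved)
    return model
```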
Based on the validation results, this model finds an optimal plan with high probability in fewer search steps than standard A* search. Specifically, on Sokoban puzzles (which I found far more interesting than maze running), the model solved 93.7% of all test tasks while performing on average 26.8% fewer search steps than A* search.
As usual, I don't comment on results until I have implemented them myself.
In conclusion
I didn't particularly enjoy the maze-navigation solution, mainly because I am looking for real-world application problems, and maze navigation grows exponentially in complexity. It is hard for humans as well, and part of the fun of mazes is getting lost in them.
Of course, one could argue that maze navigation is similar to finding the optimal path for autonomous driving, last-mile delivery, or similar logistics problems, but at the end of the day, in those scenarios all information is usually completely available.
The Sokoban exercise, on the other hand, I found more fun.
In the end, the paper is understandable and tries to solve a real-world problem. However, I am not sure the approach of tokenizing intermediate algorithm states makes much sense; intuitively, it feels a bit over-engineered. On the other hand, how else would you translate the logic of world states into an encoder-decoder setup?
If you have a ground-breaking idea you want to discuss let me know in the chat.
Hope you liked this review. Please like, share, subscribe.
Sokoban code after paywall