Reasoning In Autonomous Driving
How interactive agents, human-in-the-loop learning, and embodied reasoning can make autonomy in mobility safer and smarter.
Reasoning models will likely never make split-second decisions in live traffic. That doesn’t mean that reasoning agents can’t be useful for autonomous driving. Most AI deployed in production today is still quite static: it can neither reason over a world model nor actually learn from new concepts or knowledge.
In this analysis, I evaluate the findings of three recent research papers:
Interactive Double Deep Q-network: Integrating Human Interventions and Evaluative Predictions in Reinforcement Learning of Autonomous Driving
Autonomous Embodied Agents: When Robotics Meets Deep Learning Reasoning (GitHub)
DriveAgent: Multi-Agent Structured Reasoning with LLM and Multimodal Sensor Fusion for Autonomous Driving (GitHub)
In the context of these developments in autonomous driving research, we have also seen several recent business deals that should unlock budgets and markets for scaling autonomous-agent businesses:
Pony.AI/Uber: a new entry among a flurry of Uber partnership announcements
Waymo/Toyota Partnership Announcement and scaling US Operations
Tesla Robotaxi: analyst opinion, and a trademark roadblock.
Listening to Ann Shi, regional BD Vice President at Pony.ai, and Dom Taylor, Regional APAC GM at Uber, at a conference in Tokyo on Friday, May 9th, I realized that not much has changed since 2015/2016, when I was working on mobility concepts for autonomous driving at Mercedes-Benz. For Uber, as a ride-share company, it seems obvious why they chase this: removing the human from the P2P logistics chain removes a significant cost and risk element. Autonomous driving has a scaling problem. It is not even remotely economically viable during the training phase of sandboxed proof-of-concept demonstrations, and the unit economics only work once tens of thousands of vehicles are deployed. Once that is achieved, though, the deployed vehicles will face road conditions and real-world situations the algorithms have not yet been trained on, increasing the likelihood of real-world consequences up to and including death.
Since that is to be avoided, what are good ways forward?
There are many things that can be improved, but I want to focus on these three problems:
How should we integrate human or agent expertise into Deep Reinforcement Learning models to enhance safety, adaptability, and interpretability?
Annotated data for supervised learning in real-life settings is costly to produce (time and money). Having a localized training data set is definitely a moat.
Simulation to reality transfer is hard. When is new information valuable enough to update the weights of the model?
Observations
Indoor navigation is easier than outdoor navigation, as all information for navigation can be assumed to be stable.
The need to learn something new depends on the episode length, i.e., the length of a trip, and also on the mapped location. If the agent never leaves its mapped location, learning need not focus on the environment itself but only on new observations that caused errors or uncertainty.
Human in the Loop as Reasoning Model
Scaling autonomous systems to new cities means confronting unfamiliar and unpredictable conditions. You can’t simulate every edge case. And once deployed, these systems must operate safely. The solution lies in enabling agents to generalize beyond their training and adapt to novel situations.
This is where the first paper, Interactive Double Deep Q-network, enters the picture.
Clipped Double Q-Learning is a technique used to reduce overestimation bias in Q-learning by maintaining two Q-value networks, Q1 and Q2, and using the minimum of their estimates for target value calculation.
It works like this:
Two independent Q-networks Q1 and Q2 are trained simultaneously.
Instead of taking the max Q-value from a single network (which tends to overestimate), the target computation uses the minimum of the two networks’ estimates for the chosen next action, as sketched after this list.
Both Q-networks are updated toward this clipped target, reducing bias and stabilizing training.
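To make this concrete, here is a minimal PyTorch-style sketch of the clipped target computation. The function and variable names are mine, and the paper’s exact variant may differ in details (for instance, in how the next action is selected):

```python
import torch

def clipped_double_q_target(reward, done, next_state,
                            policy_net, q1_target, q2_target, gamma=0.99):
    """Clipped double-Q target: the online network picks the next action,
    both target networks evaluate it, and the smaller estimate is used.
    reward and done are expected as [batch, 1] tensors."""
    with torch.no_grad():
        next_action = policy_net(next_state).argmax(dim=1, keepdim=True)
        q1 = q1_target(next_state).gather(1, next_action)
        q2 = q2_target(next_state).gather(1, next_action)
        clipped = torch.min(q1, q2)
        # One-step TD target, zeroed out at terminal states.
        return reward + gamma * (1.0 - done) * clipped
```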
In addition, the paper implements a Dueling Architecture to separate the state-value and advantage functions, and Prioritized Experience Replay to improve sample efficiency. Traditionally, Behavioral Cloning trains a policy to mimic expert demonstrations through supervised learning. It works in well-defined environments but lacks generalization when faced with unseen scenarios, i.e., the model cannot respond meaningfully. The paper builds on interactive learning methods like Deep Q-learning from Demonstrations (DQfD) and HG-DAgger to overcome these limits. These allow agents to explore while humans provide timely corrections, reducing risk and improving data quality during training.
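As an illustration of the dueling idea, here is a small generic sketch of a dueling head in PyTorch; this is the standard architecture, not code from the paper:

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Split a shared feature vector into a state-value stream V(s) and an
    advantage stream A(s, a), then recombine them into Q-values."""
    def __init__(self, feature_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.value = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(
            nn.Linear(feature_dim, hidden), nn.ReLU(), nn.Linear(hidden, num_actions))

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        v = self.value(features)        # shape: [batch, 1]
        a = self.advantage(features)    # shape: [batch, num_actions]
        # Subtract the mean advantage so V and A remain identifiable.
        return v + a - a.mean(dim=1, keepdim=True)
```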
Here is the algorithm in pseudocode
source - page 3
Interesting to note is that human/agent interventions are guided by agent uncertainty or elevated risk, and their impact is evaluated through predictive models that estimate future rewards and simulate alternative outcomes. Evaluator agents further support learning by estimating collision probabilities or identifying when new training is needed, helping the system focus on meaningful deviations.
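To illustrate how such a gate could look, here is a hedged sketch of uncertainty- and risk-triggered interventions. The thresholds, the Q-value-spread heuristic, and the function names are my own illustrative choices, not the paper’s implementation:

```python
def select_action(q_values, risk_estimate, human_policy, state,
                  uncertainty_threshold=0.05, risk_threshold=0.2):
    """Act autonomously unless uncertainty or estimated risk is high,
    in which case defer to the human (or evaluator agent) for a correction.

    q_values: per-action Q-value estimates (floats)
    risk_estimate: estimated collision probability for the greedy action
    human_policy: callable returning a corrective action for the state
    """
    # A near-tie between the two best actions is a crude uncertainty proxy.
    spread = max(q_values) - sorted(q_values)[-2] if len(q_values) > 1 else 1.0
    uncertain = spread < uncertainty_threshold
    risky = risk_estimate > risk_threshold   # evaluator agent flags danger

    if uncertain or risky:
        # The correction is logged as an expert demonstration (DQfD / HG-DAgger style).
        return human_policy(state), True     # (action, intervened)
    return max(range(len(q_values)), key=lambda i: q_values[i]), False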
This leads to a shift from monolithic policies to a modular, multi-agent setup where every role in the value chain (driver, evaluator, observer, and human) contributes to policy improvements!
source - page 6
Btw, the evaluation prediction model consists of a classifier and a predictor that together simulate what would have occurred if the human had NOT intervened. It does this by comparing the rewards of actions during the evaluation period. Overall, I think the result is a system that learns more safely, more efficiently, and with greater contextual awareness. More fundamentally, I think reliable autonomy emerges when agents can recognize uncertainty and have structured ways to improve through interaction.
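A crude way to picture that counterfactual comparison (my own simplification, not the paper’s model):

```python
def evaluate_intervention(predicted_agent_rewards, observed_human_rewards):
    """Score whether a human intervention helped, by comparing the reward
    the agent was predicted to collect on its own (the counterfactual)
    against the reward actually observed while the human was in control.

    Both arguments are per-step reward sequences over the evaluation window.
    Returns a positive number when the intervention improved the outcome."""
    counterfactual_return = sum(predicted_agent_rewards)
    observed_return = sum(observed_human_rewards)
    return observed_return - counterfactual_return


# Hypothetical usage: a large positive gap -> keep the human correction as a
# high-priority sample; a near-zero gap -> the agent was already doing fine.
gap = evaluate_intervention([0.1, -0.5, -1.0], [0.1, 0.2, 0.3])
```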
Maybe the PhD thesis Autonomous Embodied Agents can help.
Embodied Reasoning for Impact and Curiosity
As mentioned above, it is a real challenge to transfer what the agent learns during simulation to the harsh realities of the physical world. Even subtle mismatches in dynamics or sensor fidelity can cause problems. This is where Embodied AI might be a solution. The thesis proposes that an agent with an embodied world model, sensors, and actuators needs to build a representation of the world that is grounded in experience. Through its sensors, such an embodied agent receives friction and latency information, accounts for occlusion (rain, snow, fog, mud, or similar), and acts within the constraints of real physics.
Classical robotics often optimizes for control or trajectory planning. I understand Embodied AI as being about ordered reasoning in context. That means the agent is not only navigating a maze; because it knows why it is navigating, it can adapt, reallocate attention, and synthesize meaning from sensor fusion. I don’t see that happening in real time in the medium term, though.
Since this is a more general approach, the thesis assumes that transfer learning can be achieved faster and more cheaply. But we are still far away from that: reasoning doesn’t really exist for embodied AI agents yet, and not only because the field is still nascent. The thesis asks whether the lack of high-quality annotated data from real-world samples can be solved by giving the agent an intrinsic reward function, so that it can compute its own reward from its observations.
source - page 59
Or in other words
Impact: A method to reward actions that produce a significant change in the agent’s knowledge or internal representation of the environment.
Curiosity: The agent is encouraged to explore states of the environment where it can see or learn new things (a rough sketch of both signals follows this list).
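Here is a minimal sketch of how these two intrinsic signals could be computed, assuming the agent keeps an occupancy-style internal map and a learned forward model; both functions are illustrative stand-ins, not the thesis’s actual formulation:

```python
import numpy as np

def impact_reward(map_before: np.ndarray, map_after: np.ndarray) -> float:
    """Impact as the amount of change an action caused in the agent's
    internal map: here, simply the count of cells that changed."""
    return float(np.sum(map_after != map_before))

def curiosity_reward(predicted_next_embedding: np.ndarray,
                     actual_next_embedding: np.ndarray) -> float:
    """Curiosity as forward-model prediction error: the worse the agent
    predicted what it would observe next, the more novel (and rewarding) it is."""
    return float(np.linalg.norm(actual_next_embedding - predicted_next_embedding))
```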
My initial understanding of this is:
Why do we sometimes talk to ourselves?
Talking to ourselves, also called self-talk, is a common cognitive behavior that serves several purposes:
Cognitive regulation: It helps us focus attention, plan, and problem-solve by externalizing thought processes.
Emotional control: Verbalizing feelings can regulate mood and reduce anxiety.
Simulation: We may rehearse conversations or decisions by simulating different perspectives internally.
Memory aid: Speaking aloud can reinforce memory encoding and retrieval.
In cognitive models and AI agents, self-talk can be likened to internal meta-cognition or simulated reasoning, where an agent:
Monitors and updates its internal state, like tracking goals, constraints, or beliefs.
Evaluates options or plans before acting, especially under uncertainty.
Uses inner speech as a control signal, similar to humans verbalizing subgoals or decisions (e.g., “first I’ll move left, then…”).
In my opinion, while it arguably sounds silly, it might actually work, because our current generation of agents also relies on a ReAct pattern to reason through a query. Most current-gen pathfinding models optimize navigation for obvious economic reasons (the scaling problem) by rewarding high certainty. That penalizes true exploration and limits true learning. Intrinsic rewards might offer a better path forward. Maybe together with digital twins, this technique would allow agents to learn directly from sensorimotor experience without hand-crafted maps or oracle supervision.
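For reference, this is roughly what a ReAct-style loop looks like, stripped to its skeleton; the llm and tools interfaces are assumptions for illustration, not any specific library’s API:

```python
def react_loop(query, llm, tools, max_steps=5):
    """Bare-bones ReAct-style loop: the model alternates between a 'thought'
    (inner speech), an 'action' (tool call), and an 'observation', until it
    emits a final answer. llm is assumed to return a dict with 'thought',
    optional 'final_answer', and 'tool'/'tool_input' keys; tools is a dict
    of callables. Both are hypothetical interfaces."""
    transcript = f"Question: {query}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")          # model verbalizes its reasoning
        transcript += f"Thought: {step['thought']}\n"
        if step.get("final_answer"):
            return step["final_answer"]
        observation = tools[step["tool"]](step["tool_input"])
        transcript += f"Action: {step['tool']}[{step['tool_input']}]\n"
        transcript += f"Observation: {observation}\n"
    return None
```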
The method proposed in the thesis combines three major components: a CNN-based mapper, a pose estimator, and a hierarchical navigation policy.
source - page 64
Planner: A deterministic planner that uses the global goal to compute a local goal in close proximity to the agent. The deterministic planner adopts the A* algorithm to compute a feasible trajectory from the agent’s current position to the global goal using the current state of the map (a generic sketch of A* on an occupancy grid follows this list).
Mapper: The mapper generates a map of the free and occupied regions of the environment discovered during the exploration.
Pose estimator: The pose estimator is used to predict the displacement of the agent as a consequence of an action.
Impact reward: Encourages actions that modify the agent’s internal representation of the environment, with the impact measured at defined timesteps.
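Since the planner relies on A*, here is a generic sketch of A* on an occupancy grid; this is the textbook algorithm rather than the thesis’s code:

```python
import heapq

def astar(grid, start, goal):
    """A* on an occupancy grid (0 = free, 1 = occupied) with 4-connected
    moves and a Manhattan-distance heuristic; returns a list of cells or None."""
    def h(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(h(start), 0, start, None)]   # (f, g, cell, parent)
    came_from, g_cost = {}, {start: 0}
    while open_heap:
        _, g, current, parent = heapq.heappop(open_heap)
        if current in came_from:               # already expanded with a better cost
            continue
        came_from[current] = parent
        if current == goal:                    # reconstruct the path backwards
            path = []
            while current is not None:
                path.append(current)
                current = came_from[current]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (current[0] + dr, current[1] + dc)
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and grid[nxt[0]][nxt[1]] == 0
                    and g + 1 < g_cost.get(nxt, float("inf"))):
                g_cost[nxt] = g + 1
                heapq.heappush(open_heap, (g + 1 + h(nxt), g + 1, nxt, current))
    return None
```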
Which leads to these results.
Ignoring the values, doesn’t it make you wonder if our emotions are actually important elements that guide our cognition? It might be a bit silly, but what if we can map emotions to a value-adding function of an agent?
In general, I think the thesis brings out some interesting concepts.
DriveAgent
The third paper approaches the same problem by proposing “DriveAgent”, a modular autonomous driving framework that integrates large language model (LLM) reasoning with multimodal sensor fusion.
source - page 3
As you can see, DriveAgent is structured around four task-specific modules:
Descriptive Analysis, which selects critical timestamps from sensor data using LLM-based motion analysis;
Vehicle Reasoning, where separate LiDAR and vision agents interpret vehicle status and are synthesized by an analyzer to detect anomalies;
Environmental Reasoning, which detects environmental changes and explains their causes; and
Response Generation, which merges insights into actionable driving decisions. DriveAgent handles heterogeneous sensor inputs (camera, LiDAR, GPS, and IMU) and structures reasoning using specialized agents coordinated by the LLM to deliver interpretable, timely responses in complex scenarios. A toy sketch of this pipeline follows below.
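To make the structure tangible, here is a toy orchestration of those four stages. The agent names, prompts, and interfaces are mine; the actual framework coordinates dedicated agents per modality rather than a single prompt chain:

```python
from dataclasses import dataclass
from typing import Any, Callable, List

@dataclass
class SensorFrame:
    """One synchronized bundle of multimodal inputs (fields are illustrative)."""
    camera: Any
    lidar: Any
    gps: Any
    imu: Any
    timestamp: float

def drive_pipeline(frames: List[SensorFrame], llm: Callable[[str], str]) -> str:
    """Toy orchestration of the four stages described above; each stage is a
    single LLM call here purely for illustration."""
    # 1. Descriptive analysis: pick the timestamps that matter.
    critical = llm(f"Select critical timestamps from motion data: {[f.timestamp for f in frames]}")
    # 2. Vehicle reasoning: LiDAR and vision agents report, an analyzer reconciles them.
    lidar_report = llm(f"Interpret LiDAR data at {critical}")
    vision_report = llm(f"Interpret camera data at {critical}")
    vehicle_state = llm(f"Reconcile and flag anomalies: {lidar_report} | {vision_report}")
    # 3. Environmental reasoning: what changed around the vehicle, and why.
    environment = llm(f"Explain environmental changes near {critical}")
    # 4. Response generation: merge everything into one actionable decision.
    return llm(f"Decide the driving action given {vehicle_state} and {environment}")
```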
And I think the reasoning components specifically are interesting.
Vehicle Reasoning
Vehicle reasoning operates through dedicated LiDAR and vision agents that interpret sensor data independently.
An aggregator agent compares outputs from both sensors to detect inconsistencies or anomalies in vehicle behavior, such as misalignment or irregular motion.
This layered reasoning ensures robust diagnosis even when one sensor modality is degraded or ambiguous.
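A deliberately simplified example of such a cross-modal consistency check; the field names and tolerances are invented for illustration, and the real aggregator reasons over much richer reports:

```python
def flag_sensor_disagreement(lidar_estimate: dict, vision_estimate: dict,
                             distance_tol_m: float = 1.5,
                             speed_tol_mps: float = 2.0) -> bool:
    """Flag an anomaly when the LiDAR and vision agents disagree about the
    lead vehicle by more than a tolerance (hypothetical dict fields)."""
    distance_gap = abs(lidar_estimate["lead_distance_m"] - vision_estimate["lead_distance_m"])
    speed_gap = abs(lidar_estimate["lead_speed_mps"] - vision_estimate["lead_speed_mps"])
    return distance_gap > distance_tol_m or speed_gap > speed_tol_mps
```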
Environmental Reasoning
Environmental reasoning uses sensor data to detect and analyze changes in surroundings across timestamps.
It employs a causal analysis agent to explain why changes occurred and flags environmental factors (e.g., obstacles, dynamic hazards) requiring heightened attention.
These contextual insights enrich decision-making, especially in dynamic urban environments.
Here are some results compared to human instruction in scene understanding; human-sourced instruction is much more expensive to produce.
source - page 7
I think DriveAgent also walks in a similar direction, using reasoning to make sense of what the agent is observing.
In conclusion
Maybe in the past you have wondered why I include “sensor” frameworks in my agent definition; maybe this post makes it clearer. In general, I think the industry went in the wrong direction by choosing urban peer-to-peer autonomous mobility first. With the work Pony.ai is doing, we might get technically closer to that goal, but I still believe the bigger value proposition is rural truck deliveries.
When we think about autonomous agents, we should build a world where this technology improves the quality of life.
What do you think?